Wednesday, May 08, 2013

Finding the closest pair in datat using PROC MODECLUS

 
 
UPDATE: Rick Wicklin kindly shared his visualization efforts on the output to put a more straightforward sense on the results. Thanks. Here is the code, run after my code below. Note that this is designed for K=2.
 

proc iml;
use out;      read all var {ID x y}; close out;
use neighbor; read all var {Nbor}; close neighbor;
NBor = num(Nbor);         /* convert to numeric */
xEnd = x[Nbor];
yEnd = y[Nbor];
create Viz var {ID x y xEnd yEnd}; append; close Viz;
quit;
proc sgplot data=Viz noautolegend;
        scatter y=y x=x / datalabel=id;
        vector x=xEnd y=yEnd / xorigin=x yOrigin=y;
run;

***************************************************************************



More often than not, the analyst wants to find the pair of observations that are the closest in terms of certain metrics, such as Euclidean distance. For example, on the popular SAS-L archive group, Randall Powers posted such a question at here, but he was using PROC DISTANCE.

His idea is straightforward: First calculate the pairwise distance among all observations, and Second, find the one that has the shortest distance. For the first step, he was trying to use PROC DISTANCE, then then he planned to use DATA STEP to search through. This is not workable in even the modest data size, say 100K observations, because PROC DISTANCE will generate a dense double precision 100K-by-100K matrix and will blow up your workstation. That is the reason Randall turns to SAS-L for help.

Well, SAS actually has an old PROC that can readily solve his problem. Of course, before a solution can be crafted, there are several issues that have to be clarified first.

First, what if some observations share the same closest observations, do you allow shared case or you need a mutually exclusive solution that each observation only has its unique non-shared counterpart as the closest point, then how would you determine which one should take the shared observation? First come first serve or any other rule?

Second, when we talk about pairs, we know N should be even number, but what if the data we have has an odd number for N? How to deal with the left over?

That being said, let's consider the simplest and most tolerate case, that is each observation is allowed to serve as the closest point to multiple other observations and N is an even number.

The solution uses PROC MODECLUS. This is an old procedure for density based clustering analysis, but its build-in nearest neighbor search algorithm and output capability make it the perfect candidate for this type of job. We use the example data in the "Getting Started" section.

The ODS OUTPUT statement directly outputs the list of nearest neighbors (closest points) for each observation and your have to specify either ALL or NEIGHBOR option in the PROC MODECLUS statement in order to use this functionality. In the same statement, we also specify K=2. K=2 means 2 nearest neighbor but since this applies for all observations, for each one, the nearest neighbor is K-1=1. So if you specify K=3, you actually ask this procedure find 2 closest point for each 1 observation at hand.

The good thing about the nearest neighbor list data set is that it also contains the calculated distance, therefore, in cases you need to deal with more complex situation as listed above, such as non-shared nearest neighbor, etc, you need some room for post processing. Here you can specify K=3 and PROC MODECLUS will output 2 neighbors for each observation, like below.


You can see that observation 3 can be matched to either observation 1 or 2, but with 2, it yields the shortest distance, and if this is the rule you are going to apply, your post process can work on this data set to implement the rule.

Please feel free to verify the result. If you find error, let me know or post in the comment.

 
data example;
      input x y @@;
   ID=_n_;
      datalines;
   
   18 18  20 22  21 20  12 23  17 12  23 25  25 20  16 27
   20 13  28 22  80 20  75 19  77 23  81 26  55 21  64 24
   72 26  70 35  75 30  78 42  18 52  27 57  41 61  48 64
   59 72  69 72  80 80  31 53  51 69  72 81
   ;
run;

ods select none;
ods output neighbor=neighbor; /* ODS output dataset */
proc modeclus data=example method=0  k=2    /*Find a pair of nearest neighbor*/
                 out=out  
                 all  /* required option in order to output Nearest Neighbors*/;
     var x y;
  id  id;
run;
ods select all;


No comments: