Wednesday, May 05, 2010

K-Nearest Neighbor in SAS

K-Nearest-Neighbor, aka KNN, is a widely used data mining tool and is often called a memory-based (or case-based, or instance-based) method because no model is fit. A good introduction to KNN can be found in [1], or on Wikipedia.

Typically, a KNN implementation relies on a clever data structure called the kd-tree [2] to quickly find the closest points to a given point in high-dimensional space. While good pseudo-code for kd-tree construction and KNN search can be found online everywhere, for example [3], it is not trivial to implement an efficient one yourself in SAS. For most analysts, the first idea is to turn to your $200K-Initial-Fee-AND-$60K-Per-Year Enterprise Miner(R) for this method; however, it turns out that PROC DISCRIM in SAS/STAT is able to do the same thing! Note that the annual license fee for SAS/STAT is a tiny fraction of EM's, and most analysts already have ready access to SAS/STAT.

Simply ask PROC DISCRIM to use the nonparametric method via the options "METHOD=NPAR K=". Do not specify the "R=" option at the same time; it corresponds to the radius-based nearest-neighbor method instead. Also pay attention to how PROC DISCRIM treats categorical data automatically. Sometimes you may want to convert categorical data into metric coordinates in advance.
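For example, here is a minimal sketch of dummy-coding a categorical predictor by hand before handing it to PROC DISCRIM; the data set train, the variable region, and its levels are hypothetical:

data train2;
   set train;
   /* dummy-code a hypothetical 3-level character variable region; */
   /* the omitted level acts as the reference and needs no flag    */
   region_east = (region = 'EAST');
   region_west = (region = 'WEST');
run;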

Because KNN is a memory-based method, when you score test data or new data in production, the raw training table must be available to the scoring process. Test different values of K to select the best one; you need a macro loop for that, as sketched below.
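A minimal sketch of such a loop, using a simple split-sample validation rather than full k-fold cross-validation: it assumes the training/holdout split is flagged by a variable named selected, as in the Iris sample code further down, and the macro name tunek and the data set allerr are made up (delete work.allerr before re-running):

%macro tunek(dsn=, klist=3 5 7 9 11);
   %local i k;
   %let i = 1;
   %do %while(%scan(&klist, &i) ne );
      %let k = %scan(&klist, &i);

      /* fit KNN on the training half, score the holdout half */
      proc discrim data=&dsn(where=(selected=1))
                   test=&dsn(where=(selected=0))
                   testout=scored
                   method=NPAR k=&k noprint;
         class Species;
         var SepalLength SepalWidth PetalLength PetalWidth;
      run;

      /* holdout misclassification rate for this value of K */
      data err;
         set scored end=last;
         n + 1;                          /* sum statements are retained */
         nerr + (Species ne _INTO_);
         if last then do;
            k = &k;
            errrate = nerr / n;
            output;
         end;
         keep k errrate;
      run;

      proc append base=allerr data=err force;
      run;

      %let i = %eval(&i + 1);
   %end;
%mend tunek;

%tunek(dsn=iris2)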

PROC DISCRIM uses memory approximately proportional to the square of the number of variables, and its time usage, excluding I/O time, is roughly proportional to log(N)*(N*P), where N is the number of observations and P is the number of variables, since it uses the tree search algorithm in [4].

Therefore, time consumption is still a concern on large data sets. As an experiment, I ran KNN on data sets ranging from 5,000 to 50,000 records, scoring the same number of records each time, with 20 features and k=11. We observe:

  obs     time (sec)    memory (KB)
 5000           1.03           1068
10000           3.90           1949
15000           8.65           2830
20000          15.48           3709
25000          34.58           4587
30000          70.26           5472
35000         109.83           6350
40000         161.71           7229
45000         208.80           8108
50000         263.58           8987

Empirical time usage grows roughly quadratically with the number of observations, which means that to work on a data set with 300K observations and score the same number of records, SAS would need about 3.75 hours! A large K value, though, increases the time used only by a very small fraction.
Sample code using Iris data from SAS Online Doc for v9.1.3:


ods select none;
/* split the Iris data 50/50 into training (selected=1) and holdout (selected=0) */
proc surveyselect data=iris  out=iris2
                  samprate=0.5  method=srs  outall;
run;
ods select all;

%let k=5;
/* nonparametric discriminant analysis with METHOD=NPAR K= is KNN classification */
proc discrim data=iris2(where=(selected=1))
             test=iris2(where=(selected=0))
             testout=iris2testout
             method=NPAR k=&k
             listerr crosslisterr;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
      title2 'Using KNN on Iris Data';
run;

/* confusion matrix: true species vs. predicted class (_INTO_) */
proc freq data=iris2testout;
     table Species*_INTO_;
run;

References:
[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, "The Elements of Statistical Learning", 2nd ed., Springer, 2008.
[2] J. L. Bentley, "Multidimensional binary search trees used for associative searching", Communications of the ACM, 18(9):509-517, 1975.
[3] http://simsearch.yury.name/tutorial.html
[4] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time", ACM Transactions on Mathematical Software, 3(3):209-226, 1977.


Next Project: Regularized Logistic Regression

L1 Regularized Logistic Regression effectively handles a large number of predictors and performs variable selection at the same time. [1] shows that L1 RLR can be implemented via the IRLS-LARS algorithm, in which each IRLS iteration solves an L1-constrained weighted least squares problem, so you can tweak PROC GLMSELECT in v9.2 for this; a sketch of one iteration follows.
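A minimal sketch of one such IRLS step, assuming a hypothetical data set train with a binary 0/1 target y and predictors x1-x20, starting from beta = 0 (so every fitted probability is 0.5 and the linear predictor is 0); iterating to convergence means refreshing p, w, and z from the currently selected model and re-running:

/* build the weighted least squares problem for the first IRLS step */
data irls1;
   set train;              /* hypothetical input data set           */
   p = 0.5;                /* fitted probability at beta = 0        */
   w = p * (1 - p);        /* IRLS weight                           */
   z = (y - p) / w;        /* working response (linear predictor 0) */
run;

/* solve the L1-constrained weighted least squares step via LASSO */
proc glmselect data=irls1;
   weight w;
   model z = x1-x20 / selection=lasso(stop=none choose=sbc);
run;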

L2 Regularized Logistic Regression, i.e., ridge logistic regression, can be used to approximate SVM solutions [2], and can be implemented via the TR-IRLS algorithm suggested in [3]; a sketch of the ridge-IRLS core follows.
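Since PROC LOGISTIC offers no ridge penalty, here is a minimal PROC IML sketch of ridge-penalized IRLS with a fixed ridge parameter (plain IRLS, not the full trust-region machinery of [3]); the data set name mydata and the variable names are placeholders:

proc iml;
   /* ridge-penalized IRLS for logistic regression: a sketch */
   use mydata;                          /* placeholder input data set  */
   read all var {x1 x2 x3} into X;      /* placeholder predictor names */
   read all var {y} into y;             /* binary 0/1 target           */
   close mydata;

   n = nrow(X);
   X = j(n, 1, 1) || X;                 /* prepend an intercept column */
   p = ncol(X);

   lambda = 1;                          /* ridge parameter, to be tuned */
   pen = lambda # I(p);
   pen[1, 1] = 0;                       /* do not penalize the intercept */

   beta = j(p, 1, 0);                   /* start from beta = 0          */
   do iter = 1 to 50;
      eta = X * beta;
      mu = 1 / (1 + exp(-eta));         /* fitted probabilities         */
      w = (mu # (1 - mu)) <> 1e-6;      /* IRLS weights, floored at 0   */
      z = eta + (y - mu) / w;           /* working response             */
      /* weighted ridge normal equations: (X'WX + pen) b = X'Wz        */
      beta_new = solve(X` * (w # X) + pen, X` * (w # z));
      if max(abs(beta_new - beta)) < 1e-8 then leave;
      beta = beta_new;
   end;
   print beta;
quit;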

References:
[1] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng, "Efficient L1 Regularized Logistic Regression", Proceedings of the National Conference on Artificial Intelligence (AAAI), 2006.
[2] Jian Zhang, Rong Jin, Yiming Yang, and Alex G. Hauptmann, "Modified Logistic Regression: An Approximation to SVM and Its Applications in Large-Scale Text Categorization", Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington, DC, 2003.
[3] Paul Komarek, "Logistic Regression for Data Mining and High-Dimensional Classification", Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, 2004.