My paper "%HPGLIMMIX: A High-Performance SAS Macro for GLMM Estimation" is now available on the Journal of Statistical Software website.

The SAS macro and accompanying code can also be found there. If you use it, please send me an email so that I know the work is appreciated and is helping more people.

===========

Update:

The paper and the macro have been well received: 400+ downloads of the paper and 90+ downloads of the macro in the first month online, and I have received a lot of good feedback. Several key points to make:

1. For a binary dependent variable, you have to code your data as numeric {0, 1}; the macro will model the probability that the dependent variable takes the value 1. If you have other requirements, you have to process the data before calling the macro.
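As an illustration, recoding a character response before the macro call might look like the following (a minimal sketch; the data set and variable names `raw`, `resp`, and `y01` are hypothetical):

```
/* Sketch: recode a character response into the numeric {0, 1} coding */
/* the macro expects; y01 is 1 when resp='Y' and 0 otherwise.         */
data model_ready;
   set raw;                /* hypothetical input data set */
   y01 = (resp = 'Y');     /* boolean expression evaluates to 1 or 0 */
run;
```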

2. This macro relies on PROC HPMIXED, which generates two residual parameters by default. Therefore, for conditional distributions without a scale parameter, such as the Poisson, users need to hold the last residual variance at 1 using the PARMS / HOLD= statement. Please see the explanation of the second difference between this macro and PROC GLIMMIX on page 9, as well as Example 2.
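To make point 2 concrete, the PARMS statement with the HOLD= option looks roughly like this in PROC HPMIXED syntax (a minimal sketch; the initial values and the parameter position are assumptions and depend on the model's covariance-parameter ordering):

```
/* Sketch: fix the last residual variance at 1 for a Poisson conditional    */
/* distribution. PARMS lists initial values in covariance-parameter order;  */
/* HOLD= fixes the listed positions. Here the 2nd (last) parameter is held. */
parms (0.5) (1) / hold=2;
```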

# SAS Programming for Data Mining

© Copyright 2006-2013 / SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

## Tuesday, July 01, 2014

## Friday, December 06, 2013

### Market trend in advanced analytics for SAS, R and Python

Disclaimer: This study is a view of the market trend in demand for advanced analytics software and its adoption, from the job market perspective, and should not be read as a conclusive statement on all that is happening. The findings should be combined with other third-party research, such as the IDC report http://www.idc.com/getdoc.jsp?containerId=241689, to reach a balanced and comprehensive picture.

/* --------------------------------------------------------------------------------------------------------------------*/

The debate over the competition between SAS, R, and Python for advanced analytics, statistics, or data science in general never ends. The question boils down to how long SAS, as a software package and programming language for such purposes, can still be used by business en masse, given its current pricing structure.

Some recent discussions can be found at:

http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1312a&L=sas-l&F=&S=&P=1171

http://r4stats.com/2013/03/19/r-2012-growth-exceeds-sas-all-time-total/

http://r4stats.com/articles/popularity/

and more...

indeed.com has been used to address such questions before by Robert Muenchen @ http://r4stats.com/articles/popularity/. His analysis of the popularity of SAS and R in the job market using indeed.com was simple: the terms he used were just "R SAS" and "SAS !SATA !storage !firmware".

In my analysis using indeed.com's trend tool, I directly search combinations of the languages R, SAS, and Python with analytics-related job terms, such as "R AND Analytics" or "R AND Regression". The goal is to understand the market dynamics of adopting each of the three languages in advanced business analytics. I try to be as fair as possible in my analysis.

The job-description terms that I am going to combine with the language names fall into three categories:

1. Techniques typically used; here I use "Regression", which almost all advanced business analytics jobs will look for;

2. Industry; here I use "pharmaceutical", and I use "JPMorgan Chase" to represent financial services (even though it is not fully representative);

3. General terms such as "data mining", "machine learning", and "statistics". In general, data mining has a different implication from machine learning: the former is a more general term, or a modern term for "analytics", while the latter has gained momentum only recently and is largely focused on computer-science fields, with a harder emphasis on algorithm development, thus favoring Python over the other two.

The graph below shows indeed.com job trends using the search term "R and regression, SAS and regression, Python and regression". It tells three immediate pieces of information:

a.) Python has been picking up very fast since 2009;

b.) The market share gap between SAS and R/Python has been shrinking consistently since 2006, and in the foreseeable future SAS's lead is going to disappear.

c.) SAS's market share peaked in 2011 but has kept dropping thereafter, while Python maintains a steady trend and R is picking up over the same period.

The graph below searches the trend using the term "R and pharmaceutical", etc. I wanted to see how the three languages are being adopted in the pharmaceutical industry. We found three points:

a.) R and SAS show up at almost the same time in pharmaceutical jobs;

b.) The relative job demand for analytics in the pharmaceutical industry is declining.

c.) Python is on the rise even though still trailing far behind SAS and R.

When using "JPMorgan Chase" combined with the languages for trend analysis, I specifically excluded "storage" from the SAS search term, because the bank may hire IT folks and the results would be mixed with SAS storage systems. The term is ' R and "jpmorgan chase", SAS not Storage and "jpmorgan chase", Python and "jpmorgan chase" '. There are several interesting observations:

a.) Python has been picking up job share at JPMorgan Chase since 2010, while R is losing share;

b.) There is noticeable seasonality in hiring, mostly in summer time;

c.) Before 2008, there was strong demand for analysts using SAS; then the crisis came and hiring stalled for almost 2 years. Beginning in 2011, business got better and the demand for analysts using SAS picked up year-round, attaining a peak this year.

In the last part of the trend analysis, we combine the languages with more general terms such as "analytics" and "data mining" to see how the business world as a whole perceives these three analytical languages.

We first combine with "data mining". Three points need to be noted:

1. The job market for analytics in general has been growing since 2009, and all languages show good momentum from 2009 to 2010;

2. SAS was the dominant player in this business, but the landscape is changing rapidly. R and Python see almost identical adoption; there is a minimal gap between the two, but I would not consider it significant;

3. SAS is not as favored by the market as before; its share has not only stalled but also shown a declining trend this year. This is very alarming.

Now, let's be more focused. First, we study "Statistics". The message is not good for SAS here. SAS used to be *the de facto* software for industrial statistical analysis, and this is reflected in the trend before 2010: SAS was so dominant that its market share was more than 3X that of R, and even more compared to Python. What's interesting is that while SAS saw some good days from 2010 to 2011, which is also reflected in its annual revenue, its market share has been on a steady decline since 2012. On the other hand, R and Python are still picking up without hesitation.

Turning to "machine learning", Python is the leader, followed by R; SAS trails at some distance. The seasonality of Python is a little different from that of R and SAS, while the latter two have very close seasonal dynamics. The hypothesis here is that Python is dominant in different industries, while R and SAS share similar industries.


## Friday, July 19, 2013

### I don't always do regression, but when I do, I do it in SAS 9.4

There are several exciting additions to SAS analytics products running on v9.4, especially the SAS/STAT high-performance procedures, where "high performance" refers to either single-machine multi-threading mode or full distributed mode.

**HPGENSELECT:** high-performance model fitting and variable selection for GLMs within the standard exponential family;

**HPLOGISTIC:** high-performance logistic regression;

**HPLMIXED:** high-performance linear mixed models; unfortunately it is still strictly an LMM, so it does not support the DIST= option in the MODEL statement;

**HPNLMOD:** high-performance nonlinear least squares or MLE;

**HPREG:** high-performance OLS; luckily HPREG supports the CLASS statement and includes the LAR/LASSO variable selection methods. On the other hand, HPREG does not support best-subsets variable selection;

**HPSPLIT:** a high-performance utility procedure that creates a decision tree model and saves the results in output data sets and files for use in SAS Enterprise Miner.

## Wednesday, May 08, 2013

### Finding the closest pair in data using PROC MODECLUS

**UPDATE:** Rick Wicklin kindly shared his visualization of the output to give a more intuitive sense of the results. Thanks. Here is the code; run it after my code below. Note that this is designed for K=2.

```
proc iml;
use out;      read all var {ID x y};  close out;
use neighbor; read all var {Nbor};    close neighbor;
Nbor = num(Nbor);   /* convert to numeric */
xEnd = x[Nbor];
yEnd = y[Nbor];
create Viz var {ID x y xEnd yEnd};  append;  close Viz;
quit;

proc sgplot data=Viz noautolegend;
scatter y=y x=x / datalabel=id;
vector x=xEnd y=yEnd / xorigin=x yOrigin=y;
run;
```

***************************************************************************

More often than not, an analyst wants to find the pair of observations that are the closest in terms of a certain metric, such as Euclidean distance. For example, on the popular SAS-L archive group, Randall Powers posted such a question, attempting to use PROC DISTANCE.

His idea is straightforward: first, calculate the pairwise distances among all observations; second, find the pair with the shortest distance. For the first step he was trying to use PROC DISTANCE, and then he planned to use a DATA step to search through the result. This is not workable for even a modest data size, say 100K observations, because PROC DISTANCE will generate a dense double-precision 100K-by-100K matrix (roughly 80 GB) and blow up your workstation. That is why Randall turned to SAS-L for help.

Well, SAS actually has an old PROC that can readily solve his problem. Of course, before a solution can be crafted, several issues have to be clarified.

First, what if some observations share the same closest observation? Do you allow the shared case, or do you need a mutually exclusive solution in which each observation has its own unique, non-shared counterpart as the closest point? In the latter case, how would you determine which one takes the shared observation: first come first served, or some other rule?

Second, when we talk about pairs, N should be an even number, but what if the data has an odd N? How do we deal with the leftover observation?

That being said, let's consider the simplest and most tolerant case: each observation is allowed to serve as the closest point of multiple other observations, and N is an even number.

The solution uses **PROC MODECLUS**. This is an old procedure for density-based clustering, but its built-in nearest-neighbor search algorithm and output capability make it the perfect candidate for this type of job. We use the example data from the "Getting Started" section.

The ODS OUTPUT statement directly outputs the list of nearest neighbors (closest points) for each observation; you have to specify either the **ALL** or the **NEIGHBOR** option in the PROC MODECLUS statement in order to use this functionality. In the same statement, we also specify **K=2**. K=2 means 2 nearest neighbors, but since this count includes the observation itself, each observation gets K-1=1 nearest neighbor. So if you specify K=3, you actually ask the procedure to find the 2 closest points for each observation at hand.

The good thing about the nearest-neighbor list data set is that it also contains the calculated distances. Therefore, in cases where you need to deal with the more complex situations listed above, such as non-shared nearest neighbors, you have room for post-processing. Here you can specify K=3 and PROC MODECLUS will output 2 neighbors for each observation, as below.

You can see that observation 3 can be matched to either observation 1 or observation 2, but matching with 2 yields the shortest distance. If this is the rule you are going to apply, your post-processing can work on this data set to implement it.
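The shortest-distance rule just described can be sketched with a sort followed by a first-dot selection (a minimal sketch; it assumes the ODS NEIGHBOR output data set has variables `ID`, `Nbor`, and `Distance`, which should be checked against the actual table):

```
/* Sketch: for each observation, keep only the neighbor with the smallest distance. */
proc sort data=neighbor out=nbor_sorted;
   by id distance;          /* within each ID, shortest distance comes first */
run;

data closest;
   set nbor_sorted;
   by id;
   if first.id;             /* keep the shortest-distance neighbor per ID */
run;
```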

Please feel free to verify the result. If you find an error, let me know or post it in the comments.

```
data example;
input x y @@;
ID=_n_;
datalines;
18 18 20 22 21 20 12 23 17 12 23 25 25 20 16 27
20 13 28 22 80 20 75 19 77 23 81 26 55 21 64 24
72 26 70 35 75 30 78 42 18 52 27 57 41 61 48 64
59 72 69 72 80 80 31 53 51 69 72 81
;
run;
ods select none;
ods output neighbor=neighbor; /* ODS output dataset */
proc modeclus data=example method=0 k=2 /*Find a pair of nearest neighbor*/
out=out
all /* required option in order to output Nearest Neighbors*/;
var x y;
id id;
run;
ods select all;
```

## Tuesday, March 26, 2013

### Large Scale Linear Mixed Model

Update at the end:

****************************;

Bob at r4stats.com claimed that a linear mixed model with over 5 million observations and 2 million levels of random effects was fit using the lme4 package in R.

I am always interested in large-scale mixed models like this and **would appreciate anyone who can point me to the example Bob quoted (update: Bob responded in a comment, please check it out)**. I also tried to Google this example, but all the examples returned are on a scale of hundreds up to 5 thousand levels of random effects, which is very common and reasonable.

While I don't have the data used in the quoted example, I decided to see how long it takes to fit a mixed model of similar scale with SAS's PROC HPMIXED. I borrowed Tianlin Wang's simulated data example from his paper "All the Cows in Canada: Massive Mixed Modeling with the HPMIXED Procedure in SAS 9.2", but I modified the code a little so that there will be **400K cows** in the random effect, and I used the SUBJECT= option to accelerate the computing with by-subject processing.

```
%let NFarm   = 1000;
%let NAnimal = %eval(&NFarm*400);
%let Va = 4.0;
%let Ve = 8.0;

data Sim;
   keep Species Farm Animal Yield;
   array BV{&NAnimal};
   array AnimalSpecies{&NAnimal};
   array AnimalFarm{&NAnimal};
   do i = 1 to &NAnimal;
      BV{i}            = sqrt(&Va)*rannor(12345);
      AnimalSpecies{i} = 1 + int(5*ranuni(12345));
      AnimalFarm{i}    = 1 + int(&NFarm*ranuni(12345));
   end;
   do i = 1 to 40*&NAnimal;
      Animal  = 1 + int(&NAnimal*ranuni(12345));
      Species = AnimalSpecies{Animal};
      Farm    = AnimalFarm{Animal};
      Yield   = 1 + Species + Farm + BV{Animal} + sqrt(&Ve)*rannor(12345);
      output;
   end;
run;

proc sort data=sim; by animal species farm; run;

options fullstimer;
ods listing close;
ods select none;
proc hpmixed data=Sim;
   class Species Farm Animal;
   model Yield = Species Farm*Species / ddfm=residual;
   random animal / cl;
   ods output SolutionR=EBV;
run;
ods listing;
```

For this particular model, it took more than **8 hours** to fit on a desktop with an i5-3570K CPU running at 3.8GHz. I certainly dare not test a case with 2 million levels of random effects and assume it would take a LONG, LONG time. Note that with the original 10K levels of random effects, it took only **20 seconds** on the same machine, and I do know the time consumption is more than super-linear in the number of random-effect levels, even with all the computing optimization in HPMIXED.

The *lme4* package supports sparse-matrix techniques only for the random-effect design matrix, but the HPMIXED procedure applies sparse-matrix techniques to the full mixed model equations and, in many applications, performs much better. Take the above example, but using only 150 farms and assuming 100 animals per farm; we fit the same mixed model using the *lme4* package and the HPMIXED procedure. The **lmer** function is called within SAS using the **%RScript** and **%CallR** macros (see here).

The left panel shows the status of the machine running the **lmer** function on the mixed model, and the right panel shows the status when the machine ran HPMIXED. Because *lme4* does not support sparse-matrix techniques for the fixed-effect design matrix, it takes a significant penalty in memory usage (12+ GB vs. 25 MB) and computing time (244 sec real time vs. 17 sec real time). In a later test using only random effects, HPMIXED used 5.7 seconds while **lmer** used 11.3 seconds.

Left: Running R; Right: Running SAS



## Tuesday, February 26, 2013

### Poor man's HPQLIM?

The Tobit model is a type of censored regression and one of the most important regression models you will encounter in business. Amemiya (1984) classified Tobit models into 5 categories; interested readers can refer to the SAS online documentation for details. In SAS, PROC QLIM is used to estimate various types of Tobit models as well as related classes of models, such as Heckman's selection model.

Jason (blog @ Analytics-in-writing) showed in a recent post that PROC QLIM is not designed for large data, meaning data sets with millions or more records, and that SAS developed HPQLIM, the high-performance version of QLIM, to deal with large-scale modeling within the Tobit framework. While I celebrate the rollout of HPQLIM, I think the example Jason used is poorly chosen. He used the standard Tobit model (Type 1), whereas he really should have used a more complicated one to demonstrate the full power of HPQLIM, because there is already a solution in SAS that fits the standard Tobit model on very large data sets very fast.

Before we dive into the example Jason mentioned, it is useful to review the concepts of censoring and truncation in regression analysis. Suppose there is a variable Y following a continuous distribution D, and a constant K. After a value y is sampled from Y:

if y < K (y > K) and we **discard** the whole record, then it is left (right) truncation;

if y < K (y > K) and we **report** y = K, then it is left (right) censoring.
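The standard (Type 1) Tobit model with left censoring at 0 can be written compactly as a latent-variable model whose observed response reports the bound instead of the latent value:

$$ y_i^* = X_i\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0,\sigma^2), \qquad y_i = \max(0,\, y_i^*) $$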

Jason showed that, when facing millions of records, PROC QLIM is very slow in estimating even the Type 1 Tobit model, the simplest case of left censoring at 0, and suggested using HPQLIM, which shows a significant improvement in computing time. Unfortunately, not every business can afford a High Performance (HPx) solution from SAS, but the Tobit model is so prevalent in observational studies that people may wonder whether there is a way around this within the SAS environment, but with much improved speed.

Yes, at least for the Type 1 Tobit model. The SAS manual provides an example of using PROC LIFEREG for the Type 1 Tobit model: since the model is left censored, its log likelihood is no different from that of a parametric survival model with the corresponding distribution:

$$ l = \sum_{i \in (y=0)} \log\left[ 1- \Phi(X_i\beta/\sigma)\right ] + \sum_{i \in (y>0)} \log\left[ \phi\left((y_i - X_i\beta)/\sigma\right)/\sigma \right ] $$

PROC LIFEREG is significantly faster than PROC QLIM because it is optimized for this particular type of model. To demonstrate the speed gain, I used data similar to Jason's: the training sample from KDD CUP 1998. This is the original data from which Jason's sample was drawn, with 95K observations and 400+ variables. Consistent with Jason's test, we use 10 randomly selected numerical variables in the model fitting and stack the data 100 times, so that the final data fed to PROC QLIM and PROC LIFEREG has about 9.5 million observations.

The SAS log below shows that, on a laptop with an i5-2520M CPU clocked at 2.5GHz and 4GB of memory, PROC QLIM used almost 39 minutes while PROC LIFEREG used less than 37 seconds, a more than 60-fold difference. In comparison, HPQLIM in Jason's example used 1/3 of the time of QLIM. Based on the real-time data, Jason's OC (old client) machine runs at about the same speed as my laptop using QLIM, so I would expect similar results from HPQLIM if I could have run one, namely about 12 minutes real time and 27 minutes CPU time. Still not comparable to LIFEREG's under one minute.

```
6092
6093  %let keepvars=target_: age wealth1 hit CARDPROM numprom cardpm12 numprm12
6094               RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT
6095  ;
6096  data pva;
6097     set train(keep=&keepvars);
6098     do i=1 to 100;
6099        output;
6100     end;
6101  run;

NOTE: There were 95412 observations read from the data set WORK.TRAIN.
NOTE: The data set WORK.PVA has 9541200 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           1:11.55
      cpu time            5.26 seconds

6102
6103  proc qlim data=pva;
6104     model target_d = age wealth1 hit CARDPROM numprom cardpm12 numprm12
6105                      RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT ;
6106     endogenous target_d ~ censored(lb=0);
6107  run;

NOTE: Convergence criterion (FCONV=2.220446E-16) satisfied.
NOTE: At least one element of the gradient is greater than 1e-3.
NOTE: PROCEDURE QLIM used (Total process time):
      real time           39:51.24
      cpu time            39:33.01

6108
6109  /* make it suitable for parametric survival model */
6110  data pva;
6111     set pva;
6112     if target_d=. then y=0; else y=target_d;
6113  run;

NOTE: There were 9541200 observations read from the data set WORK.PVA.
NOTE: The data set WORK.PVA has 9541200 observations and 16 variables.
NOTE: DATA statement used (Total process time):
      real time           1:57.75
      cpu time            11.82 seconds

6114
6115  proc lifereg data=pva;
6116     model (target_d, y) = age wealth1 hit CARDPROM numprom cardpm12 numprm12
6117                           RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT ;
6118  run;

NOTE: Algorithm converged.
NOTE: PROCEDURE LIFEREG used (Total process time):
      real time           36.53 seconds
      cpu time            13.06 seconds
```

One additional advantage of PROC LIFEREG is that it can also very easily handle observation-by-observation censoring, i.e., interval censoring as in survival analysis.
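In PROC LIFEREG, observation-by-observation censoring uses the two-variable response syntax; a minimal sketch (the data set and covariate names are hypothetical):

```
/* Sketch: interval censoring with one (lower, upper) pair per observation. */
/* lower=. : left censored at upper;  upper=. : right censored at lower;    */
/* lower=upper : exact observation;  lower<upper : interval censored.       */
proc lifereg data=mydata;
   model (lower, upper) = x1 x2 / dist=normal;
run;
```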

The disadvantage of PROC LIFEREG compared to PROC QLIM in the Tobit framework is that PROC LIFEREG only computes predicted values for the latent variable, so the expected observed (censored) response has to be computed in an extra step.

For example, the standard Tobit model usually assumes a normal distribution for the latent variable, and in this case the expected observed value can be calculated as: $$ E(Y_i) = \Phi(X_i\beta/\sigma)(X_i\beta + \sigma \lambda)$$

where $$\sigma$$ is the SCALE estimate and $$\lambda=\phi(X_i\beta/\sigma)/\Phi(X_i\beta/\sigma)$$, with $$\phi(\cdot), \Phi(\cdot)$$ denoting the density function and the cumulative distribution function of the standard normal distribution, respectively.

Alternatively, if a logistic distribution is assumed for the latent variable, then $$ E(Y_i) = \sigma \log(1 + \exp(X_i\beta/\sigma))$$
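These expected-value formulas can be computed in a single DATA step pass over the scored data (a minimal sketch for the normal case; it assumes the linear predictor was saved via `OUTPUT OUT=scored XBETA=xbeta` and the SCALE estimate was placed in a macro variable `sigma`, both hypothetical names):

```
/* Sketch: expected observed value under the normal Type 1 Tobit model,  */
/* E(Y) = Phi(xb/sigma) * (xb + sigma*lambda), with lambda = phi/Phi.    */
data scored_ey;
   set scored;                                   /* contains xbeta */
   sigma  = &sigma;                              /* SCALE estimate from LIFEREG */
   z      = xbeta / sigma;
   lambda = pdf('NORMAL', z) / cdf('NORMAL', z); /* inverse Mills ratio */
   ey     = cdf('NORMAL', z) * (xbeta + sigma*lambda);
run;
```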

Even with this extra step to calculate the necessary quantities, the total time is still significantly less than with PROC QLIM. As seen from the log, going through the whole 9.5-million-record data set to compute these quantities took less than 2 minutes, so the total time using PROC LIFEREG is under 3 minutes, still only about 8% of what PROC QLIM takes. Recall that in Jason's post, HPQLIM running on a Greenplum appliance with 35 worker nodes still took 1 minute 24 seconds to finish the fitting and scoring process on a data set with 7.7 million records.

Last but not least, for this particular example a Heckman selection model is more appropriate, because the intention to donate (the latent variable) is correlated with the amount of the donation if the subject decides to donate, while a standard Tobit model effectively assumes independence. Applying Full Information Maximum Likelihood (FIML) estimation of a Heckman selection model to such a large data set is where HPQLIM really shines, and there is no immediate replacement available in the current SAS solutions.

Reference:

SAS Institute, Inc., SAS/ETS User's Guide for v9.2, Cary, NC

SAS Institute, Inc., SAS/STAT User's Guide for v9.2, Cary, NC.

Amemiya, T. (1984), "Tobit Models: A Survey," Journal of Econometrics, 24, 3-61.

Jason (Blog @ Analytics-in-writing) in his recent post showed that PROC QLIM is not designed for large data, meaning for data set with millions or more records, and SAS developed HPQLIM, the High Performance version of QLIM, to deal with large scale modeling within Tobit model framework. While I celebrate the roll out of HPQLIM, I think the example Jason used is poorly chosen. He used standard Tobit model (type 1), while he really should use a more complicated one to demonstrate the full power of HPQLIM because there already is a solution in SAS to fit standard Tobit model on very large data set very fast.

Before we dive into the example Jason mentioned, it is useful to review the concept of censoring and truncation in regression analysis. Suppose there is a variable Y following a continuous distribution D and a constant K. After a value y is sampled from Y:

if y < K (y >

**discard**the whole record, then it is left (right) truncation;

if y < K (y >

**report**y=K, then it is left (right) censoring.

Jason showed that, when facing millions of records, PROC QLIM is very slow in estimating even the Type 1 Tobit model, the simplest left censoring at 0, and suggested to use HPQLIM which shows significant improvement in computing time. Unfortunately, not every business can afford High Performance (HPx) solution from SAS but Tobit model is so prevailing in observational study, people may wonder is there a way to get around this within the SAS environment but with much improved speed?

Yes, at least for the Type 1 Tobit model. The SAS manual provides an example using PROC LIFEREG for the Type 1 Tobit model: since the data are left censored, the log likelihood is no different from that of a parametric survival model with the corresponding distribution:

$$ l = \sum_{i \in (y_i=0)} \log\left [ 1- \Phi(X_i\beta/\sigma)\right ] + \sum_{i \in (y_i>0)} \log\left [ \phi\left((y_i - X_i\beta)/\sigma\right)/\sigma \right ] $$

PROC LIFEREG is significantly faster than PROC QLIM because it is optimized for this particular type of model. To demonstrate the speed gain, I used data similar to Jason's: the training sample from KDD CUP 1998. This is the original data from which Jason's sample was drawn, with 95K observations and 400+ variables. Consistent with Jason's test, we use 10 randomly selected numerical variables in the model fitting and stack the data 100 times, so that the final data fed to PROC QLIM and PROC LIFEREG has about 9.5 million observations.

The SAS log below shows that, on a laptop with an i5-2520M CPU clocked at 2.5GHz and 4GB of memory, PROC QLIM used almost 39 minutes while PROC LIFEREG used less than 37 seconds, a more-than-60-fold difference. In comparison, HPQLIM in Jason's example used 1/3 of the time of QLIM. Based on the real-time figures, Jason's OC (old client) machine runs at about the same speed as my laptop using QLIM, so I would expect similar results from HPQLIM if I could have run one, namely about 12 minutes real time and 27 minutes CPU time. Still not comparable to LIFEREG's under-one-minute time usage.

```
6092
6093  %let keepvars=target_: age wealth1 hit CARDPROM numprom cardpm12 numprm12
6094               RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT
6095  ;
6096  data pva;
6097     set train(keep=&keepvars);
6098     do i=1 to 100;
6099        output;
6100     end;
6101  run;

NOTE: There were 95412 observations read from the data set WORK.TRAIN.
NOTE: The data set WORK.PVA has 9541200 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           1:11.55
      cpu time            5.26 seconds

6102
6103  proc qlim data=pva;
6104     model target_d = age wealth1 hit CARDPROM numprom cardpm12 numprm12
6105                      RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT;
6106     endogenous target_d ~ censored(lb=0);
6107  run;

NOTE: Convergence criterion (FCONV=2.220446E-16) satisfied.
NOTE: At least one element of the gradient is greater than 1e-3.
NOTE: PROCEDURE QLIM used (Total process time):
      real time           39:51.24
      cpu time            39:33.01

6108
6109  /* make it suitable for parametric survival model */
6110  data pva;
6111     set pva;
6112     if target_d=. then y=0; else y=target_d;
6113  run;

NOTE: There were 9541200 observations read from the data set WORK.PVA.
NOTE: The data set WORK.PVA has 9541200 observations and 16 variables.
NOTE: DATA statement used (Total process time):
      real time           1:57.75
      cpu time            11.82 seconds

6114
6115  proc lifereg data=pva;
6116     model (target_d, y) = age wealth1 hit CARDPROM numprom cardpm12 numprm12
6117                           RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT;
6118  run;

NOTE: Algorithm converged.
NOTE: PROCEDURE LIFEREG used (Total process time):
      real time           36.53 seconds
      cpu time            13.06 seconds
```

An additional advantage of PROC LIFEREG is that it can also very easily handle observation-by-observation censoring, that is, interval censoring in survival analysis terms.

The disadvantage of PROC LIFEREG compared to PROC QLIM in the Tobit model framework is that PROC LIFEREG only computes predicted values for **uncensored** observations, but in a Tobit model the expected value of **censored** observations is of particular interest. To obtain it, the analyst has to use a DATA step to manually compute this value for each censored observation, but this is easy. For example, the standard Tobit model usually assumes a normal distribution for the latent variable, and in this case the expected observed value can be calculated as: $$ E(Y_i) = \Phi(X_i\beta/\sigma)(X_i\beta + \sigma \lambda)$$

where $$ \sigma=SCALE $$ is taken from the estimation, $$\lambda=\phi(X_i\beta/\sigma)/\Phi(X_i\beta/\sigma)$$ is the inverse Mills ratio, and $$\phi(), \Phi()$$ denote the density function and cumulative distribution function of the standard normal distribution, respectively.

Alternatively, if a logistic latent distribution is assumed, then $$ E(Y_i) = \sigma \log(1 + \exp(X_i\beta/\sigma))$$
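Putting the two formulas to work, the extra scoring step after PROC LIFEREG can be sketched as follows. This is a sketch under assumptions: it presumes a data set (here called `pva_pred`, an illustrative name) that already carries the linear predictor as `xb` (for instance from the OUTPUT statement's XBETA= option) and the scale estimate merged in as `scale`.

```
/* expected observed value for censored records under the normal Tobit */
data scored;
   set pva_pred;                 /* assumed: xb = X*beta, scale = sigma */
   z      = xb / scale;
   lambda = pdf('NORMAL', z) / cdf('NORMAL', z);   /* inverse Mills ratio */
   ey     = cdf('NORMAL', z) * (xb + scale*lambda);
   /* alternative under a logistic latent distribution */
   ey_logistic = scale * log(1 + exp(z));
run;
```

Both quantities are computed in a single pass over the data, which is why this step adds so little to the total run time.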

Even with this extra step to calculate the necessary quantities, the total time is still significantly smaller than with PROC QLIM. As seen from the log, going through the whole 9.5-million-record data set to compute these quantities took less than 2 minutes, so the total time using PROC LIFEREG is under 3 minutes, still only about 8% of what PROC QLIM takes. Recall that in Jason's post, HPQLIM running on a Greenplum appliance with 35 worker nodes still took 1 minute 24 seconds to finish the fitting and scoring process on a data set with 7.7 million records.

Last but not least, for this particular example a Heckman selection model is more appropriate, because the intention to donate (the latent variable) is correlated with the amount donated given that the subject decides to donate, while the standard Tobit model literally assumes independence. Applying Full Information Maximum Likelihood (FIML) estimation of the Heckman selection model to such a large data set is where HPQLIM really shines, and there is no immediate replacement available in the current SAS solution.

Reference:

SAS Institute, Inc., *SAS/ETS User's Guide for v9.2*, Cary, NC.

SAS Institute, Inc., *SAS/STAT User's Guide for v9.2*, Cary, NC.

Amemiya, T. (1984), "Tobit Models: A Survey," *Journal of Econometrics*, 24, 3-61.

## Monday, December 10, 2012

### Kaggle Digit Recognizer: SAS k-Nearest Neighbor solution

The data is pre-processed from raw images using the NIST standardization program, but it is still worth some extra effort to conduct more exploratory data analysis (EDA). Because this is pre-processed data, there are no missing data issues, and EDA will focus on descriptive statistics of the input variables. The code below reads in the data, which is stored in the folder "D:\DATA\MNIST":

```
%let folder=D:\Data\MNIST;

libname digits "&folder";

data digits.MNISTtrain;
   infile "&folder.\train.csv" lrecl=32567 dlm=',' firstobs=2;
   input label x1-x784;
run;

data digits.MNISTtest;
   infile "&folder.\test.csv" lrecl=32567 dlm=',' firstobs=2;
   input x1-x784;
run;
```

Figure 1 below shows the distribution of variances of the input variables in the training set. Apparently many input variables carry little information, and they should be removed before further data processing and actual modeling. Next we examine the Coefficient of Variation (CoV) of the input variables; CoV gives a more consistent way to compare across variables. The result is shown in figure 2. Note that variables with 0 variance have their CoV set to 0.

```
ods select none;
proc means data=digits.MNISTtrain;
   var x1-x784;
   output out=stat(where=(_STAT_ in ('STD', 'MEAN')));
run;
ods select all;

proc transpose data=stat out=stat_t;
   var _STAT_ x:;
run;

data stat_t;
   set stat_t;
   ID=_n_-1;
   if col2=0 then CoV=0; else CoV=col1/col2;
run;

proc sgplot data=stat_t;
   scatter y=col2 x=ID;
   label ID="X #" col2="StdErr";
run;

proc sgplot data=stat_t;
   scatter y=CoV x=ID;
   label ID="X #";
run;
```

Figure 1. StdErr by input features

Figure 2. Coefficient of Variation by input features

In this demonstration, we select features whose variance is above the 10% quantile. Even so, we still end up with more than 700 features. Note that in SAS the computing time of kNN classification increases superlinearly, and kNN itself suffers from the curse of dimensionality, so it is desirable to conduct dimension reduction before applying kNN classifiers. While there are many techniques to choose from, we go with SVD, which is both widely used and simple to implement in SAS; see here for sample code and examples. The code below applies SVD to the combined training and testing data. This is important because SVD is sensitive to the input data, and including both data sets in the computation does not violate any best-practice rules of predictive modeling, since no information from the dependent variable is used. The left eigenvectors are also standardized to have 0 mean and unit standard deviation.

```
data all;
   set digits.MNISTtrain(in=_1)
       digits.MNISTtest(in=_2);
   if _1 then group='T';
   else group='S';
run;

proc princomp data=all out=allpca(drop=x:) outstat=stat noint cov noprint;
   var x1-x784;
run;

proc standard data=allpca out=allpcastd mean=0 std=1;
   var Prin:;
run;

data allpcastd;
   set allpcastd;
   array _p{*} p1-p10 (10*0.1);
   cv=rantbl(97897, of _p[*]);
   drop p1-p10;
run;
```

After the data is compressed, we split it into 10 folds, flag each of them, and search for the best tuning parameters. In building a simple kNN classifier, there are two tuning parameters to decide: 1. which features to use, and 2. the number of neighbors to use. In this exercise we simplify the first problem to searching for the number of features, always starting from feature #1, and we calculate the error rate for each combination. Before starting a brute-force search, we should realize that searching the whole space of combinations of 700 features and 1 to 25 neighbors is not practical; it simply takes too much time, possibly over 2 days on a regular PC. But we can narrow down the range for the number of features by testing at every 10th value: 1 feature, 11 features, 21 features, ..., 201 features. In classification, kNN is more sensitive to the features used than to the number of neighbors. The code below executes this experiment and takes about 18 hours to finish on my PC, equipped with an i5-3570 CPU clocked at 3.8GHz. In the actual execution, I used my proprietary **%foreach** macro to distribute the job to 3 cores of the CPU, which shaved the total running time down to only 6 hours.

```
/*@@@@@@@@@@ USE 10-FOLD CV @@@@@@@@@@*/
%macro loopy;
   %do k=2 %to 8;
      %do i=1 %to 10;
         %let t0=%sysfunc(time());
         options nosource nonotes;
         proc discrim data=allpcastd(where=(group='T' & cv^=&i))
                      method=npar k=&k noprint
                      testdata=allpcastd(where=(group='T' & cv=&i))
                      testout=out(drop=Prin:)
                      ;
            class label;
            var Prin1-Prin30;
         run;
         proc freq data=out noprint;
            table label*_INTO_ / out=table;
         run;
         proc sql;
            create table _tmp as
            select &i as cv, &k as K, 31 as dim, sum(PERCENT) as correct
            from table
            where label=_INTO_
            ;
         quit;
         %let t1=%sysfunc(time());
         data _null_;
            dt=round(&t1-&t0, 0.1);
            call symput('dt', dt);
         run;
         %put k=&k i=&i time=&dt;
         %if %eval(&k+&i)=3 %then %do;
            data result; set _tmp; run;
         %end;
         %else %do;
            data result; set result _tmp; run;
         %end;
      %end;
   %end;
   options source notes;
%mend;

options nosource nonotes;
%loopy;
options source notes;

proc means data=result noprint;
   by k;
   var correct;
   output out=result_summary mean(correct)=mean std(correct)=std;
run;
```

Figure 3. Correct% vs. Number of Features by Number of Neighbors

Now we see that the best number of features likely lies somewhere between 21 and 41, while the number of neighbors has a much smaller influence except when only 2 NN are used. We are now in a good position to conduct a greedier search over the feature space: all combinations of number of features from 21 to 41 at single intervals and 3 to 7 neighbors, which results in a total of 120 iterations, each run over the folds; the code ran about 3 hours on my PC. The code below executes this exercise and the result is summarized in table 1. We can see that using up to 31 features and 3 nearest neighbors is the best choice for the data at hand.

```
%macro loopy;
   %do k=2 %to 7;
      %do d=21 %to 41;
         %do i=1 %to 5;
            %let t0=%sysfunc(time());
            options nosource nonotes;
            proc discrim data=allpcastd(where=(group='T' & cv^=&i))
                         method=npar k=&k noprint
                         testdata=allpcastd(where=(group='T'))
                         testout=out(drop=Prin:)
                         ;
               class label;
               var Prin1-Prin&d;
            run;
            proc freq data=out noprint;
               table label*_INTO_ / out=table;
            run;
            proc sql;
               create table _tmp as
               select &i as cv, &k as K, &d as dim, sum(PERCENT) as correct
               from table
               where label=_INTO_
               ;
            quit;
            %let t1=%sysfunc(time());
            data _null_;
               dt=round(&t1-&t0, 0.1);
               call symput('dt', dt);
            run;
            %put k=&k d=&d i=&i time=&dt;
            /* initialize RESULT on the very first iteration only */
            %if &k=2 & &d=21 & &i=1 %then %do;
               data result; set _tmp; run;
            %end;
            %else %do;
               data result; set result _tmp; run;
            %end;
         %end;
      %end;
   %end;
   options source notes;
%mend;

options nosource nonotes;
%loopy;
options source notes;

proc means data=result noprint nway;
   class k dim;
   var correct;
   output out=result_summary2 mean(correct)=mean std(correct)=std;
run;

proc means data=result_summary2 nway noprint;
   class k;
   var mean;
   output out=result_max idgroup(max(mean) out[1](mean std dim)=mean std dim);
run;

proc sgplot data=result;
   where i=.;
   dot dim / response=correct stat=mean limitstat=stddev numstd=1 group=k;
   label dim="# of Features";
run;
```

Figure 4. Correct% and Confidence Interval by # of features and # of neighbors.

Table 1. Tuning Parameters Selection

Now, re-scoring the data with the chosen tuning parameters, up to 33 features and 3-NN, we end up with an accuracy of **96.80%** on the leaderboard, which certainly is not the best result you can get from the kNN method, but it easily beats both the 1NN and randomForest benchmark results.

With a little more tweaking, the accuracy from a kNN with similar parameters can be boosted to 97.7%.


Labels: Data Mining, KNN, predictive modeling, PROC DISCRIM, SVD

## Sunday, November 25, 2012

### KNN Classification and Regression in SAS

The PDF is available here. A related post on kNN classification using SAS is here.

In data mining and predictive modeling, kNN refers to a memory-based (or instance-based) algorithm for classification and regression problems. It is a widely used algorithm with many successful applications in medical research, business applications, etc. In fact, according to Google Analytics, it is the second most viewed article on my SAS programming blog of all time, with more than 2400 views in the past year.

In classification problems, the label of a new object is determined by the labels of the closest training data points in the feature space. The determination process is either "majority voting" or "averaging", potentially with weighting. In "majority voting", the object is assigned the most frequent label among the k closest training examples. In "averaging", the object is not assigned a single label; instead, the ratio of each class among the k closest training data points is reported. In regression problems, the property of the object is obtained via a similar "averaging" process, where the predicted value of the object is the average value of the k closest training points.

In practice, both the "majority voting" and the "averaging" process can be refined by adding weights to the k closest training points, where the weights are inversely proportional to the distance between the object and the training point. In this way, the closest points have the biggest influence on the final result.
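As a sketch of the weighting idea (the data set `NEIGHBORS` and its columns `id`, `label`, `distance` are hypothetical: one row per scored point and neighbor, holding each neighbor's class label and distance), an inverse-distance weighted majority vote can be computed with PROC SQL:

```
/* inverse-distance weighted majority vote over the k nearest neighbors */
proc sql;
   create table votes as
   select id, label, sum(1/max(distance, 1e-8)) as weight   /* guard against 0 distance */
   from neighbors
   group by id, label;
quit;

proc sort data=votes; by id descending weight; run;

data predicted;     /* keep, per scored point, the label with the largest total weight */
   set votes;
   by id;
   if first.id;
run;
```

Dropping the `1/distance` factor and weighting every neighbor equally recovers plain majority voting.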

In fact, the SAS implementation of kNN classification weights the averaging process by volume size; interested readers are encouraged to read the manual for details. The kNN algorithm has several characteristics worth addressing, and understanding its advantages and disadvantages, in theory and in practice, will help practitioners better leverage this tool. Generally, the kNN algorithm has 4 advantages worth noting. It is simple to implement.

Theoretically, the kNN algorithm is very simple to implement. In the naive version, for every data point in the test sample, we directly compute the distances to all stored vectors and choose the k closest examples. It is, however, computationally intensive, especially as the size of the training set grows. Over the years, many nearest neighbor search algorithms have been proposed that seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes kNN computationally feasible even for large data sets.

In SAS, the k-d tree data structure and the associated search algorithm of Friedman et al. are implemented. It is analytically tractable: the asymptotic error rate of 1-NN classification is at most twice the Bayes error rate. It is highly adaptive to local information: because the kNN algorithm uses the closest data points for estimation, it can take full advantage of local information and form highly nonlinear, highly adaptive decision boundaries for each data point. It is easily parallelized: because it is instance based, each data point to be scored is checked against the training table for its k nearest neighbors independently of the others, so the search and scoring can be conducted in parallel.

On the other hand, the kNN algorithm has 3 disadvantages, which will also be addressed in detail in the section PRACTICAL ISSUES ON IMPLEMENTATION. It is computationally very intensive, because for each record to be classified and/or scored, the program has to scan through the available data points to find the nearest neighbors and then determine which ones to use for further calculation. It is demanding in storage, because all records in the training sample have to be stored and used whenever scoring is necessary. It is highly susceptible to the curse of dimensionality, because as the dimension increases, calculating distances and finding a region that contains at least one relevant record becomes difficult.

To conduct either kNN classification or kNN regression in SAS, there are two basic approaches. The first utilizes special options in certain SAS procedures to conduct the kNN analysis directly; I call it the "Direct Approach". The second first finds the nearest data points explicitly, leveraging the search capability of some SAS procedures, and then manually calculates the final response values, which is simple anyway; I call it the "Manual Approach". To take the manual approach, there are three PROCs capable of finding nearest neighbor points efficiently, namely

The following example code demonstrates the direct approach to kNN classification with PROC DISCRIM (the first line was lost in reposting and is reconstructed here from the statement-by-statement explanation below):

```
proc discrim data = train ......................[1]
        method = npar k=5 ......................[2]
        test = toscore .........................[3]
        testout = toscore_out ..................[4]
        ;
   class y; ....................................[5]
   var x1-x10; .................................[6]
run;
```

We explain the function of each statement below.

Statement [1] tells SAS that `PROC DISCRIM` should be called to process the data named *train*.

Statement [2] tells SAS to apply the kNN classification method using 5 nearest neighbors.

Statement [3] tells SAS to apply the classification rule to a test data set called *toscore*.

Statement [4] tells SAS to output classification results for the test data and to name the output data set *toscore_out*.

Statement [5] defines Y as the dependent variable to use in `PROC DISCRIM`.

Statement [6] tells SAS to use x1 to x10 as the input features in calculating the distance to neighbors.

kNN classification is computationally intensive. Based on the SAS manual [4], for the specific tree search algorithm implemented in `PROC DISCRIM`, the time usage, excluding I/O time, is roughly proportional to $$\log(N)*(N*P)$$, where N is the number of observations and P is the number of variables used. kNN is a memory-based method: when an analyst scores test data or new data in production, the whole raw table is loaded into memory during the scoring process. In computing, `PROC DISCRIM` uses memory approximately proportional to the second order of the number of variables, i.e., proportional to $$P^2$$. Therefore, it is interesting to see how the cost of computing, in terms of memory usage and time consumption, increases as the number of observations in the training data and the number of features grow.

### kNN Regression using SAS

The kNN technique can be applied to regression problems, too, but the coding in SAS is not as straightforward as in a classification problem. kNN regression uses the average value of the dependent variable over the selected nearest neighbors to generate the predicted value for a scoring data point. In SAS, we can use `PROC KRIGE2D` to conduct this analysis, but it needs more caution and tweaks. The following example code instructs SAS to do a kNN regression on variable Y using 5 nearest neighbors and 2 features, X1 and X2.
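The example code itself did not survive reposting; a sketch consistent with the statement-by-statement explanation that follows (data set names *train* and *toscore* are illustrative) would read roughly as:

```
proc krige2d data = train outest = pred .......[1]
   coordinate xc = x1 yc = x2; .................[2]
   predict var = y np = 5; .....................[3]
   model scale = 10 range = 1E-7 form = exp; ...[4]
   grid gdata = toscore xc = x1 yc = x2; .......[5]
run;
```

The numbered statements line up with the explanations of statements [2] through [5] below.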

Statement [2] tells `KRIGE2D` the names of the 2 coordinates to be used, or in the language of data mining, the 2 features to be used; in this example, X1 and X2.

Statement [3] tells `KRIGE2D` to use 5 nearest neighbors to predict the dependent variable's value at the current grid position. As discussed below, the number of nearest neighbors is a tuning parameter in this algorithm and should be determined using data-driven approaches, such as cross validation.

Statement [4] tells `KRIGE2D` what model to use in kriging via the `FORM=` option and the associated parameters, scale and range. Note that in order to conduct a kNN regression, it is required to specify a large scale value and a small range value; in the example above, scale is set to 10 while range is set to 1E-7. Typically, a scale value larger than 1 and a range value smaller than 1E-7 work well, and the type of model is irrelevant. The reason why this works is beyond the scope of this paper and will not be discussed here.

Statement [5] tells `KRIGE2D`, via the `GDATA=` option of the GRID statement, the name of the data set to be scored, *toscore*, and the corresponding names of the features (coordinates) to be used for scoring.

`PROC KRIGE2D` was designed to perform ordinary kriging in two dimensions, so it is not fully functional for standard kNN regression practice. There are two major issues with `PROC KRIGE2D` when it is applied in a kNN regression analysis.

The first problem is that `PROC KRIGE2D` accepts two and only two inputs as features, because it treats kNN regression as two-dimensional kriging. The second problem is due to the nature of the semivariogram model used in kriging: in cases where the data are dense and highly overlapped, `PROC KRIGE2D` will report singularity in the kriging system. However, these issues should not bother analysts in practice, for good reasons detailed below.

Overcome Fixed Number of Features in KRIGE2D:

In a real kNN regression application, the number of features ranges from 1 to hundreds if necessary, but `PROC KRIGE2D` requires exactly two features as inputs. When only 1 feature is used, or when more than 2 features are required, analysts can take a detour by pre-processing the data up front. We discuss these two cases separately below.

Only one feature. When only one feature is used, we can set up a dummy input, a constant number, in both the training data and the scoring data. The dummy variable cancels out in the distance calculation, so it has virtually no effect. The code below shows a demonstration:

```
data trainv / view=trainv;
   set train;
   dummy=1;
run;

data toscorev / view=toscorev;
   set toscore;
   dummy=1;
run;

proc krige2d data=trainv outest=pred outn=knn;
   coordinate xc=dummy yc=feature1;
   predict var=Y NP=5;
   model scale=10 range=1E-7 form=exp;
   grid gdata=toscorev xc=dummy yc=feature1;
run;
```

In the example code above, we used SAS data views to avoid rewriting the data sets and to save computing resources. This technique is explained in [view].

More than two features. When more than two features are used, there are two approaches. The first conducts dimension reduction up front, and `PROC KRIGE2D` takes the two derived dimensions as features to perform kNN regression. The second follows the ensemble principle [ESL] and randomly selects two features out of the pool of available features; the final kNN regression prediction is the average of the predictions from each of the ensemble regressions. The first approach can use the ensemble principle in practice, too. There are many dimension reduction techniques; among them the most widely used is Singular Value Decomposition (SVD), or its equivalent counterpart, Principal Components Analysis (PCA). Details of SVD and PCA are beyond the scope of this paper; interested readers can refer to [PCA], [Matrix].

The example code in Appendix 1 demonstrates the first method, where we conduct dimension reduction using PCA up front and use the two leading orthogonal bases for kNN regression. Note that in the example code, we first combine the train and score data sets, because PCA is sample sensitive [PCA], and using records from both the training and scoring data stabilizes the generated orthogonal bases.

The second method is demonstrated below. It is worth noting that when the number of features is small, say 10, it is good practice to select all pairwise combinations as input features and ensemble all results at the final stage. On the other hand, when the number of such combinations blows up as the number of features increases, a random sample of pairwise combinations works fine, just like in a RandomForest algorithm. In the code below, we assume 100 features, whose names are stored in a table called FEATURES, and the ensemble iterates 50 times, each with a pair of randomly selected features.
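A minimal sketch of such a random-pair ensemble is given below. It is a sketch under assumptions: the table FEATURES is assumed to have a character column `name` listing the feature names, the training and scoring data sets are called *train* and *toscore*, and the KRIGE2D OUTEST= data set is assumed to carry the prediction in the variable ESTIMATE; all macro and data set names are illustrative.

```
%macro knn_ensemble(reps=50, k=5);
   %local r;
   %do r = 1 %to &reps;
      /* randomly pick two distinct features for this ensemble member */
      proc surveyselect data=features out=pair n=2 seed=%eval(1000+&r) noprint;
      run;
      data _null_;
         set pair;
         call symputx(cats('f', _n_), name);   /* assumes column NAME in FEATURES */
      run;
      proc krige2d data=train outest=pred&r;
         coordinate xc=&f1 yc=&f2;
         predict var=y np=&k;
         model scale=10 range=1E-8 form=exp;
         grid gdata=toscore xc=&f1 yc=&f2;
      run;
   %end;
   /* average the &reps member predictions into the final ensemble estimate */
   data ensemble;
      merge %do r=1 %to &reps;
               pred&r(keep=estimate rename=(estimate=p&r))
            %end;;
      yhat = mean(of p1-p&reps);
   run;
%mend;
```

The one-to-one merge works because every member scores the same grid in the same order; averaging the member predictions follows the ensemble principle described above.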

The `MODEL SCALE=10 RANGE=1E-8 FORM=EXP;` statement is used in each iteration.

In extreme cases, while inconvenient, it is necessary to manually process the data containing the nearest neighboring points from

Optimize Storage. kNN classification requires a lot of storage because it is an in-memory algorithm: all the training information needed to score new data is loaded into memory to build a k-d tree at scoring time. Therefore, when the size of the training data is large and the number of variables used is large, the required memory will be non-trivial. According to the SAS/STAT manual, the required storage for a kNN classification is at least

1. Choose only the necessary variables for the kNN analysis. In many cases 2 or 3 features are actually enough. Alternatively, as in a RandomForest algorithm, multiple runs of the kNN algorithm can be conducted, each with a small subset of variables, and the classifiers or regression results ensembled into the final solution.

2. When the selected number of variables is still relatively large, a commonly used technique for dimension reduction is Singular Value Decomposition (SVD), or equivalently, Principal Component Analysis (PCA). Because the kNN algorithm is highly adaptive to local information, SVD or PCA won't suffer from the common criticism that the information in the dependent variable is not used during variable reduction. This technique is demonstrated in the digit recognition example below.

The kNN algorithm used in classification and regression suffers a lot from the curse of dimensionality; therefore, choosing only the necessary and most useful features for the problem at hand is critical for a successful application and implementation.

Optimize Computation. kNN classification is also computationally intensive. Unlike many classification methods, such as logistic regression, discriminant analysis and other parametric regressions, kNN has no condensed mathematical form to represent the information in the training data, and there is no explicit model training. When scoring is needed, the training data simply populates the search space with known results, and scoring amounts to a search-and-comparison process, which is very time consuming. In the most naive form of searching, the algorithm compares the current scoring data point to all data in the training sample, calculates the desired distance metric and selects the top k nearest neighbors for their response information. This is computationally redundant and intensive and does not leverage the distribution of the training sample well. Modern implementations rely on heuristic search algorithms, and one of the most significant is the k-d tree [kd-tree]. k-d tree stands for k-dimensional tree; it is a space-partitioning data structure that organizes data points in k-dimensional space for efficient search. In SAS,

The data is extracted using the NIST preprocessing routines [NIST] from normalized bitmaps of handwritten digits on a preprinted form. The original bitmaps are 32x32, divided into non-overlapping blocks of 4x4, and the number of on pixels is counted in each block. This generates an 8x8 input matrix where each element is an integer in the range 0..16. It reduces the dimensionality to 64 instead of 1024 and gives invariance to small distortions. In this project, the following steps are performed to fine-tune the kNN classifier.

First, the training and testing data in CSV format are imported at *Step 1*.

Then a baseline 1-NN classifier using all 64 features is built for comparison purposes at *Step 2*. This baseline classifier returns an error rate of 3.73%.

In *Step 3*, a fine-tuned kNN classifier is built, where the tuning parameters k, the number of closest points, and d, the number of features, are selected based on 10-fold cross validation.

In addition, instead of using the original 64 features, a Singular Value Decomposition (SVD) was conducted on the combined training and testing data sets to obtain derived features. This data pre-processing technique brings several advantages.

First, the derived features are ranked by how much variance in the feature space each explains, so that feature selection becomes simpler using CV. Second, SVD changes the original sparse feature space into a dense one, and in general fewer closest data points are required to obtain optimal performance in this case.

In *Step 4*, the selected kNN classifier, with k=4 closest data points and d=36 derived features, is applied to the testing data set, obtaining an error rate of 1.89%.

As an alternative to 10-fold CV, in *Step 5*, Leave-One-Out CV was used to fine-tune the kNN classifier, resulting in a parameter set of k=6 and d=38.

In *Step 6*, the new classifier was applied to the testing data set, resulting in an error rate of 1.4%. The 10-fold cross validation process is quite computationally intensive even for such a small data set: scanning across 900 different combinations of (k, d) pairs took almost 25 minutes on a mainstream PC. The payoff is worth it, however, as the error rate drops from 3.73% to 1.89%, a 49% improvement. On the other hand, the Leave-One-Out cross validation process took 22 minutes on the same PC, and the selected parameter pair is k=6, d=38, with an error rate of 2.06% on the same scoring data set, which is not as good as the result from 10-fold CV but still way better than the baseline.

The 10-fold CV is preferred to Leave-One-Out cross validation from a purely computational perspective, as the computing time of the latter increases quadratically and becomes infeasible when the number of observations exceeds 100K. Figure C1 shows the error rate as a function of k and d, while figure C2 zooms in for a closer look at the area where the number of features is between 20 and 40 and the number of closest data points is smaller than 5. We can see that globally the error rate is a relatively smooth function of the number of features, and the number of neighbors does not influence the error rate as much as the number of features does, while locally there is significant variation, as shown in figure C2. This implies that one heuristic way to scan for the best possible number of features is bisection search.

The complete SAS code that conducts the brute-force search for tuning parameters can be found in Appendix 1.

kNN Regression Example. As a demonstration of kNN regression, we used the *Galaxy* data set from the book *Elements of Statistical Learning 2*, which is available at [galaxy]. The galaxy data was used to demonstrate smoothing techniques in the book and was visualized in figure 6.8, where local linear regression with a nearest neighbor window of 15% of the data in R^2 was applied to predict the velocity data of galaxies.

In our example, k-Nearest Neighbor prediction is applied with two values of k: k=5 and k=48. k=48 is equivalent to using 15\% of the data as neighbors, while k=5 is arbitrarily chosen to demonstrate that a smaller number of neighbors produces a more accurate but higher-variance prediction, while a larger value of k produces a less accurate but lower-variance, hence smoother, prediction.

Combining the two predictions may in many cases produce the best result of all. In figure \ref{fig:galaxy}, 3-D views of the actual data and of smoothed surfaces using different smoothing parameters are shown. The upper left panel shows the raw data. Clearly, the galaxy has a star-like spatial shape with about 10 arms extending outward from the center. The upper right panel shows the smoothed 3-D surface using 5 nearest neighbors, while the lower left panel shows the smoothed 3-D surface using 48 nearest neighbors. Using fewer neighbors produces a rougher surface that adapts better locally where there are more data, such as along the galaxy arms, but in emptier regions the prediction is wilder. On the other hand, using 48 neighbors produces a much smoother surface, yet the galaxy arms are less visible. The lower right panel shows the average of these two surface maps: the averaged map keeps the galaxy arms clearly visible yet smooths out the empty space quite well.

The complete SAS code is in Appendix 2.

The kNN algorithm is an old but very powerful technique when used appropriately. We have shown that both kNN classification and kNN regression can be carried out within SAS/STAT very easily. The functions are basic but more than enough to handle many real-world problems. Both the theory and the practice were discussed, with example SAS code demonstrated. The discussion here, however, is only the tip of the iceberg: there are many advanced topics that cannot fit into this short article, including, but not limited to, customized distance functions and tangent-distance invariance. Interested readers are encouraged to read the related books and papers.

/*@@@@ --- Step 1 --- @@@@*/
libname digits "C:\_Data\Digits";

data digits.train;
   infile "C:\_Data\Digits\optdigits.tra" lrecl=512 dlm=',';
   input x1-x64 y;
run;

data digits.test;
   infile "C:\_Data\Digits\optdigits.tes" lrecl=512 dlm=',';
   input x1-x64 y;
run;

/*@@@@ --- Step 2 --- @@@@*/
proc discrim data=digits.train testdata=digits.test
             testout=testout
             method=npar k=1 noprint;
   class y;
   var x1-x64;
run;

proc freq data=testout;
   table y*_INTO_;
run;

/*@@@@ --- Step 3 --- @@@@*/
data all;
   set digits.train(in=_1) digits.test(in=_2);
   if _1 then Group='T';
   else Group='S';
run;

proc princomp data=all out=pca noprint noint cov;
   var x1-x64;
run;

proc standard data=pca out=pcaout mean=0 std=1;
   var Prin1-Prin62;
run;

/*@@@@@ 10-fold CV approach @@@@@@@@@@*/
data pcaout;
   set pcaout;
   array _xp{10} p1-p10 (10*0.1);
   cv=rantbl(2342342, of _xp[*]);
run;

proc freq data=pcaout; table cv; run;

%macro loopy;
ods select none;
options nonotes nosource;
%do k=1 %to 15;
   %do n=2 %to 60;
      %put k=&k n=&n;
      %do cv=1 %to 10;
         proc discrim data=pcaout(where=(group='T' & cv^=&cv))
                      testdata=pcaout(where=(group='T' & cv=&cv))
                      testout=testout
                      method=npar k=&k noprint;
            class y;
            var Prin1 - Prin&n;
         run;

         proc freq data=testout;
            table y*_INTO_ /out=result;
         run;

         proc sql;
            create table _tmp&cv as
            select sum(COUNT) as total,
                   sum((y=_INTO_)*COUNT) as correct,
                   &k as k, &n as n,
                   1-calculated correct / calculated total as error format percent7.3
            from result
            ;
         quit;
      %end;

      data _tmp;
         set %do cv=1 %to 10; _tmp&cv %end;;
      run;

      proc means data=_tmp noprint;
         var error;
         output out=_tmp mean(error)=error std(error)=error_std;
      run;

      data _tmp;
         length k n 8;
         set _tmp;
         k=&k; n=&n; drop _TYPE_;
      run;

      %if %eval(&k+&n=3) %then %do;
         data iter; set _tmp; run;
      %end;
      %else %do;
         proc append base=iter data=_tmp force; run;
      %end;
   %end;
%end;
ods select all; options notes source;
%mend;

dm log 'clear';
%let time0=%sysfunc(time());
%loopy;
%let time1=%sysfunc(time());

data _null_;
   dt=(&time1-&time0)/60; put dt= 'Min';
run;

data iter;
   set iter; format error percent7.4;
run;

proc transpose data=iter out=itert name=n prefix=n;
   by k;
   var error;
   id n;
run;

data itert;
   set itert;
   array _n{*} n2-n60;
   min=min(of _n[*]); max=max(of _n[*]); format min max percent7.4;
run;

/*@@@@ --- Step 4 --- @@@@*/
/*@@@@ use selected parameters to score the testing data @@@@*/
proc discrim data=pcaout(where=(group='T')) testdata=pcaout(where=(group='S'))
             testout=testout(drop=x: Prin:)
             method=npar k=4 noprint;
   class y;
   var Prin1-Prin36;
run;

proc freq data=testout;
   table y*_INTO_ /out=result;
run;

proc sql;
   create table temp as
   select sum(COUNT) as total,
          sum((y=_INTO_)*COUNT) as correct,
          4 as k, 36 as n,
          1-calculated correct / calculated total as error format percent7.3
   from result
   ;
quit;

/*@@@@ --- Step 5 --- @@@@*/
/*@@@@@ Use leave-one-out cross-validation approach @@@@@@*/
%macro loop;
%do k=1 %to 15;
   %do i=1 %to 62;
      proc discrim data=pcaout(where=(group='T'))
                   method=npar k=&k noprint crossvalidate
                   outcross=outcross out=out;
         class y;
         var Prin1-Prin&i;
      run;

      proc freq data=outcross noprint;
         table y*_INTO_ /out=table;
      run;

      proc sql;
         create table _tmp as
         select &k as K, &i as I, sum(PERCENT) as correct
         from table
         where y=_INTO_
         ;
      quit;

      %if %eval(&i+&k)=2 %then %do;
         data result; set _tmp; run;
      %end;
      %else %do;
         proc append base=result data=_tmp; run;
      %end;
   %end;
%end;
%mend;

options nonotes nosource;
%let time0=%sysfunc(time());
dm log 'clear';
%loop;
options notes source;
%let time1=%sysfunc(time());

data _null_;
   dt=(&time1-&time0)/60;
   put dt= 'Min';
run;

proc transpose data=result out=result_t name=i prefix=n;
   by k;
   var correct;
   id i;
run;

data result_t;
   set result_t;
   array _n{*} n2-n62;
   min=min(of _n[*]); max=max(of _n[*]);
   loc_min=whichN(min, of _n[*]); loc_max=whichN(max, of _n[*]);
   format min max best.;
run;

proc discrim data=pcaout(where=(group='T'))
             testdata=pcaout(where=(group='S'))
             testout=testout(drop=x: Prin:)
             method=npar k=6 noprint;
   class y;
   var Prin1-Prin38;
run;

proc freq data=testout;
   table y*_INTO_ /out=result;
run;

proc sql;
   create table temp as
   select sum(COUNT) as total,
          sum((y=_INTO_)*COUNT) as correct,
          6 as k, 38 as n,
          1-calculated correct / calculated total as error format percent7.3
   from result
   ;
quit;

proc import datafile="c:\galaxy.csv" dbms=csv out=galaxy replace;
run;

proc g3d data=galaxy;
   scatter east_west*north_south=velocity /rotate=25 size=1 noneedle tilt=45;
   format _numeric_ 8.2;
run;
quit;

%macro knnreg(k);
ods select none;
proc krige2d data=galaxy outest=knn_pred&k outn=neighbor;
   coordinates xc=east_west yc=north_south;
   predict var=velocity NP=&k;
   model scale=10 range=1E-8 form=exp;
   grid gdata=a(drop=velocity) xc=east_west yc=north_south;
run;
ods select all;

proc g3d data=knn_pred&k;  /* plot the predictions for this value of k */
   plot gxc*gyc=estimate / grid rotate=25 tilt=45
        ctop=cx993366
        cbottom=cx9999ff
        caxis=black;
   label gxc="east_west"
         gyc="north_south";
   format _numeric_ 8.2;
run;
%mend;

%knnreg(5);
%knnreg(48);

data knn_pred;
   merge knn_pred5(rename=(estimate=estimate5))
         knn_pred48(keep=Estimate rename=(estimate=estimate48));
   estimate=(estimate5+estimate48)*0.5;
run;

proc g3d data=knn_pred;
   plot gxc*gyc=estimate / grid rotate=25 tilt=45
        ctop=cx993366
        cbottom=cx9999ff
        caxis=black;
   label gxc="east_west"
         gyc="north_south";
   format _numeric_ 8.2;
run;

In data mining and predictive modeling, kNN refers to a memory-based (or instance-based) algorithm for classification and regression problems. It is a widely used algorithm with many successful applications in medical research, business, and other fields. In fact, according to Google Analytics, it is the second most viewed article on my SAS programming blog of all time, with more than 2400 views in the past year.

In classification problems, the label of a new object is determined by the labels of the closest training data points in the feature space. The determination process is either "majority voting" or "averaging", possibly with weighting. In majority voting, the object is assigned the label that is most frequent among its k closest training examples. In averaging, the object is not assigned a single label; instead, it receives the ratio of each class among its k closest training data points. In regression problems, the property of the object is obtained via a similar averaging process, where the value assigned to the object is the average response of its k closest training points.
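The three determination rules just described can be sketched in a few lines. This is an illustrative, language-agnostic sketch (not the SAS implementation); the inputs are simply the labels or responses of the k closest training points, however they were found.

```python
# Minimal sketches of the three kNN determination rules described above.
from collections import Counter

def majority_vote(neighbor_labels):
    # Label of the object = most frequent label among the k neighbors.
    return Counter(neighbor_labels).most_common(1)[0][0]

def class_ratios(neighbor_labels):
    # "Averaging" for classification: per-class ratios among the k neighbors.
    k = len(neighbor_labels)
    return {label: n / k for label, n in Counter(neighbor_labels).items()}

def knn_regress(neighbor_responses):
    # kNN regression: plain average of the k neighbors' response values.
    return sum(neighbor_responses) / len(neighbor_responses)
```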

In practice, both the majority voting and the averaging process can be refined by adding weights to the k closest training points, with weights inversely related to the distance between the object and each training point. In this way, the closest points have the biggest influence on the final result.

In fact, the SAS implementation of kNN classification weights the averaging process by volume size; interested readers are encouraged to read the manual for details. The kNN algorithm has several characteristics worth addressing, and understanding its advantages and disadvantages, in theory and in practice, will help practitioners better leverage this tool. Generally, the kNN algorithm has four advantages worth noting. First, it is simple to implement.

Theoretically, the kNN algorithm is very simple to implement. The naive version is straightforward: for every data point in the test sample, directly compute the desired distances to all stored vectors and choose the k examples with the shortest distances. This is, however, computationally intensive, especially as the size of the training set grows. Over the years, many nearest-neighbor search algorithms have been proposed that seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest-neighbor search algorithm makes kNN computationally feasible even for large data sets.
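The naive linear scan described above can be written in a handful of lines; this is an illustrative sketch with made-up toy data, not an efficient search structure.

```python
# Naive kNN search: compute the distance from the query to every stored
# training vector and keep the k smallest. O(N) distance evaluations per query.
import heapq
import math

def k_nearest(train, query, k):
    # train: list of (features, label) pairs; query: feature tuple.
    dists = ((math.dist(x, query), y) for x, y in train)
    return heapq.nsmallest(k, dists)  # k (distance, label) pairs, sorted

# Toy data: two 'a' points near the origin, one distant 'b' point.
train = [((0.0, 0.0), 'a'), ((1.0, 0.0), 'a'), ((5.0, 5.0), 'b')]
nearest = k_nearest(train, (0.2, 0.1), 2)
```

`heapq.nsmallest` avoids fully sorting all N distances, but the distance computation itself remains linear in N, which is exactly the cost that tree-based search structures try to avoid.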

In SAS, the k-d tree data structure and the associated search algorithm of Friedman et al. are implemented. Second, it is analytically tractable: the asymptotic error rate of 1-NN classification is at most twice the Bayes error. Third, it is highly adaptive to local information: because kNN uses the closest data points for estimation, it can take full advantage of local information and form highly nonlinear, highly adaptive decision boundaries for each data point. Fourth, it is easily parallelized: because the method is instance based, each data point to be scored is checked against the training table for its k nearest neighbors, and since each data point is independent of the others, the search and scoring can be conducted in parallel.

On the other hand, the kNN algorithm has three disadvantages, which are addressed in detail in the section PRACTICAL ISSUES ON IMPLEMENTATION. It is computationally very intensive, because for each record to be classified or scored, the program has to scan through the available data points to find the nearest neighbors and then determine which ones to use for further calculation. It is demanding in storage, because all records in the training sample have to be stored and used whenever scoring is necessary. And it is highly susceptible to the curse of dimensionality, because as the dimension increases, calculating distances and finding a neighborhood that contains even one relevant record becomes difficult.

To conduct either kNN classification or kNN regression in SAS, there are two basic approaches. The first utilizes special options in certain SAS procedures to conduct the kNN analysis directly; I call it the "Direct Approach". The second first finds the nearest data points explicitly, leveraging the search capability in some SAS procedures, and then manually calculates the final response values, which is simple anyway; I call it the "Manual Approach". For the manual approach, three PROCs are capable of finding nearest-neighbor points efficiently, namely `PROC FASTCLUS`, `PROC KRIGE2D`, and `PROC MODECLUS`. Each has its pros and cons, and all need data preparation; a discussion of this issue is beyond the scope of this paper. For the direct approach, `PROC DISCRIM` can be used directly for kNN classification, while `PROC KRIGE2D` can be used for kNN regression, even though the functionality SAS provides is only basic. For kNN-based classification, `PROC DISCRIM` offers an intuitive programming interface and rich options. For kNN-based regression, even though no dedicated procedure is available, `PROC KRIGE2D` can fulfill this task with some tweaks. In this section, we demonstrate how to conduct these two data mining tasks in SAS and address some closely related issues; for kNN-based regression, we discuss the details of how to tweak `PROC KRIGE2D`.

SAS offers kNN classification capability via `PROC DISCRIM`, by invoking the procedure options `METHOD=NPAR K=`. The `METHOD=NPAR` option asks SAS to use a non-parametric discrimination function; together with the `K=` option, `PROC DISCRIM` will use kNN classification, where `K=` tells SAS how many neighbors to use in determining the classification. A typical SAS program for kNN-based classification looks like the following:

proc discrim data = train ..............................[1]
             method = npar k=5 .........................[2]
             testdata = toscore ........................[3]
             testout = toscore_out .....................[4]
             ;
   class y; ............................................[5]
   var x1-x10; .........................................[6]
run;

We explain the function of each statement below.

Statement [1] tells SAS that \verb|PROC DISCRIM| should be called to process the data named \emph{train}.

Statement [2] tells SAS to apply kNN Classification method using 5 nearest neighbors.

Statement [3] tells SAS to apply the classification rule to a test data set called \emph{toscore}.

Statement [4] tells SAS to output the classification result for the test data and to name the output data set \emph{toscore\_out}.

Statement [5] defines Y as the dependent variable to use in \verb|PROC DISCRIM|.

Statement [6] tells SAS to use x1 to x10 as input features in calculating the distance to neighbors.

kNN classification is computationally intensive. Based on the SAS manual [4], for the specific tree-search algorithm implemented in \verb|PROC DISCRIM|, the time usage, excluding I/O time, is roughly proportional to $NP\log(N)$, where N is the number of observations and P is the number of variables used. kNN is a memory-based method: when an analyst wants to score the test data or new data in production, the whole raw table is loaded into memory during the scoring process. \verb|PROC DISCRIM| uses memory approximately proportional to the square of the number of variables, i.e. proportional to $P^2$. Therefore, it is interesting to see how the cost of computing, in terms of memory usage and time consumption, increases as the number of observations in the training data and the number of features grow.

\subsection{kNN Regression using SAS}

The kNN technique can be applied to regression problems too, but the coding in SAS is not as straightforward as in a classification problem. kNN regression uses the average value of the dependent variable over the selected nearest neighbors to generate the predicted value for a scoring data point. In SAS, we can use \verb|PROC KRIGE2D| to conduct this analysis, but it needs some caution and tweaks. The following example code instructs SAS to do a kNN regression on variable Y using 5 nearest neighbors and 2 features, X1 and X2.

```
proc krige2d data=train outest=pred outn=NNlist; ......[1]
   coordinates xc=X1 yc=X2; ...........................[2]
   predict var=Y NP=5; ................................[3]
   model scale=10 range=1E-7 form=exp; ................[4]
   grid gdata=toscore xc=X1 yc=X2; ....................[5]
run;
```

Statement [1] invokes the \verb|KRIGE2D| procedure, asks SAS to operate on the data named \emph{train}, and outputs the prediction to the data named \emph{pred} via the \texttt{OUTEST=} option. The \texttt{OUTN=} option asks SAS to output information about the selected nearest neighbors of each data point in the data set specified in the GRID statement. KRIGE2D directly outputs the prediction of the dependent variable in the \texttt{OUTEST=} data, but in the \texttt{OUTN=} data it outputs only the coordinates and the dependent-variable values of the selected closest data points. This is a disadvantage in the sense that, when cross validation is conducted, the \texttt{OUTN=} data does not contain the actual value of the dependent variable, so this data set has to be merged back to the validation part to obtain actual values before errors can be computed. Statement [2] gives \verb|KRIGE2D| the names of the two coordinates, or in the language of data mining, the two features to be used; in this example, X1 and X2.

Statement [3] tells \texttt{KRIGE2D} to use 5 nearest neighbors to predict the dependent variable's value at the current grid position. As discussed below, the number of nearest neighbors is a tuning parameter of this algorithm and should be determined using data-driven approaches, such as cross validation.

Statement [4] tells \texttt{KRIGE2D} what model to use in kriging, via the \texttt{FORM=} option and the associated parameters, scale and range. Note that, in order to conduct a kNN regression, it is required to specify a large scale value and a small range value; in the example above, scale is set to 10 while range is set to 1E-7. Typically, a scale value larger than 1 and a range value no larger than 1E-7 work well, and the type of model is irrelevant. The reason why this works is beyond the scope of this paper and we will not discuss it here.

Statement [5] tells \texttt{KRIGE2D}, via the \texttt{GDATA=} option, that the name of the data set to be scored is \emph{toscore}, together with the corresponding names of the features (coordinates) to be used for scoring.
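Under these settings, the quantity the procedure ends up computing is the plain kNN regression estimate: the average response of the k nearest training points. The following Python sketch shows that target computation on toy data; it is an illustration of the estimate itself, not of the KRIGE2D internals.

```python
# Plain kNN regression estimate: average of the responses of the k nearest
# training points, using Euclidean distance on the two feature coordinates.
import math

def knn_predict(train, query, k=5):
    # train: list of ((x1, x2), y) pairs; query: (x1, x2) tuple.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return sum(y for _, y in nearest) / k
```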

\verb|PROC KRIGE2D| was designed to perform ordinary kriging in two dimensions, so it is not fully functional for standard kNN regression practice. There are two major issues with \verb|PROC KRIGE2D| when it is applied to a kNN regression analysis.

The first is that \verb|PROC KRIGE2D| accepts exactly two inputs as features, because it treats kNN regression as two-dimensional kriging. The second is due to the nature of the semivariogram model used in kriging: in cases where the data are dense and highly overlapped, \verb|PROC KRIGE2D| will report singularity in the kriging system. In practice, however, these issues need not hinder the analysis, for the reasons detailed below.

Overcome Fixed Number of Features in KRIGE2D:

In a real kNN regression application, the number of features ranges from 1 to hundreds if necessary, but \verb|PROC KRIGE2D| requires exactly two features as inputs. When only 1 feature is used, or when more than 2 features are required, the analysts can take a detour by pre-processing the data up front. We will discuss these two cases separately below.

Only one feature. When only one feature is used, we can set up a dummy input, a constant number, in both the training data and the scoring data. The constant dummy cancels out in the distance calculation, so it has virtually no effect. The code below shows a demonstration:

data trainv/view=trainv;
   set train;
   dummy=1;
run;

data toscorev/view=toscorev;
   set toscore;
   dummy=1;
run;

proc krige2d data=trainv outest=pred outn=knn;
   coordinates xc=dummy yc=feature1;
   predict var=Y NP=5;
   model scale=10 range=1E-7 form=exp;
   grid gdata=toscorev xc=dummy yc=feature1;
run;

In the example code above, we use SAS data views to avoid rewriting the data sets and to save computing resources. This technique is explained in \cite{view}.
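The claim that a constant dummy coordinate is harmless is easy to verify: it contributes zero to every pairwise distance, so the neighbor ordering over the real feature is unchanged. A quick Python check on made-up one-dimensional data:

```python
# Verify that prepending a constant "dummy" coordinate does not change which
# training points are nearest: it adds zero to every squared distance.
import math

def order_by_distance(points, query):
    # Indices of points sorted by distance to the query.
    return sorted(range(len(points)), key=lambda i: math.dist(points[i], query))

feature1 = [3.0, 1.0, 7.0, 2.0]           # toy single-feature training values
one_d = [(x,) for x in feature1]          # the real feature alone
with_dummy = [(1.0, x) for x in feature1] # constant dummy prepended

assert order_by_distance(one_d, (2.2,)) == order_by_distance(with_dummy, (1.0, 2.2))
```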

More than two features. When more than two features are needed, there are two approaches. The first conducts dimension reduction up front, and \verb|PROC KRIGE2D| takes two derived dimensions as features to perform the kNN regression. The second follows the ensemble principle \cite{ESL} and randomly selects two features at a time from the pool of available features; the final kNN regression prediction is the average of the predictions from the individual ensemble regressions. In practice, the first approach can use the ensemble principle as well. There are many dimension reduction techniques; among them, the most widely used is Singular Value Decomposition (SVD), or its equivalent counterpart, Principal Component Analysis (PCA). Details of SVD and PCA are beyond the scope of this paper; interested readers can refer to \cite{PCA}, \cite{Matrix}.

The example code in Appendix 1 demonstrates the first method, where we conduct dimension reduction using PCA up front and use the two leading orthogonal bases for the kNN regression. Note that in the example code we first combine the training and scoring data sets, because PCA is sample sensitive \cite{PCA}, and using records from both data sets stabilizes the generated orthogonal bases.

The second method is demonstrated below. It is worth noting that when the number of features is small, say 10, it is good practice to use all pairwise combinations as input features and ensemble all the results at the final stage. When the number of such combinations blows up as the number of features increases, a random sample of pairwise combinations works fine, just as in the Random Forest algorithm. In the code below, we assume 100 features whose names are stored in a table called FEATURES, and the ensemble iterates 50 times, each time with a pair of randomly selected features.

%macro knnreg(train_dsn, score_dsn, feature1, feature2, loopid);
proc krige2d data=&train_dsn outest=pred&loopid outn=nn&loopid;
   coordinates xc=&feature1 yc=&feature2;
   predict var=Y NP=10;
   model scale=2 range=1E-8 form=exp;
   grid gdata=&score_dsn xc=&feature1 yc=&feature2;
run;
%mend;

%let nloops=50;
proc surveyselect data=features out=feature2 reps=&nloops sampsize=2 method=srs;
run;

data _null_;
   set feature2;
   by replicate;
   retain train_dsn 'train' score_dsn 'toscore';
   retain _id 0;
   array _f{2} $ _temporary_;
   if first.replicate then do;
      call missing(of _f[*]);
      _id=1;
   end;
   _f[_id]=varname; _id+1;
   if last.replicate then do;
      /* pass REPLICATE as the loop id so each iteration writes its own
         PRED&loopid / NN&loopid data sets */
      call execute('%knnreg(' || compress(train_dsn) || ','
                              || compress(score_dsn) || ','
                              || compress(_f[1]) || ','
                              || compress(_f[2]) || ','
                              || compress(put(replicate, best.)) || ')' );
   end;
run;

The \texttt{call execute} statement calls the pre-defined macro \%knnreg within a DATA step. For details about \texttt{call execute}, consult \cite{CallExecute}.

Singularity in the kriging process. When a scoring point is extremely close to at least one training record, the local ordinary kriging system will be singular and the estimation for that scoring point will be skipped. One solution is to set a large number for scale and a small number for range in the MODEL statement, for example:

MODEL SCALE=10 RANGE=1e-8 FORM=EXP;

In extreme cases, while inconvenient, it is necessary to manually process the data set of nearest neighboring points from `OUTN=` to generate the prediction. This turns out to be fairly simple, because ordinary kNN regression only requires averaging every k observations; the technique is demonstrated in Appendix 2. In such cases, it is unnecessary to output predictions from KRIGE2D directly, and that output can be suppressed by specifying `OUTEST=_NULL_`.
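The random-pairs ensemble described earlier can also be sketched outside SAS. This is an illustrative Python sketch under my own assumptions about the data layout (each training row is a feature tuple plus a response); it restricts the kNN regression to one feature pair at a time and averages the per-pair predictions.

```python
# Ensemble of 2-feature kNN regressions over randomly sampled feature pairs,
# mirroring the random-pairs idea: each pair yields a prediction, and the
# ensemble prediction is their average.
import math
import random

def knn2_predict(train, query, pair, k=3):
    # kNN regression restricted to one pair of feature indices.
    proj = lambda row: (row[pair[0]], row[pair[1]])
    nearest = sorted(train, key=lambda r: math.dist(proj(r[0]), proj(query)))[:k]
    return sum(y for _, y in nearest) / k

def ensemble_predict(train, query, n_features, n_loops=50, k=3, seed=2342342):
    rng = random.Random(seed)
    pairs = [tuple(rng.sample(range(n_features), 2)) for _ in range(n_loops)]
    return sum(knn2_predict(train, query, p, k) for p in pairs) / n_loops
```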

\subsection{Practical Issues on Implementation}

kNN is a proven technique in many classification and regression problems. To fully appreciate its power in real analyses, a couple of issues need to be addressed appropriately; they touch the pain points of efficiency, accuracy, robustness, and more. In the sections below, we first discuss efficiency, around storage and computation, and see how parallelization can improve it. Then we discuss model selection, which balances accuracy and robustness. Lastly, we discuss other miscellaneous issues that arise when applying kNN in practice.

Optimize Storage and Computation. The kNN algorithm is expensive in both storage and computation. It is therefore important to understand the factors that contribute to its complexity in time and space, and thus to optimize the implementation in SAS.

Optimize Storage. kNN classification requires a lot of storage because it is an in-memory algorithm: all the training information needed to score new data is loaded into memory to build a k-d tree at scoring time. Therefore, when the size of the training data is large, and when the number of variables used is large, the required memory is non-trivial. According to the SAS/STAT manual, the required storage for a kNN classification is at least

`c(32v+3l+128)+8v^2+104v+4l`

bytes, where v is the number of variables, c is the number of class levels of the dependent variable, and l is the length of the CLASS variable. The storage requirement is linear in the latter two factors, which are given by the problem at hand, but quadratic in the number of variables. To optimize storage, it is necessary to reduce the number of variables involved in the analysis.

1. Choose only the necessary variables for the kNN analysis. In many cases, 2 or 3 features are actually enough. Alternatively, as in the Random Forest algorithm, multiple runs of the kNN algorithm can be conducted, each with a small subset of variables, and these classification or regression results can be ensembled into the final solution.

2. When the number of selected variables is still relatively large, a commonly used technique for dimension reduction is Singular Value Decomposition (SVD) or, equivalently, Principal Component Analysis (PCA). Because the kNN algorithm is highly adaptive to local information, SVD or PCA does not suffer from the common criticism that the information in the dependent variable is ignored during variable reduction. This technique is demonstrated in the Digits Recognition example below.

kNN used in classification and regression suffers greatly from the curse of dimensionality; choosing only the necessary and most useful features for the problem at hand is therefore critical for a successful application and implementation.
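To make the storage bound quoted above concrete, we can evaluate it for an illustrative configuration resembling the digits example: v=64 variables, c=10 class levels, and an assumed CLASS-variable length of l=8 bytes (the value of l is my assumption, not taken from the paper).

```python
# Evaluate the SAS/STAT kNN storage bound c(32v+3l+128)+8v^2+104v+4l.
# v = number of variables, c = number of class levels, l = CLASS-variable length.

def knn_storage_bytes(v, c, l):
    return c * (32 * v + 3 * l + 128) + 8 * v ** 2 + 104 * v + 4 * l

bytes_needed = knn_storage_bytes(v=64, c=10, l=8)  # about 60 KB
```

Doubling v roughly quadruples the dominant 8v^2 term, which is the quadratic growth the text warns about.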

Optimize Computation. kNN classification is also computationally intensive. Unlike many classification methods, such as logistic regression, discriminant analysis, and other parametric regressions, kNN has no condensed mathematical form to represent the information in the training data, and it lacks an explicit model-training step. When scoring is needed, the training data simply populates the search space with known results, and scoring is nothing but a search-and-comparison process, which is very time consuming. In the most naive form of search, the algorithm compares the current scoring data point to all data in the training sample, calculates the desired distance metrics, and selects the top k nearest neighbors for their response information. This is computationally redundant and intensive, and it does not leverage the distributional information in the training sample well. Modern implementations rely on heuristic search algorithms, one of the most significant being the k-d tree \cite{kd-tree}. k-d tree stands for k-dimensional tree; it is a space-partitioning data structure that organizes data points in k-dimensional space for efficient search. In SAS, `PROC DISCRIM` stores data in a k-d tree structure. The time required to organize the observations into this tree structure is proportional to $vn\ln(n)$, and the time for each tree search is proportional to $\ln(n)$. Apparently, the way to reduce computational time is to reduce the number of data points n. On the other hand, it is advisable to keep enough data points in the training sample, because k-d trees are not suitable for efficiently finding nearest neighbors in high-dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should satisfy $N \gg 2^k$; otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated, the efficiency is no better than exhaustive search, and approximate nearest-neighbor methods should be used instead \cite{SAS-STAT}.

\subsection{Select the Optimal Values for Tuning Parameters}

kNN algorithms have two basic tuning parameters: the number of nearest neighbors k, and the dimension of the feature space d. The typical way to find the right values for these two parameters is brute-force computation, where the program searches through a grid of combinations of the two parameters and picks the best set based on cross-validation results. The k in kNN refers to the number of closest data points, an important tuning parameter; there is no set rule for choosing the right k, but usually when the data is dense k is set small, while when the data is sparse a relatively larger k should be used to reduce variance. In many cases the dimension is pre-determined; however, kNN is sensitive to the curse of dimensionality, and more often than not the features are derived values, such as principal component scores, so the analyst has to find which subset of features is enough for the problem at hand. According to Friedman et al., an objective, data-driven approach to choosing the right value of k is the way to go.
Typically, a cross-validation (CV) based approach is used, usually 5-fold or 10-fold CV. When using M-fold CV to choose the tuning parameters k and d, it is customary to check the mean and standard deviation of some model accuracy measure. For example, for classification problems both the error rate and the AUC can be used; for regression problems, RMSE is a good candidate. The code in the two examples below illustrates these techniques.

Parallelization. kNN classification and regression can be easily parallelized; the most significant applications are in cross-validation based tuning-parameter evaluation and in scoring. In the cross-validation process, the analyst can open M concurrent sessions, each covering a mutually exclusive set of tuning parameters; the simplest case is issuing 2 concurrent SAS sessions, one examining `k=1, ..., 5` while the other examines `k=6, ..., 11`. The actual implementation follows the idea of \cite{MultiThreading}. In scoring, because each data point is assumed to be independent of the others, the scoring program can split the scoring data set into several pieces, depending on the number of available CPUs, and a master control module simultaneously calls the scoring module on each piece.

\subsection{kNN in Action}

In this section, we demonstrate kNN algorithms in real applications. Both examples go through data introduction, data analysis, and modeling using the kNN algorithm, as well as selection of the tuning parameter k. The first example examines handwritten digits recognition, a 10-label classification problem. The second example examines a regression problem, estimating galaxy velocities.

\subsection{kNN Classification in Optical Digits}

One successful application of kNN classifiers is in digits recognition projects; see \cite{LeCun1998} for examples. In this example, I demonstrate a complete data analysis process using a kNN classifier to identify handwritten digits from Optical Digits images. The data is from the pre-processed Optical Digits images available at the UC Irvine Machine Learning data set repository and can be downloaded at the link here.

The data is extracted using the NIST preprocessing routines \cite{NIST} from normalized bitmaps of handwritten digits on a preprinted form. The original bitmaps are 32x32; they are divided into non-overlapping blocks of 4x4, and the number of on pixels is counted in each block. This generates an 8x8 input matrix where each element is an integer in the range 0..16. This reduces the dimensionality to 64 instead of 1024 and gives invariance to small distortions. In this project, the following steps are performed to fine-tune the kNN classifier.

First, the training and testing data in CSV format are imported at \emph{Step 1}.

Then a baseline 1-NN classifier using all 64 features is built for comparison purposes at \emph{Step 2}. This baseline classifier returns an error rate of 3.73\%.

In \emph{Step 3}, a fine-tuned kNN classifier is built, where the tuning parameters k, the number of closest points, and d, the number of features, are selected based on 10-fold Cross-Validation.

In addition, instead of using the original 64 features, a Singular Value Decomposition (SVD) is conducted on the combined training and testing data sets to obtain derived features. This data pre-processing technique brings several advantages.

First, the derived features are ranked by how much variance in the feature space each explains, so that selecting features using CV becomes simpler. Second, SVD turns the original sparse feature space into a dense one, and in general fewer closest data points are required to obtain optimal performance in that case.

In \emph{Step 4}, the selected kNN classifier, with k=4 closest data points and d=36 derived features, is applied to the testing data set, obtaining an error rate of 1.89\%.

As an alternative to 10-fold CV, in \emph{Step 5}, Leave-One-Out CV is used to fine-tune the kNN classifier, resulting in the parameter pair k=6 and d=38.

In \emph{Step 6}, the new classifier is applied to the testing data set, resulting in an error rate of 2.06\%. The 10-fold Cross-Validation process is quite computationally intensive even for such a small data set: scanning across 900 different combinations of (k, d) pairs took almost 25 minutes on a mainstream PC. The payoff is worth it, however, as the error rate drops from 3.73\% to 1.89\%, a 49\% improvement. The Leave-One-Out Cross-Validation process took 22 minutes on the same PC, and its selected parameter pair, k=6 and d=38, yields an error rate of 2.06\% on the same scoring data set, which is not as good as the result from 10-fold CV but still far better than the baseline.
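The scan time reported above can be reduced by the parallelization scheme described earlier. Below is a minimal sketch using SYSTASK to run two concurrent batch sessions; the executable path, the program name `knn_cv.sas`, and the convention of passing the k range through `-sysparm` are assumptions for illustration (XCMD must be enabled):

```sas
/* Sketch: split the k grid across two concurrent batch SAS sessions.
   Paths, program name, and the -sysparm convention are hypothetical. */
options noxwait;
systask command """C:\SAS\sas.exe"" -sysin knn_cv.sas -sysparm ""k=1-5""  -log cv1.log"
        nowait taskname=t1;
systask command """C:\SAS\sas.exe"" -sysin knn_cv.sas -sysparm ""k=6-11"" -log cv2.log"
        nowait taskname=t2;
waitfor _all_ t1 t2;   /* block until both sessions finish */
```

Each batch program would read `&sysparm` to decide which k values to evaluate, and the partial result tables can be appended afterward.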

From a purely computational perspective, 10-fold CV is preferred to Leave-One-Out cross-validation, as the computing time of the latter grows quadratically and becomes infeasible when the number of observations exceeds 100K. Figure \ref{fig:C1} shows the error rate as a function of k and d, while figure \ref{fig:C2} zooms in on the area where the number of features is between 20 and 40 and the number of closest data points is smaller than 5. Globally, the error rate is a relatively smooth function of the number of features, and the number of neighbors does not influence the error rate as much as the number of features does; locally, however, there is significant variation, as shown in figure \ref{fig:C2}. This implies that one heuristic way to scan for the best number of features is bisection search.
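A bisection-style (ternary) search over the number of features can be sketched as a macro. This is a heuristic that assumes the CV error is roughly unimodal in d; `%cv_error` is an assumed helper, not part of the appendix code, that runs one CV evaluation for the given (k, d) and leaves the error rate in the macro variable `err`:

```sas
/* Heuristic ternary search over d, assuming the CV error curve is
   roughly unimodal.  %cv_error(k=, d=) is a hypothetical helper that
   sets the global macro variable &err to the CV error rate. */
%macro dsearch(lo=2, hi=60, k=4);
   %do %while(%eval(&hi - &lo) > 2);
      %let m1 = %eval(&lo + (&hi - &lo)/3);
      %let m2 = %eval(&hi - (&hi - &lo)/3);
      %cv_error(k=&k, d=&m1);  %let e1 = &err;
      %cv_error(k=&k, d=&m2);  %let e2 = &err;
      %if %sysevalf(&e1 < &e2) %then %let hi = &m2;  /* minimum lies in [lo, m2] */
      %else %let lo = &m1;                           /* minimum lies in [m1, hi] */
   %end;
   %put NOTE- best number of features is near d=&lo-&hi;
%mend;
```

Each iteration discards a third of the bracket, so the number of CV evaluations is logarithmic in the width of the search range instead of linear as in the brute-force scan.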

The complete SAS code that conducts the brute-force search for tuning parameters can be found in Appendix 1.

*kNN Regression Example*. As a demonstration of kNN regression, we use the \emph{Galaxy} data set from the book \emph{The Elements of Statistical Learning} (2nd edition), which is available at \cite{galaxy}. The galaxy data was used in the book to demonstrate smoothing techniques and was visualized in its figure 6.8, where local linear regression with a nearest-neighbor window of 15\% of the data in $\textrm{R}^2$ was applied to predict velocity data of galaxies.

In our example, k-Nearest Neighbor prediction is applied with two values of k: k=5 and k=48. k=48 is equivalent to using 15\% of the data as neighbors, while k=5 is arbitrarily chosen to demonstrate that a smaller number of neighbors produces locally more adaptive but higher-variance predictions, while a larger value of k produces less locally adaptive but lower-variance, hence smoother, predictions.

Combining the two predictions may in many cases produce better results than either alone. Figure \ref{fig:galaxy} shows 3-D views of the actual data and the smoothed surfaces under different smoothing parameters. The upper left panel shows the raw data: the galaxy has a star-like spatial shape, with about 10 arms extending outward from the center. The upper right panel shows the smoothed 3-D surface using 5 nearest neighbors, while the lower left panel shows the smoothed 3-D surface using 48 nearest neighbors. Using fewer neighbors produces a rougher surface that adapts better locally where there are more data, such as along the galaxy arms, but in emptier regions the prediction is wilder. On the other hand, using 48 neighbors produces a much smoother surface, yet the galaxy arms are less visible. The lower right panel shows the average of these two surface maps; the averaged map keeps the galaxy arms clearly visible yet smooths out the empty space quite well.

The complete SAS code is in Appendix 2.

The kNN algorithm is an old but very powerful technique when used appropriately. It has been shown that both kNN classification and kNN regression can be carried out very easily within SAS/STAT. The functions are basic but more than enough to handle many real-world problems. Both the theory and the practice are discussed, with example SAS code demonstrated. The discussion here, however, is only the tip of the iceberg; many advanced topics cannot fit into this short article, including, but not limited to, customized distance functions, tangent invariance, etc. Interested readers are encouraged to read the related books and papers.

*Appendix 1.*

```sas
/*@@@@ --- Step 1: import the training and testing data --- @@@@*/
libname digits "C:\_Data\Digits";

data digits.train;
   infile "C:\_Data\Digits\optdigits.tra" lrecl=512 dlm=',';
   input x1-x64 y;
run;

data digits.test;
   infile "C:\_Data\Digits\optdigits.tes" lrecl=512 dlm=',';
   input x1-x64 y;
run;

/*@@@@ --- Step 2: baseline 1-NN classifier on all 64 features --- @@@@*/
proc discrim data=digits.train testdata=digits.test
             testout=testout
             method=npar k=1 noprint;
   class y;
   var x1-x64;
run;

proc freq data=testout;
   table y*_INTO_;
run;
```

```sas
/*@@@@ --- Step 3: derived features and 10-fold CV tuning --- @@@@*/
data all;
   set digits.train(in=_1) digits.test(in=_2);
   if _1 then Group='T';
   else Group='S';
run;

proc princomp data=all out=pca noprint noint cov;
   var x1-x64;
run;

proc standard data=pca out=pcaout mean=0 std=1;
   var Prin1-Prin62;
run;

/*@@@@@ 10-fold CV approach @@@@@@@@@@*/
data pcaout;
   set pcaout;
   array _xp{10} p1-p10 (10*0.1);
   cv=rantbl(2342342, of _xp[*]);
run;

proc freq data=pcaout; table cv; run;

%macro loopy;
   ods select none;
   options nonotes nosource;
   %do k=1 %to 15;
      %do n=2 %to 60;
         %put k=&k n=&n;
         %do cv=1 %to 10;
            proc discrim data=pcaout(where=(group='T' & cv^=&cv))
                         testdata=pcaout(where=(group='T' & cv=&cv))
                         testout=testout
                         method=npar k=&k noprint;
               class y;
               var Prin1 - Prin&n;
            run;
            proc freq data=testout;
               table y*_INTO_ /out=result;
            run;
            proc sql;
               create table _tmp&cv as
               select sum(COUNT) as total,
                      sum((y=_INTO_)*COUNT) as correct,
                      &k as k, &n as n,
                      1-calculated correct / calculated total as error format percent7.3
               from result
               ;
            quit;
         %end;
         data _tmp;
            set %do cv=1 %to 10; _tmp&cv %end;;
         run;
         proc means data=_tmp noprint;
            var error;
            output out=_tmp mean(error)=error std(error)=error_std;
         run;
         data _tmp;
            length k n 8;
            set _tmp;
            k=&k; n=&n; drop _TYPE_;
         run;
         %if %eval(&k+&n=3) %then %do;
            data iter; set _tmp; run;
         %end;
         %else %do;
            proc append base=iter data=_tmp force; run;
         %end;
      %end;
   %end;
   ods select all; options notes source;
%mend;

dm log 'clear';
%let time0=%sysfunc(time());
%loopy;
%let time1=%sysfunc(time());

data _null_;
   dt=(&time1-&time0)/60; put dt= 'Min';
run;

data iter;
   set iter; format error percent7.4;
run;

proc transpose data=iter out=itert name=n prefix=n;
   by k;
   var error;
   id n;
run;

data itert;
   set itert;
   array _n{*} n2-n60;
   min=min(of _n[*]); max=max(of _n[*]); format min max percent7.4;
run;

/*@@@@ --- Step 4: score the testing data with the selected parameters --- @@@@*/
proc discrim data=pcaout(where=(group='T')) testdata=pcaout(where=(group='S'))
             testout=testout(drop=x: Prin:)
             method=npar k=4 noprint;
   class y;
   var Prin1-Prin36;
run;

proc freq data=testout;
   table y*_INTO_ /out=result;
run;

proc sql;
   create table temp as
   select sum(COUNT) as total,
          sum((y=_INTO_)*COUNT) as correct,
          4 as k, 36 as n,
          1-calculated correct / calculated total as error format percent7.3
   from result
   ;
quit;
```

```sas
/*@@@@ --- Step 5: leave-one-out cross-validation approach --- @@@@*/
%macro loop;
   %do k=1 %to 15;
      %do i=1 %to 62;
         proc discrim data=pcaout(where=(group='T'))
                      method=npar k=&k noprint crossvalidate
                      outcross=outcross out=out;
            class y;
            var Prin1-Prin&i;
         run;
         proc freq data=outcross noprint;
            table y*_INTO_ /out=table;
         run;
         proc sql;
            create table _tmp as
            select &k as K, &i as I, sum(PERCENT) as correct
            from table
            where y=_INTO_
            ;
         quit;
         %if %eval(&i+&k)=2 %then %do;
            data result; set _tmp; run;
         %end;
         %else %do;
            proc append base=result data=_tmp; run;
         %end;
      %end;
   %end;
%mend;

options nonotes nosource;
%let time0=%sysfunc(time());
dm log 'clear';
%loop;
options notes source;
%let time1=%sysfunc(time());

data _null_;
   dt=(&time1-&time0)/60;
   put dt= 'Min';
run;

proc transpose data=result out=result_t name=i prefix=n;
   by k;
   var correct;
   id i;
run;

data result_t;
   set result_t;
   array _n{*} n2-n62;
   min=min(of _n[*]); max=max(of _n[*]);
   loc_min=whichN(min, of _n[*]); loc_max=whichN(max, of _n[*]);
   format min max best.;
run;

/*@@@@ --- Step 6: score the testing data with the LOOCV-selected parameters --- @@@@*/
proc discrim data=pcaout(where=(group='T'))
             testdata=pcaout(where=(group='S'))
             testout=testout(drop=x: Prin:)
             method=npar k=6 noprint;
   class y;
   var Prin1-Prin38;
run;

proc freq data=testout;
   table y*_INTO_ /out=result;
run;

proc sql;
   create table temp as
   select sum(COUNT) as total,
          sum((y=_INTO_)*COUNT) as correct,
          6 as k, 38 as n,
          1-calculated correct / calculated total as error format percent7.3
   from result
   ;
quit;
```

*Appendix 2.*

```sas
/*@@@@ --- Import the galaxy data and view the raw surface --- @@@@*/
proc import datafile="c:\galaxy.csv" dbms=csv out=galaxy replace;
run;

proc g3d data=galaxy;
   scatter east_west*north_south=velocity /rotate=25 size=1 noneedle tilt=45;
   format _numeric_ 8.2;
run;
quit;

/*@@@@ --- kNN regression via PROC KRIGE2D using the k nearest points --- @@@@*/
%macro knnreg(k);
   ods select none;
   proc krige2D data=galaxy outest=knn_pred&k outn=neighbor;
      coordinates xc=east_west yc=north_south;
      predict var=velocity NP=&k;
      model scale=10 range=1E-8 form=exp;
      /* prediction grid at the original data locations */
      grid gdata=galaxy(drop=velocity) xc=east_west yc=north_south;
   run;
   ods select all;
   proc g3d data=knn_pred&k;
      plot gxc*gyc=estimate / grid rotate=25 tilt=45
           ctop=cx993366
           cbottom=cx9999ff
           caxis=black;
      label gxc="east_west"
            gyc="north_south";
      format _numeric_ 8.2;
   run;
%mend;

%knnreg(5);
%knnreg(48);

/*@@@@ --- Average the k=5 and k=48 predictions --- @@@@*/
data knn_pred;
   merge knn_pred5(rename=(estimate=estimate5))
         knn_pred48(keep=Estimate rename=(estimate=estimate48));
   estimate=(estimate5+estimate48)*0.5;
run;

proc g3d data=knn_pred;
   plot gxc*gyc=estimate / grid rotate=25 tilt=45
        ctop=cx993366
        cbottom=cx9999ff
        caxis=black;
   label gxc="east_west"
         gyc="north_south";
   format _numeric_ 8.2;
run;
```
