Friday, April 30, 2010

Conduct R analysis within SAS


R is attractive to statistical analysts for its ease of use and ready access to packages implementing modern methodologies. If you license SAS/IML, you can submit R commands from within the SAS/IML environment; see Rick's post here. Unfortunately, not all analysts have IML licensed. To work around this limitation, I propose the following technique to submit R statements from a Base SAS environment.

To exchange data between R and SAS, I push the SAS data set to a CSV file that R can read. Alternatively, with a little more coding, we can leverage the 'sas7bdat' R package to read SAS data sets directly into R; see Charlie Huang's blog here for a demonstration.


%macro RScript(Rscript);
/* Write the in-stream R statements that follow the macro call to the
   text file &Rscript; the DATA step is intentionally left open so the
   caller supplies the R code in a CARDS4 block terminated by ;;;; */
data _null_;
     file "&Rscript";
     infile cards;
     input;
     put _infile_;
%mend;

 
%macro CallR(Rscript, Rlog);
/* Run R in batch mode via SYSTASK, writing R's output to &Rlog;
   adjust the path to R.exe for your own installation */
systask command "C:\Progra~1\R\R-2.8.0\bin\R.exe CMD BATCH --vanilla --quiet
                    &Rscript  &Rlog "
        taskname=rjob1  wait  status=rjobstatus1;
%mend;

/****************************/
data a;
     length i 4;
     array _x{100} x1-x100;
     do i=1 to 300;
        do j=1 to dim(_x); _x[j]=rannor(98765); end;
        output;    drop j;
     end;
run;

proc export data=a  outfile="c:\a.csv"  dbms=csv; run;

%RScript(c:\rscript.r)
cards4;
dcsv <- read.csv('c:/a.csv', header=T);
dsvd<-svd(dcsv[,2:101]);
dsvd$u[1:5, 1:5];
dsvd$d[1:8];
;;;;
run;

%CallR(c:/rscript.r, c:/rlog1.txt);

data _null_;   /* echo the R output back into the SAS log */
     infile "c:\rlog1.txt";
     input;
     put _infile_;
run;

Wednesday, April 21, 2010

Improving the Boost macros of Prof. W. Rayens and Dr. K. Johnson

In Chapter 2 of the book "Pharmaceutical Statistics Using SAS: A Practical Guide" (SAS Press), Prof. W. Rayens and Dr. K. Johnson presented their SAS implementation of boosting algorithms, including AdaBoost, RealBoost, GentleBoost, and LogitBoost. The original SAS macro can be found here.

Their macro uses PROC IML and is preferred by my colleagues because, unlike my DATA step implementation, it is not I/O bound. Its main shortcoming is inefficiency, which is understandable since I think it was written for demonstration rather than for serious applications: the program performs many redundant calculations as well as inefficient matrix operations. As a result, the macro cannot handle a data set with more than 5,000 observations. On my PC [Intel E6750 2.66GHz, 7.4 Gflops/core, 4GB memory], it took 1m41s to run 10 iterations on a data set with 2,000 observations and 10 numerical variables, consuming 158MB of memory. Note that resource consumption increases quadratically with the number of observations: in another experiment with 4,000 observations, the macro consumed 752MB of memory and took 6m53s for 10 iterations on 10 variables. My colleagues asked whether I could improve this SAS implementation enough to make it usable in industrial applications, where data sets with >10K observations and hundreds of features are more than common.

In their implementation, an upper-triangular matrix and a lower-triangular matrix are used to obtain cumulative weighted sums from either direction of the sorted data, which makes both memory consumption and computation time grow quadratically. PROC IML, however, has a built-in function called CUSUM that computes cumulative sums lightning fast. To replace the cumbersome matrix operations with CUSUM, we only need to pay attention to the change in dimensions. Since no huge triangular matrices are involved, RAM consumption also drops significantly, which in turn means we can process bigger data [see the experiments below]. I tested speed and memory consumption using AdaBoost on 2,000 observations and 10 variables, with 10 iterations; before and after the improvement we see:
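To see why this substitution matters, here is a minimal NumPy sketch (my own illustration, not part of the SAS macro): a cumulative weighted sum obtained by multiplying a lower-triangular matrix of ones with the weight vector, in the style of the original implementation, versus a single cumulative-sum call.

```python
import numpy as np

n = 2000
rng = np.random.default_rng(98765)
w = rng.random(n)   # observation weights, assumed already in sorted order

# Triangular-matrix approach (the original style): row i of L @ w equals
# w[0] + ... + w[i]. Building and multiplying the n x n matrix costs
# O(n^2) in both time and memory.
L = np.tril(np.ones((n, n)))
cums_matrix = L @ w

# CUSUM approach: the identical vector in O(n) time with O(n) memory.
cums_cusum = np.cumsum(w)

assert np.allclose(cums_matrix, cums_cusum)
```

In PROC IML the second form is simply cusum(w), which is exactly the substitution made in the improved macro.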

2000 Obs:


4000 Obs:

Considering the differences shown above: with 2,000 observations, the new macro used less than 0.3s and merely 1.1MB of memory, while the original used 102.17s and 158.3MB. With 4,000 observations, the original macro used 752MB and 6m53s; in comparison, the new macro used 0.5s and 1.9MB. Memory and time consumption are no longer O(n^2) but rather O(n), where n is the number of observations. Both macros produced almost identical results in the two experiments.


With 500K observations, the original macro was not able to proceed, but the improved one finished successfully in about 70s and took up only 194MB of memory:


With the much-reduced processing time, analysts are now able to apply more computationally intensive algorithms, such as bagging, on top of the boosting algorithms.

I also improved the REGSPLIT_IML subroutine. Yesterday's version reduced time usage by 69% compared to the original macro [1m21s vs 25s on 2K records] but still scaled quadratically with the number of observations, although memory consumption was much reduced and grew only linearly with the size of the data. After carefully studying the calculations and formulas involved, I further decomposed the matrix operations into more efficient mathematical calculations; time usage is now reduced to only 0.9% of the original macro [329.84s vs 2.59s on 4K records] and grows only linearly with the number of records. We are therefore able, in practice, to use GentleBoost and LogitBoost [preferred over AdaBoost/RealBoost in many cases] for predictive modeling projects.
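The decomposition behind this second improvement can be illustrated in NumPy (my own sketch; the function name regression_stump_split is hypothetical, and the sketch assumes distinct feature values rather than being a drop-in replacement for the subroutine): expanding the weighted SSE of every candidate split leaves only cumulative sums of w, w*y, and a constant sum of w*y^2, so the whole criterion vector comes out of a few O(n) operations.

```python
import numpy as np

def regression_stump_split(x, y, w):
    """Best weighted least-squares split on one feature using only cumsums."""
    order = np.argsort(x, kind="stable")
    xs, ys, ws = x[order], y[order], w[order]
    cw  = np.cumsum(ws)[:-1]          # left-side weight at each of n-1 splits
    cwy = np.cumsum(ws * ys)[:-1]     # left-side weighted sum of y
    W, WY = ws.sum(), (ws * ys).sum()
    # Expanding sum w*(y - mean)^2 on each side gives
    #   sum(w*y^2) - (sum wy)_L^2/(sum w)_L - (sum wy)_R^2/(sum w)_R,
    # a vector over all candidate splits, computed without any n x n matrix.
    sse = np.sum(ws * ys**2) - cwy**2 / cw - (WY - cwy)**2 / (W - cw)
    k = int(np.argmin(sse))
    cut = (xs[k] + xs[k + 1]) / 2
    return cut, cwy[k] / cw[k], (WY - cwy[k]) / (W - cw[k]), sse[k]

# a perfectly separable toy example
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
cut, pred_left, pred_right, sse = regression_stump_split(x, y, np.ones(4))
print(cut, pred_left, pred_right)   # 2.5 0.0 1.0
```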

First Improvement:

Second Improvement:

Using the most recent version of the macro, on a PC with an E6320 1.86GHz CPU (5.5 Gflops/core) and 4GB memory, we observe the following performance benchmark (another 15% improvement over what's shown above):


Replace SPLIT_IML subroutine with the following code.


/*************************************************************************
This is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 or 3 of the License
(at your option).

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
**************************************************************************/
    start split_iml(x,y,w,g_info,out_type, y_pred);
        n = nrow(x);
        p = ncol(x);
        gini_min = 2;
        gini_var = 0;
        gini_cut = 0;
        y_pred = repeat(0,n,1);
        wsum = sum(w);
        ywsum = sum(y#w);  
        ywsum1 = wsum - ywsum; 
        do j=1 to p;
            x_curr = x[,j]||y||w;
            b=x_curr;  
            x_curr[rank(x[,j]),]=b;  free b;
            x_sort = x_curr[,1]; 
            y_sort = x_curr[,2]; 
            w_sort = x_curr[,3];
            yw_sort = y_sort#w_sort;

            /* left-side cumulative sums at each of the n-1 candidate splits */
            yw_cusum = cusum(yw_sort[1:(n-1)]);
            lpwt = cusum(w_sort[1:(n-1)]);
            lpwt = lpwt#(lpwt >= 2*CONSTANT('SMALL')) + 
                   (lpwt < 2*CONSTANT('SMALL'))*2*CONSTANT('SMALL');

            p1_L = yw_cusum # (1/lpwt);   /* weighted P(y=1) on the left  */
            gini = yw_cusum # (1-p1_L);   /* left impurity: w_L*p1*(1-p1) */

            rpwt = wsum - lpwt;
            rpwt = rpwt#(rpwt >= 2*CONSTANT('SMALL')) + 
                   (rpwt < 2*CONSTANT('SMALL'))*2*CONSTANT('SMALL');

            yw_cusum = ywsum - yw_cusum;
            p1_R = yw_cusum # (1/rpwt);   /* weighted P(y=1) on the right */

            gini = gini + yw_cusum # (1-p1_R);

            free lpwt  rpwt  yw_cusum;

            g_min=gini[><];  g_loc=gini[>:<];

            if g_min < gini_min then do;
                gini_min=g_min;
                gini_var = j;
                gini_cut = (x_sort[g_loc] + x_sort[g_loc+1]) / 2;
                p1_RH = p1_R[g_loc];
                p0_RH = 1-p1_R[g_loc];
                p1_LH = p1_L[g_loc];
                p0_LH = 1-p1_L[g_loc];

                c_R = 0;
                if p1_RH > 0.5 then c_R = 1;
                c_L = 0;
                if p1_LH > 0.5 then c_L = 1;
            end;
        end;
        g_info = gini_var||gini_min||gini_cut||p0_LH||p1_LH||c_L||p0_RH||p1_RH||c_R;
        if out_type = 1 then 
           y_pred = (x[, gini_var] <=gini_cut)*c_L + 
                    (x[, gini_var] > gini_cut) *c_R
        ;
       
        if out_type=2 then
           y_pred[, 1] =( x[, gini_var]<=gini_cut) * ( (c_L=0)*(1-p0_LH) + (c_L=1)*p1_LH) +
                        ( x[, gini_var] >  gini_cut) * ( (c_R=0)*(1-p0_RH) + (c_R=1)*p1_RH)
        ;
 
 
    finish split_iml;
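For readers without an IML license, the same CUSUM-based split search can be sketched in NumPy (my own illustration; gini_stump_split is a hypothetical name, the sketch assumes distinct feature values and y coded 0/1, and it is not a drop-in replacement for the subroutine above):

```python
import numpy as np

def gini_stump_split(x, y, w):
    """Best weighted Gini split on a single feature via cumulative sums."""
    order = np.argsort(x, kind="stable")
    xs, ys, ws = x[order], y[order], w[order]
    w_left  = np.cumsum(ws)[:-1]        # total weight left of each of n-1 splits
    wy_left = np.cumsum(ws * ys)[:-1]   # class-1 weight on the left
    W, WY = ws.sum(), (ws * ys).sum()
    p1_left  = wy_left / w_left
    p1_right = (WY - wy_left) / (W - w_left)
    # weighted impurity: w_L*p1_L*(1-p1_L) + w_R*p1_R*(1-p1_R),
    # written as (class-1 weight)*(1-p1) on each side, as in the IML code above
    gini = wy_left * (1 - p1_left) + (WY - wy_left) * (1 - p1_right)
    k = int(np.argmin(gini))
    cut = (xs[k] + xs[k + 1]) / 2
    return cut, p1_left[k], p1_right[k], gini[k]

# a perfectly separable toy example
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
cut, pL, pR, g = gini_stump_split(x, y, np.ones(4))
print(cut, pL, pR, g)   # 2.5 0.0 1.0 0.0
```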
Replace REGSPLIT_IML subroutine with the following code.

/*************************************************************************
This is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 or 3 of the License
(at your option).

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
**************************************************************************/
    start regsplit_iml(x,y,w,j_info,y_pred);
        n = nrow(x);
        p = ncol(x);
        min_css = 10000000000000;
        y_pred = repeat(0,n,1);
        wy2sum = sum( w#y#y );
        wsum = sum(w);
        ywsum = sum(y#w);
        do j=1 to p;
            /* sort the current feature together with y and w */
            x_curr = x[,j]||y||w;
            b = x_curr;
            x_curr[rank(x[,j]),] = b;   free b;
            x_sort = x_curr[,1];
            y_sort = x_curr[,2];
            w_sort = x_curr[,3];

            yw_sort = y_sort#w_sort;

            /* left-side cumulative sums at each of the n-1 candidate splits */
            yw_cusum = cusum(yw_sort[1:(n-1)]);
            lpwt = cusum(w_sort[1:(n-1)]);
            lpwt = lpwt#(lpwt > constant('SMALL')) + 
                   constant('SMALL')#(lpwt <= constant('SMALL'));
            p1_L = yw_cusum # (1/lpwt);             /* left weighted mean of y  */

            rpwt = wsum - lpwt;
            rpwt = rpwt#(rpwt > constant('MACEPS')) + 
                   constant('MACEPS')#(rpwt <= constant('MACEPS'));
            p1_R = (ywsum - yw_cusum) # (1/rpwt);   /* right weighted mean of y */

            /* weighted SSE at every split, via the decomposition
               sum(w*y**2) - 2*sum(w*y*pred) + sum(w*pred**2) */
            lpwt = cusum(w_sort);  rpwt = cusum(yw_sort);
            css = wy2sum + p1_L##2#lpwt[1:(n-1)] + p1_R##2#(wsum - lpwt[1:(n-1)]) -
                  2*(p1_L#rpwt[1:(n-1)] + p1_R#(ywsum - rpwt[1:(n-1)]));

            free lpwt  rpwt  yw_cusum;
            css_min = css[><];  css_loc = css[>:<];

            if css_min < min_css then do;
                min_css = css_min;
                cut_val = (x_sort[css_loc] + x_sort[css_loc+1]) / 2;
                reg_var = j;
                ypred_L = sum(yw_sort[1:css_loc]) / sum(w_sort[1:css_loc]);
                ypred_R = sum(yw_sort[css_loc+1:n]) / sum(w_sort[css_loc+1:n]);
                y_pred = ypred_L*(x[,j] < cut_val) + ypred_R*(x[,j] >= cut_val);
                j_info = reg_var||min_css||cut_val||ypred_L||ypred_R;
            end;
        end;
    finish regsplit_iml;
Reference:
Dmitrienko, Alex, Christy Chuang-Stein, and Ralph D’Agostino, eds. Pharmaceutical Statistics Using SAS®: A Practical Guide. Cary, NC: SAS Institute Inc., 2007.
