PageTabs

Monday, January 30, 2012

Random Number Seeds: NOT only the first one matters!


Today, Rick (blog @ here) wrote an article about random number seed in SAS to be used in random number functions in DATA Step. Rick noticed when multiple random number functions are called using different seeds, only the first one matters.

This is so true. In fact, SAS Manual also has a comprehensive writting on this issue, namely, how to control seeds for each iteration of the random number generation process and how to generatie multiple statistically independent streams of random numbers, see here. In fact, sasCommunity.org also has an article about this issue, see here.

To echo Rick's post,  there is a way to control the seed so that NOT only the first one matters: use CALL routines which by theory will generate computationally independent random number sequence. But if the two seeds are too close, the generated sequences may not be statistically independent. Again, refer to the SAS manual for details.




data normal;
   seed1 = 11111;
   seed2 = 22222;
   seed3 = 383333;
   do i = 1 to 1000;
      call rannor(seed1, x1);
      call rannor(seed2, x2);   
      call rannor(seed3, x3);
   x4=rannor(seed2);
   x5=rannor(seed3);
      output;
   end;
run;

proc sgscatter data=normal;
   matrix x1-x5/ markerattrs = (size = 1);
run; 
 

Wednesday, November 16, 2011

Using PROC CANCORR to solve large scale PLS problem



Partial Least Square (PLS) is a powerful tool for discriminant analysis with large number of predictors [1].

PLS extracts latent factors that maximize the covariance between independent variables and dependent variables. This process is equivalent to Generalized Eigenvalue Decomposition of the following formula [2]:
$$X'HXw =\phi X'Xw $$. For PLS $$H=Y'Y$$ Note that Canonical Correlation Analysis (CCA) follows the same generalized eigenvalue decomposition problem, specifically, for CCA, $$H=Y'(YY')^{-1}Y$$.

In SAS, PROC PLS implements 2 forms of PLS, namely the original NIPALS [2] and SIMPLS [3]. When there is only one dependent variable, the two algorithms generate the same output.PLS is a computationally very demanding algorithm. While powerful, when the dimension of the problem at hand becomes very large, PROC PLS will encounter issues such as insufficient memory and very long computing time. There is a rescue when only one dependent variable Y presents. In this case, CCA and PLS differ only up to a fixed scale parameter. Therefore we can use PROC CANCORR, which is very scalable and multithreaded, to solve the PLS problem. The obtained weights and loadings will not be the same but the difference is only up to a fixed scale parameter.

In the follow log, we demonstrate the behavior of PROC PLS and PROC CANCORR on a server with 4GB accessible memory when the number of independent variable is 5000 and sample size is 100K. PROC PLS reported insufficient memory and stopped computing in 45seconds after exhausting all accessible memory, while PROC CANCORR continued and finished computation in slightly more than 7 minutes. Both procedures used up 3.92GB memory available to SAS from the system. Also note that PROC CANCORR used more than 33minutes of CPU time, indicating its very good scale up capability in a multi-core environment.

Referece:
[1] Barker, M and Rayens, W (2003), "Partial Least Squares for Discrimination", Journal of  Chemometrics, 17, 166-173

[2] Sun, L; Ji, S; Yu, S; and Ye, J (2009), "On the Equivalence Between Canonical Correlation Analysis and Orthonormalized Partial Least Squares", In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI 2009).

[3] Wold, H. (1966), “Estimation of Principal Components and Related Models by Iterative Least Squares,” in P. R. Krishnaiah, ed., Multivariate Analysis, New York: Academic Press.

[4] de Jong, S. (1993), “SIMPLS: An Alternative Approach to Partial Least Squares Regression,” Chemometrics and Intelligent Laboratory Systems, 18, 251–263.



NOTE: PROCEDURE PRINTTO used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      Memory                            3920274k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:11:22 PM
      Page Faults                       0
      Page Reclaims                     9
      Page Swaps                        0
      Voluntary Context Switches        1
      Involuntary Context Switches      1
      Block Input Operations            0
      Block Output Operations           0
      

16         options fullstimer;
17         data x;
18              length id y x: 8;
19           array x{5000};
20              do id=1 to 1E5;
21              y=rannor(0);
22              do j=1 to dim(x);
23              x[j]=rannor(0);
24           end;
25           output;
26           drop j;
27           end;
28         run;

NOTE: The data set WORK.X has 100000 observations and 5002 variables.
NOTE: Compressing data set WORK.X increased size by 0.06 percent. 
      Compressed is 100066 pages; un-compressed would require 100009 pages.
NOTE: DATA statement used (Total process time):
      real time           1:07.58
      user cpu time       1:02.41
      system cpu time     5.16 seconds
      Memory                            3920274k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:12:29 PM
      Page Faults                       0
      Page Reclaims                     510
      Page Swaps                        0
      Voluntary Context Switches        63
      Involuntary Context Switches      107
      Block Input Operations            0
      Block Output Operations           0
      

29         
30         
31         proc pls data=x method=simpls noprint;
32              model y =x1-x5000;
33         run;

ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 100000 observations read from the data set WORK.X.
NOTE: PROCEDURE PLS used (Total process time):
      real time           45.89 seconds
      user cpu time       40.11 seconds
      system cpu time     5.77 seconds
      Memory                            3920284k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:13:15 PM
      Page Faults                       0
      Page Reclaims                     978075
      Page Swaps                        0
      Voluntary Context Switches        91
      Involuntary Context Switches      162
      Block Input Operations            0
      Block Output Operations           0
      
35         proc cancorr data=x noprint;
36              var y;
37           with x1-x5000;
38         run;

NOTE: PROCEDURE CANCORR used (Total process time):
      real time           7:02.85
      user cpu time       33:13.85
      system cpu time     4.40 seconds
      Memory                            3920284k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:20:18 PM
      Page Faults                       0
      Page Reclaims                     126339
      Page Swaps                        0
      Voluntary Context Switches        2096
      Involuntary Context Switches      83359
      Block Input Operations            0
      Block Output Operations           0
      

39    
40         proc printto; run;

Thursday, October 06, 2011

Obtain Trace of the Projection Matrix in a Linear Regression

Recently, I am working on coding in SAS for a set of regularized regressions and need to compute trace of the projection matrix:
$$ S=X(X'X + \lambda I)^{-1}X' $$.

Wikipedia has a well written introduction to Trace @ here.

To obtain the inverse of matrix (X'X + \lambda I) in SAS/STAT, there are multiple ways:
1. Build the SSCP matrix first, then inverse it using the following methods:
    1.1. Using SVD based method, shown @ here but not demonstrated in this post;
    1.2. Using PROC REG, shown @ here , and also demonstrated in the code below;
    1.3. Using SWEEP operator, shown @ here under Method 0 and Method 1;
2. Since (X'X + \lambda I) is the SSCP for a Ridge Regression with ridge parameter=\lambda, it is handy to directly use the ODS OUTPUT InvXPX= statement to obtain the inversed matrix, when X is appended by a diagonal matrix of \lambda*I_{p, p}. SAS code is demonstrated below;

Trace of the project matrix S is a key concept in modern regression analysis. For example, the effective degree of freedom of the model in a regularized linear regression is trace(S)/N, see [1] for details. For another example, in approximating leave-one-out-cross-validation using GCV, trace(S)/N is a key component (formula 7.52 of ~[1]).

Check the reference and the cited works therein for more information.

Ref:
1. Hastie et al., The Elements of Statistical Learning, 2nd Ed.


/* Use the SocEcon example data created in
   Example A.1: A TYPE=CORR Data Set Produced by PROC CORR
   on page 8153 of SAS/STAT 9.22 User Doc
*/
data SocEcon;
input Pop School Employ Services House;
datalines;
5700 12.8 2500 270 25000
1000 10.9 600 10 10000
3400 8.8 1000 10 9000
3800 13.6 1700 140 25000
4000 12.8 1600 140 25000
8200 8.3 2600 60 12000
1200 11.4 400 10 16000
9100 11.5 3300 60 14000
9900 12.5 3400 180 18000
9600 13.7 3600 390 25000
9600 9.6 3300 80 12000
9400 11.4 4000 100 13000
;
run;

%let depvar=HOUSE;
%let covars=pop school employ services;
%let lambda=1;
/* Below is the way to obtain trace(S), where S is the project matrix in a (regularized) linar regression. 
   For further information, check pp.68, pp.153 of Elements of Statistical Learning,2nd Ed.
*/

/* For details about TYPE=SSCP special SAS data, consult:
  Appendix A: Special SAS Data Sets, pp.8159 of SAS/STAT 9.22 User's Guide
*/
proc corr data=SocEcon sscp out=xtx(where=(_TYPE_='SSCP'))  noprint;
     var &covars;
run;


data xtx2;
     set xtx;
  array _n{*} _numeric_;
  array _i{*} i1-i5 (5*0);
  do j=1 to 5;
     if j=_n_ then _i[_n_]=λ
  else _i[j]=0;
  end;
  _n[_n_]=_n[_n_]+1;
  drop j _TYPE_  _NAME_;
run;

/* Obtain the inverse of (XTX+\lambda*I)
   Note that we explicitly specified Intercept term in the 
   covariate list and fit a model without implicit intercept 
   term in the model.
*/
proc reg data=xtx2  
         outest=S0(type=SSCP
                   drop=i1-i5 _MODEL_  _DEPVAR_  _RMSE_)
         singular=1E-17;
     model i1-i5 = Intercept &covars / noint   noprint;
run;quit;

data S0;
     set S0; 
  length _NAME_ $8;
  _NAME_=cats('X_', _n_);
run;

proc score data=SocEcon  score=S0  out=XS0(keep=X_:)  type=parms;
     var &covars;
run;     

data XS0X;
     merge XS0  X;
  array _x1{*} X_:;
  array _x0{*} intercept pop school employ services;
  do i=1 to dim(_X1);
     _x1[i]=_x1[i]*_x0[i];
  end;
  rowSum=sum(of _x1[*]);
  keep rowSum;
run;
proc means data=XS0X  noprint;
     var rowSum;
  output out=trace  sum(rowSum)=Trace;
run;


Verify the result using R:


> socecon<-read.csv('c:/socecon.csv', header=T)
> x<-as.matrix(cbind(1, socecon[,-5]))
> xtx<-t(x)%*%x
> phi<-xtx+diag(rep(1, 5))
> 
> # method 1. 
> S<-x%*%solve(phi)*x
> sum(S)
[1] 4.077865
> # method 2. 
> S<-(x%*%solve(phi))%*%t(x)
> sum(diag(S))
[1] 4.077865
>

Of course, except for method 1.2 shown above, we can also use Method 2 mentioned above, and obtain the same inverse matrix:


data SocEcon2 /view=SocEcon2;
     set x  end=eof;
  array x{5} Intercept &covars (5*0);
  Intercept=1;
  output;
  if eof then do;
     do j=1 to dim(x);
     x[j]=λ
     output;
     drop j;
     x[j]=0;
  end;
  end;
run;

ods select none;
ods output InvXPX=S1;
proc reg data=SocEcon2  singular=1E-17;
     model y = Intercept &covars /noint  i;
run; quit;
ods select all;


As a side point, it is also of interests to compare the computational performance of both illustrated methods. SVD-based approach and SWEEP operator based approach are omitted.

Using a SAS data set of 1001 covariates (including Intercept term) and 1E6 observations, total 7.6GB on a windows PC, the test was done on two dramatically different machines: one WindowsXP laptop with mediocre HDD and a 2-core T7300 CPU; the other one is a high-end Server running Linux64 with Disk Array and fast CPUs totaled 16 cores.

On the PC,:
-> Using method 1.2, the time used decomposition is listed below:
PROC CORR:  real time: 25:47.72; CPU time: 15:26.15
DATA step on XTX: real time: 2.28 sec; CPU time: 0.07 sec
PROC REG :  real time: 15.45 sec; CPU time: 27.31 sec;
Total real time: 26:05.45

-> Using method  2, the time used is:
PROC REG: real time: 48:28.44; CPU time: 1:16:46.41

On the server:
 -> Using method 1.2, the time used decomposition is listed below:
PROC CORR: real time: 5:40.61; CPU time: 5:40.58
DATA step on XTX: real time: 0.05 sec; CPU time: 0.05 sec
PROC REG: real time 1.71 sec; CPU time: 5.27 sec
Total real time: 5:42.37

-> Using method 2, the time used is:
PROC REG: real time: 6:01.46; CPU time: 19.13.49

The performance is summarized below:

Thursday, August 18, 2011

Benchmark Regression Procedures using OLS Regression

Rick Wicklin discussed in his blog the performance in solving a linear system using SOLVE() function and INV() function from IML.

Since regression analysis is an integral part of SAS applications and there are many SAS procedures in SAS/STAT that are capable to conduct various regression analysis, it would be interesting to benchmark their relative performance using OLS regression, the fundamental regression analysis of all.

The analysis will compare REG, GLMSELECT, GENMOD, MIXED, GLIMMIX, GLM, ORTHOREG, HPMIXED and TRANSREG on 10 OLS regressions with 100 to 1000 variables, incremental at 100, and with the number of observations twice the number of variables to avoid possible numerical issues. HPMIXED uses sparse matrix techniques and will be put into great disadvantage in this comparison using large dense matrices. A macro wraps them together:



%macro wrap;
proc printto log='c:\testlog.txt';run;

%let t0=%sysfunc(datetime(), datetime.);
%let procnames=GLM REG GLMSELECT ORTHOREG MIXED GLIMMIX GENMOD ;
%let nproc=%sysfunc(countW(&procnames));
%put Test &nproc PROCEDURES;
%do i=1 %to 10;
    %let nobs=%sysevalf(&i*100);
    options nonotes;
    data _temp;
         array x{&nobs};
	 do i=1 to 2*&nobs;
	    do j=1 to &nobs;
	       x[j]=rannor(0);
            end;
	    y=rannor(0);
	    drop i j;
	    output;
	 end;		 
     run;
     options notes;
     sasfile _temp load;
     ods select none;

     %do j=1 %to &nproc;
         %let proc=%scan(&procnames, &j);
	 %put &proc;
	 proc &proc data=_temp;
	      model y = x1-x&nobs;
	 run;
    %end;
    %put TRANSREG;
    proc transreg data=_temp;
         model identity(y) = identity(x1-x&nobs);
    run;
    sasfile _temp close;
    ods select all;
%end;
proc printto; run;
%mend;
%wrap;

After running all iterations, the SAS log is parsed to obtain procedure names and corresponding real time and CPU time. The following SAS code does this job:



data proc_compare;
     infile "c:\testlog.txt";
	 input;
	 retain procedure ;
	 retain realtime cputime  realtime2 ; 
	 length procedure $12.;
	 length realtime  cputime $24.;
	 if _n_=1 then id=0;
	 x=_infile_;
	 if index(x, 'PROCEDURE')>0 then do;
	    procedure=scan(_infile_, 3);		
		if procedure="REG" then id+1;		
	 end;
	
	 if index(x, 'real time')>0 then do;
	    _t1=index(_infile_, 'real time');
	    _t2=index(_infile_, 'seconds');
	    if _t2=0 then _t2=length(_infile_);
            realtime=substr(_infile_, _t1+9, _t2-_t1-9);
	    if index(realtime, ':')>0 then do;
 	       realtime2=scan(realtime, 1, ':')*60;
	       sec=input(substr(realtime, index(realtime, ':')+1), best.);
	       realtime2=realtime2+sec;		 
	    end;
	    else realtime2=input(compress(realtime), best.);
	 end;
	 if index(x, 'cpu time')>0 then do;
	    _t1=index(_infile_, 'cpu time');
	    _t2=index(_infile_, 'seconds');
	    if _t2=0 then _t2=length(_infile_);
	    cputime=substr(_infile_, _t1+8, _t2-_t1-8);
	    if index(cputime, ':')>0 then do;
 	       cputime2=scan(cputime, 1, ':')*60;
	       sec=input(substr(cputime, index(cputime, ':')+1), best.);
	       cputime2=cputime2+sec;
	    end;
	    else cputime2=input(compress(cputime), best.);
	    keep id size  procedure cputime2 realtime2 ;
	    size=id*100;
	    if compress(procedure)^="PRINTTO" then output;
	end;
run;

We then visualize the results using the following code:


title "Benchmark Regression PROCs using OLS";
proc sgpanel data=proc_compare;
     panelby procedure /rows=2;
     series y=cputime2  x=size/ lineattrs=(thickness=2);
	 label cputime2="CPU Time (sec)"
	       size="Problem Size"
		   ;;	
	 colaxis grid;
	 rowaxis grid;
run;
title;

title "Closer Look on REG vs. GLM vs. GLMSELECT";
proc sgplot data=proc_compare  uniform=group;
     where procedure in ("GLMSELECT", "REG", "GLM");
     series x=size y=cputime2/group=procedure  curvelabel lineattrs=(thickness=2);
	 label cputime2="CPU Time (sec)"
	       size="# of Variables"
		   ;;
     yaxis grid ;
     xaxis grid ;
run;
title;






It is found that PROC GLM and GLMSELECT beat all other procedures with large margin while HPMIXED is the slowest followed by GLIMMIX. Surprisingly, REG is slower than both GLM and GLMSELECT even though it utilized multi-threading technique while GLMSELECT does not:

************ Partial LOG of the last iteration ********
NOTE: PROCEDURE REG used (Total process time):
real time 6.79 seconds
cpu time 9.36 seconds


NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 3.06 seconds
cpu time 2.96 seconds
********************************************************

The performance gap between REG and GLM/GLMSELECT is getting larger when the number of variables increases to be more than 700.

Both REG and GLMSELECT are developed by the same group of developers in SAS, as far as I know.

********************* PS : ****************************
Rick and Charlie pointed out that real time is a more fair measure, which I agree.

The reading of real computing time has large variance from run to run because the testing enviornment is not very clean and there are many background window programs running. Below is part of the log file of another run with 2000 variables and 4000 records:

NOTE: PROCEDURE REG used (Total process time):
      real time           2.26 seconds
      cpu time            7.76 seconds
     


NOTE: PROCEDURE GLM used (Total process time):
      real time           3.57 seconds
      cpu time            4.58 seconds


NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
      real time           3.50 seconds
      cpu time            3.44 seconds
     
We see that REG has lower real time comparing to GLM/GLMSELECT, even though cpu time is about twice the average of GLM/GLMSELECT. In a case where BY-processing is used, GLMSELECT will use multi-threading as specified in PERFORMANCE statement, and the gap in real time between REG and GLMSELECT will be eliminated. In a collaborating environment, more CPU time also means competing for more resources. Below we show the real time of the same run as in above CPU Time figure.


Note that CPU Time difference and pattern is pretty consistent.Below is the mean CPU Time and its 90% C.I. of 100 runs using REG /GLM /GLMSELECT on different size of problems.


Wednesday, August 10, 2011

Rolling Window Regression of Time Series

More often than not, we encounter a problem where an OLS over a rolling time window is required, see [1], [2], [3], [4], [5], [6], [7], for a few examples.

One solution is to resort to SAS MACRO, but it is extremely inefficient and can't handle large dataset in reality, [8]. This method is shown below as Method[4]. It couldn't finish the test in allowable time using the sample data below.

The other common solution is to use the BY-processing capability of PROC REG after re-shaping the data into appropriate format, see [9], [10]. This method is demonstrated as Method[3] below. While certainly much better than above one, it is still not the fastest and requires more memory.

The third solution comes to play by recognizing that in OLS, what you need is the SSCP and you can easily build up the SSCP over rolling time window by resorting to PROC EXPAND. This is demonstrated as Method[2] below. This approach will further improve the speed but still requires large amount of memory if the data is big and many rolling windows are generated.

Since what we need to do is to build the SSCP matrix and obtain the coefficient estimates based on the informaiton in SSCP, we can certainly code this in a DATA Step using ADJUST operator, which provides a solution that is both fast and low memory occupancy. See [11] for an introduction to ADJUST operator. To make this even faster, a modification of ADJUST operator, the SWEEP operator, can be used. For an introduction to SWEEP operator, see [11], [12]. In the code below, Method[0] implements the ADJUST operator, while Method[1] implements the SWEEP operator.

The experiment results are shown below:

                         Real Time    |        CPU Time          | Memory                   
=====================================================

Method 0 |    1.01 (seconds)   |    1.01 (seconds)        |    611K


Method 1 |    0.25 (seconds)   |    0.24 (seconds)        |    432K


Method 2 |    1.61 (seconds)   |    0.94 (seconds)        | 50381K


Method 3 |  80.54 (seconds)   |   79.61 (seconds)       |  2322K


Method 4 |         Failed           |           Failed               |    Failed
=====================================================

Reference:
[1]MYSAS.NET, http://www.mysas.net/forum/viewtopic.php?f=4&t=8070
[2]MYSAS.NET, http://www.mysas.net/forum/viewtopic.php?f=4&t=7898
[3]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0604D&L=sas-l&P=R32485
[4]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0704C&L=sas-l&P=R3305
[5]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0802C&L=sas-l&P=R9746
[6]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0801C&L=sas-l&P=R14671
[7]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0810A&L=sas-l&P=R19135
[8]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0802C&L=sas-l&P=R13489
[9]Michael D Boldin, "Programming Rolling Regressions in SAS", Proceedings of NESUG, 2007
[10]SAS-L, http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0604D&L=sas-l&D=0&P=56926
[11]J. H. Goodnight, "The Sweep Operator: Its Importance in Statistical Computing", SAS Tech Report R-106, 1978
[12]Kenneth Lange, "Numerical Analysis for Statisticians", Springer, 1998


proc datasets library=work kill; run;


options fullstimer;
data test;
     do seq=1 to 500000;
          x1=rannor(9347957);
          *x2=rannor(876769)+0.1*x1;
          epsilon=rannor(938647)*0.5;
          y = 1.5 + 0.5*x1 +epsilon;
          output;
     end;
run;

/* Method 0.*/
sasfile test load;
data res0;
        set test;
  array _x{3,3} _temporary_ ;
  array _a{3,3} _temporary_ ;
  array _tempval{5, 20} _temporary_ ;
  m=mod(_n_-1, 20)+1;
  _tempval[1, m]=x1; 
  _tempval[2, m]=y;
  _tempval[3, m]=x1**2;
  _tempval[4, m]=x1*y;
  _tempval[5, m]=y**2;

  link filler;
  if _n_>=20 then do;
        if _n_>20 then do; 
                   m2=mod(_n_-20, 20)+1;
       _x[1,2]+(-_tempval[1, m2]);
       _x[1,3]+(-_tempval[2, m2]);
       _x[2,2]+(-_tempval[3, m2]);
       _x[2,3]+(-_tempval[4, m2]);
       _x[3,3]+(-_tempval[5, m2]);
     end;
        do i=1 to dim(_a, 1);
         do j=1 to dim(_a, 2);
          _a[i, j]=_x[i, j];
      end;
     end;
            
              do k=1 to dim(_a, 1)-1;
         link adjust;
              end;
     Intercept=_a[1,3]; beta=_a[2,3];
     keep seq   intercept  beta;
     output;
  end;

  return;
filler:
   _x[1,1]=20; _x[1,2]+x1; _x[1,3]+y;
   _x[2,2]+_tempval[3,m];  _x[2,3]+_tempval[4,m]; _x[3,3]+_tempval[5,m];
   _x[2,1]=_x[1,2]; _x[3,1]=_x[1,3]; _x[3,2]=_x[2,3]; 
return;

adjust:
    B=_a[k, k];
 do j=1 to dim(_a, 2);
     _a[k, j]=_a[k, j]/B;
 end;
 do i=1 to dim(_a, 1);
     if i ^=k then do;
          B=_a[i, k];
    do j=1 to dim(_a, 2);
        _a[i, j]=_a[i, j]-B*_a[k, j];
    end;
  end;
 end;
return;

run;
sasfile test close;



/* Method 1.*/

sasfile test load;
data rest0;
        set test;
  array _x{4} _temporary_;
  array _a{2,20}  _temporary_;
  m=mod(_n_-1, 20)+1;
  _a[1, m]=x1; _a[2,m]=y;
  link filler;

  m2=mod(_n_-20, 20)+1;
  if _n_>=20 then do;
    if _n_>20 then do;
              link deduct;
    end;
    beta=(_x[2]-_x[1]*_x[4]/20)/(_x[3]-_x[1]**2/20);
    intercept=_x[4]/20 - beta*_x[1]/20;
    keep  seq   intercept  beta  ;
    output;
  end;
  return;       
filler:
     _x[1]+x1;
  _x[2]+x1*y;
  _x[3]+x1**2;
  _x[4]+y;
return;
deduct:
     _x[1]=_x[1]-_a[1,m2]; 
  _x[2]=_x[2]-_a[1,m2]*_a[2,m2];
  _x[3]=_x[3]-_a[1,m2]**2;
  _x[4]=_x[4]-_a[2,m2];
return;
run;
sasfile test close;



/* Method 2.*/

%macro wrap;
%let window=20;
%let diff=%eval(&window-0);
data testv/view=testv;
     set test;
       xy=x1*y;  
run;

proc expand data=testv  method=none  out=summary(keep=seq sumxy  sumx1  sumy  ussx1  may  max);
       convert  x1=sumx1/transformout=(movsum &diff);
       convert  xy=sumxy/transformout=(movsum &diff);
       convert  x1=ussx1/transformout=(movuss &diff);
       convert  y =sumy /transformout=(movsum &diff);
       convert  y =may / transformout=(movave &diff);
       convert  x1 =max / transformout=(movave &diff);  
run;

data result1;
     set summary(firstobs=&window);
       beta = (sumxy - sumx1*sumy/&window)/(ussx1 - sumx1/&window.*sumx1);  
       alpha= may - beta*max;
       keep seq  beta  alpha; 
run;
%mend;

%let t0=%sysfunc(datetime(), datetime24.);
*options nosource nonotes;
%wrap;
options source notes;
%let t1=%sysfunc(datetime(), datetime24.);
%put Start @ &t0;
%put End   @ &t1;

 

/* Method 3.*/
%let t0=%sysfunc(datetime(), datetime.);
 
data test2v/view=test2v;
       set test;
       array _x{2, 20} _temporary_ (20*0 20*0);
       k=mod(_n_-1, 20)+1;
       _x[1, k]=x1; _x[2, k]=y;
       if _n_>=20 then do;
          do j=1 to dim(_x, 2);
               x=_x[1, j]; y=_x[2, j];
               output;
               keep seq x y;
            end;
       end;
run;

ods select none;
proc  reg data=test2v  outest=res2(keep=seq x intercept);
         by seq;
         model y = x;
run;quit;
ods select all;

%let t1=%sysfunc(datetime(), datetime.);
%put Start @ &t0;
%put End   @ &t1;


/* Method 4. */
%macro wrap;
options nonotes;
ods select none;
%do i=20 %to 500000;
       %let fo=%eval(&i-19);
       proc reg data=test(firstobs=&fo  obs=&i)  outest=_xres(keep=x1 intercept);
           model y =x1;
       run;quit;
      %if %eval(&i=20) %then %do;
          data res3; set _xres; run;
      %end;
      %else %do;
        proc append base=res3  data=_xres; run;
      %end;
%end;

ods select all;
data res3;
       set res3;
       time=19+_n_;
run;
options notes;
%mend;

%let t0=%sysfunc(datetime(), datetime.);
%wrap;
%let t1=%sysfunc(datetime(), datetime.);
%put Start @ &t0;
%put End   @ &t1;

Tuesday, June 21, 2011

Numerical variables profiling in very large data set

Profiling numerical variables is an integral part of data analytics, which generally consists of obtaining standard descriptive statistics such as quantiles, first central moments as well as missing ratio.

It is easily obtainable by using PROC MEANS (or PROC SUMMARY). But when we face very large data with many varibles, we will hit memory wall and very long processing time using default options in PROC MEANS. The most time consuming and memory intensive descriptive statistics is quantile calculation. The default method uses ordered statistics, which is the most accurate but also the most memory intensive and time consuming one.

There are two methods available to handle this scenario, both uses the one-pass method of Jain R. and Chlamtac I [1]. This method is able to obtain fairly accurate estimate of quantiles between P25 and P75, but for quantiles outside this range, the estimates will be more rough but given very large sample, maybe acceptable. Another problem is that this estimator is sensitive to the distribution of underlying data [2]. There are newer methods available that are insensitive to the distribution, such as [3].

In PROC MEANS, we can specify QMETHOD=P2 or QMETHOD=HIST to call this one pass method. Or we can use PROC STDIZE, specifying PCTLMTD=ONEPASS to invoke P2 method to calculate quantiles. Each has its pro and con.

PROC MEANS:
PRO: Multithreaded / rich set of descriptive statistics;
CON: Only pre-defined quantiles, output data set is not user friendly, in need of further manipulation;

PROC STDIZE:
PRO: quantiles of 0 to 100 are available, statistics output data is in a user friendly format;
CON: No multithreads, lack of higher central moments statistics, need to surpress data output explicitly;

Of couse, due to the way the output statistics are organized from PROC STDIZE OUTSTAT=, it can be parallelized very easily, and by examining the actual CPU time, PROC STDIZE is more efficient than PROC MEANS. Besides, higher central moments can be calculated from obtained first 2 central moments and this is done in the small output data set.

Below the difference between default method and the two new approaches is illustrated.


options fullstimer;
data test;
        length id 8;
  array x{100} ;
  do id=1 to 1e5;
      do j=1 to dim(x); x[j]=rannor(0)*1.2; end;
   output;
   drop j;
  end;
run;


proc means data=test noprint q1  qmethod=os;
        var x1-x100;
  output  out=_mean
             mean=mean1-mean100
                   std=std_x1-std_x100
                   q1=q1_x1-q1_x100
                   p95=p95_x1-p95_x100;
run;

proc means data=test noprint q1  qmethod=p2   qmarkers=211;
        var x1-x100;
  output  out=_mean
             mean=mean1-mean100
                   std=std_x1-std_x100
                   q1=q1_x1-q1_x100
                   p95=p95_x1-p95_x100;
run;

ods select none;
proc stdize data=test  
                   out=_null_  outstat=_stat  pctlmtd=onepass  
       nmarkers=211
                   pctlpts=1 5  10  25 50  75  90  95   99;
         var x1-x100;
run;
ods select all;

Take a look at the log:


Due to smaller sample size, the most gain is on the memory side. Default approach of PROC MEANS used 643MB memory, which specifying QMETHOD=P2, the memory usage reduced to only 7.3MB. The most memory efficient is PROC STDIZE with PCTLMTD=ONEPASS, only 0.68MB memory was used.

We also examine the difference on quantile estimates using the Ordered Statistics (OS) method and P2 method in PROC MEANS. There are observable differences but not that significant:




Reference:
[1]. Jain R. and Chlamtac I. (1985), "The Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations," Communications of the ACM, 28(10), 1076–1085.
[2]. SAS/STAT(R) 9.2 User's Guide, Second Edition
[3]. Alsabti, Khaled; Ranka, Sanjay; and Singh, Vineet (1997), "A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data". L.C. Smith College of Engineering and Computer Science - Former Departments, Centers, Institutes and Projects. Paper 4. http://surface.syr.edu/lcsmith_other/4

Tuesday, June 07, 2011

%HPGLIMMIX macro on large scale HMM

PROC GLIMMIX is good tool for generalized linear mixed model (GLMM), when the scale is small to medium. When facing a large scale GLMM, such as modeling all ZIPs nested in Counties nested in all 51 States in US, a 64-bit machine with extremely large memory is required and the computing may last for months! In a strictly nested hierarchical model, the variance covariance matrix is very sparse, and taking advantage of this property can accelerate computing by many folds.

The %HPGLIMMIX SAS macro is made for large scale Hierarchical Mixed Models. As an example, a sample data using Gamma Regression is shown below, with all ZIPs in AK, AL, AR, AZ with 2-level hierarchies: State and ZIP within State, total 4 blocks with max 693 columns per block. The reason not all ZIPs and all states are used is simply because PROC GLIMMI blows up on the machine.

Copmaring the estimates and std errors from both runs, they are the same, but drastically different running time of 71sec using %HPGLIMMIX v.s. 35min39sec using GLIMMIX.




1715
1716  options nomprint nomlogic;
1717  %hpglimmix(data=temp2,
1718             stmts=%str(
1719                    class zip  zip_state;
1720                    model y = x ;
1721                    random int zip/subject=zip_state;
1722                      ),
1723             error=gamma,
1724             link=LOG,
1725             options=NOTEST);
NOTEST

       The HPGLIMMIX Macro

Data Set           : WORK.TEMP2
Error Distribution : GAMMA
Link Function      : LOG
Response Variable  : Y


Job Starts at : 06JUN2011:15:51:19
    HPGLIMMIX Iteration History

Iteration    Convergence criterion
    1            0.0081058432  13 sec
    2            0.0004213646  13 sec
    3            2.7137935E-7  13 sec
    4            3.1854799E-9  12 sec

Output from final Proc HPMixed run:
Job Ends at : 06JUN2011:15:52:30
1726  options nomprint nomlogic;
1727
1728  proc glimmix data=temp2;
1729       class zip  zip_state;
1730       model y = x /s  dist=gamma;
1731       random int zip /subject=zip_state;
1732  run;



NOTE: Convergence criterion (PCONV=1.11022E-8) satisfied.
NOTE: PROCEDURE GLIMMIX used (Total process time):
      real time           35:38.93
      cpu time            34:30.90


GLIMMIX output:

              Covariance Parameter Estimates

                                                  Standard
             Cov Parm     Subject      Estimate     Error

             Intercept    zip_state    0.000152    0.000125
             zip          zip_state    0.000066    3.105E-6
             Residual                  0.000405    2.281E-6

             Solutions for Fixed Effects

                                Standard
     Effect       Estimate       Error       DF    t Value    Pr > |t|

    Intercept      6.5873    0.006180        3    1065.95      <.0001
    x              0.003436    0.000218    62634      15.79      <.0001




%HPGLIMMIX output:

                _cov data from  HPGLIMMIX

               Obs    CovParm       Subject     Estimate

                1     Intercept    zip_state    0.000152
                2     zip          zip_state    0.000066
                3     Residual                  0.000405

               _soln data from HPGLIMMIX

               Obs    Effect       Estimate      StdErr

                1     Intercept      6.5873      0.006180
                2     x              0.003436    0.000218

Sunday, April 10, 2011

Regularized Discriminant Analysis

Demo SAS implementation of Regularized (Linear) Discriminate Analysis of J. Friedman (1989)[1]. Simpler introduction can be found at [2]. Regularized QDA follows similarly.

To save coding, I called R within SAS to finish the computation. For details to see how to call R within SAS, check here.








Reference:
1. Friedman, J. (1989). Regularized discriminant analysis, Journal of the American Statistical Association 84: 165-175.
2. Friedman, J; Hastie, T; Tibshirani, R (2008). The Elements of Statistical Learning, section 4.3.1

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)