SAS Programming for Data Mining: Benchmark Regression Procedures using OLS Regression

In a post on his blog, Rick Wicklin compared the performance of the SOLVE() and INV() functions in SAS/IML for solving a linear system.
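For readers who have not seen that comparison, here is a minimal sketch of the idea. It is not Rick's code: the matrix size, the seed, and the variable names are assumptions chosen for illustration, and it requires SAS/IML.

```sas
proc iml;
    call randseed(1);                          /* arbitrary seed, for repeatability */
    n = 1000;                                  /* assumed size of the linear system */
    A = j(n, n);  call randgen(A, "Normal");   /* random dense coefficient matrix   */
    b = j(n, 1);  call randgen(b, "Normal");   /* random right-hand side            */

    t0 = time();
    x1 = solve(A, b);                          /* solve A*x = b directly            */
    tSolve = time() - t0;

    t0 = time();
    x2 = inv(A) * b;                           /* form the inverse, then multiply   */
    tInv = time() - t0;

    maxDiff = max(abs(x1 - x2));               /* the two answers should agree      */
    print tSolve tInv maxDiff;
quit;
```

The two timings are wall-clock seconds from TIME(), so they are subject to the same run-to-run noise as any other real-time measurement.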

Regression analysis is an integral part of SAS applications, and many SAS/STAT procedures are capable of conducting various kinds of regression analysis. It is therefore interesting to benchmark their relative performance on OLS regression, the most fundamental regression analysis of all.

The analysis compares REG, GLMSELECT, GENMOD, MIXED, GLIMMIX, GLM, ORTHOREG, HPMIXED and TRANSREG on 10 OLS regressions with 100 to 1000 variables, in increments of 100, with the number of observations set to twice the number of variables to avoid possible numerical issues. HPMIXED uses sparse matrix techniques and is therefore at a great disadvantage in this comparison, which uses large dense matrices. A macro wraps them together:



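%macro wrap;
/* Send the log to a file; NEW overwrites any log left by an earlier run */
proc printto log='c:\testlog.txt' new; run;

%let t0=%sysfunc(datetime(), datetime.);
/* REG goes first because the log parser starts a new iteration at each REG step */
%let procnames=REG GLM GLMSELECT ORTHOREG MIXED GLIMMIX GENMOD HPMIXED;
%let nproc=%sysfunc(countw(&procnames));
%put Test &nproc PROCEDURES;
%do i=1 %to 10;
    /* nobs holds the number of regressors; the data set has 2*nobs rows */
    %let nobs=%sysevalf(&i*100);
    options nonotes;
    data _temp;
        array x{&nobs};
        do i=1 to 2*&nobs;
            do j=1 to &nobs;
                x[j]=rannor(0);
            end;
            y=rannor(0);
            drop i j;
            output;
        end;
    run;
    options notes;
    /* Hold the data in memory so that I/O does not distort the timings */
    sasfile _temp load;
    ods select none;

    %do j=1 %to &nproc;
        %let proc=%scan(&procnames, &j);
        %put &proc;
        proc &proc data=_temp;
            model y = x1-x&nobs;
        run;
    %end;
    /* TRANSREG needs its own MODEL syntax, so it runs outside the loop */
    %put TRANSREG;
    proc transreg data=_temp;
        model identity(y) = identity(x1-x&nobs);
    run;
    sasfile _temp close;
    ods select all;
%end;
proc printto; run;
%mend;
%wrap;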

After all iterations have run, the SAS log is parsed to extract each procedure name together with its real time and CPU time. The following SAS code does this job:



data proc_compare;
    infile "c:\testlog.txt";
    input;
    retain procedure;
    retain realtime cputime realtime2;
    length procedure $12.;
    length realtime cputime $24.;
    if _n_=1 then id=0;
    x=_infile_;

    /* "NOTE: PROCEDURE <name> used ...": the third word is the procedure name */
    if index(x, 'PROCEDURE')>0 then do;
        procedure=scan(_infile_, 3);
        /* Each REG step advances the iteration counter */
        if procedure="REG" then id+1;
    end;

    /* Real time is logged either as "s.ss seconds" or as "m:ss.ss" */
    if index(x, 'real time')>0 then do;
        _t1=index(_infile_, 'real time');
        _t2=index(_infile_, 'seconds');
        /* No "seconds" suffix: read to the end of the line, last character included */
        if _t2=0 then _t2=length(_infile_)+1;
        realtime=substr(_infile_, _t1+9, _t2-_t1-9);
        if index(realtime, ':')>0 then do;
            realtime2=scan(realtime, 1, ':')*60;
            sec=input(substr(realtime, index(realtime, ':')+1), best.);
            realtime2=realtime2+sec;
        end;
        else realtime2=input(compress(realtime), best.);
    end;

    /* CPU time follows real time in the log, so one row is written per procedure here */
    if index(x, 'cpu time')>0 then do;
        _t1=index(_infile_, 'cpu time');
        _t2=index(_infile_, 'seconds');
        if _t2=0 then _t2=length(_infile_)+1;
        cputime=substr(_infile_, _t1+8, _t2-_t1-8);
        if index(cputime, ':')>0 then do;
            cputime2=scan(cputime, 1, ':')*60;
            sec=input(substr(cputime, index(cputime, ':')+1), best.);
            cputime2=cputime2+sec;
        end;
        else cputime2=input(compress(cputime), best.);
        keep id size procedure cputime2 realtime2;
        size=id*100;
        if compress(procedure)^="PRINTTO" then output;
    end;
run;

We then visualize the results using the following code:



title "Benchmark Regression PROCs using OLS";
proc sgpanel data=proc_compare;
    panelby procedure / rows=2;
    series y=cputime2 x=size / lineattrs=(thickness=2);
    label cputime2="CPU Time (sec)"
          size="Problem Size";
    colaxis grid;
    rowaxis grid;
run;
title;

title "Closer Look on REG vs. GLM vs. GLMSELECT";
proc sgplot data=proc_compare uniform=group;
    where procedure in ("GLMSELECT", "REG", "GLM");
    series x=size y=cputime2 / group=procedure curvelabel lineattrs=(thickness=2);
    label cputime2="CPU Time (sec)"
          size="# of Variables";
    yaxis grid;
    xaxis grid;
run;
title;

PROC GLM and GLMSELECT beat all other procedures by a large margin, while HPMIXED is the slowest, followed by GLIMMIX. Surprisingly, REG is slower than both GLM and GLMSELECT even though it uses multi-threading while GLMSELECT does not:

************ Partial LOG of the last iteration ********
NOTE: PROCEDURE REG used (Total process time):
real time 6.79 seconds
cpu time 9.36 seconds

NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 3.06 seconds
cpu time 2.96 seconds
********************************************************

The performance gap between REG and GLM/GLMSELECT widens once the number of variables exceeds 700.

Both REG and GLMSELECT are developed by the same group of developers in SAS, as far as I know.

********************* PS : ****************************
Rick and Charlie pointed out that real time is a fairer measure, and I agree.

Real-time readings vary widely from run to run because the testing environment is not very clean: many background Windows programs are running. Below is part of the log file of another run with 2000 variables and 4000 records:

NOTE: PROCEDURE REG used (Total process time):
      real time           2.26 seconds
      cpu time            7.76 seconds


NOTE: PROCEDURE GLM used (Total process time):
      real time           3.57 seconds
      cpu time            4.58 seconds

NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
      real time           3.50 seconds
      cpu time            3.44 seconds

We see that REG has a lower real time than GLM/GLMSELECT, even though its CPU time is about twice the average of GLM/GLMSELECT. When BY-processing is used, GLMSELECT will use multi-threading as specified in the PERFORMANCE statement, and the gap in real time between REG and GLMSELECT will be eliminated. In a shared environment, more CPU time also means competing for more resources. Below we show the real time of the same run as in the CPU time figure above.
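The BY-processing case mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the timings in this post: the data set _stack, the BY variable rep, and the sizes (8 groups, 100 regressors, 200 rows per group) are all hypothetical. SELECTION=NONE is specified so that the full model is fitted in every BY group; without it GLMSELECT applies its default selection method.

```sas
/* Hypothetical stacked data: 8 independent regressions identified by REP */
data _stack;
    array x{100};
    do rep = 1 to 8;
        do i = 1 to 200;
            do j = 1 to 100;
                x[j] = rannor(0);
            end;
            y = rannor(0);
            output;
        end;
    end;
    drop i j;
run;

proc glmselect data=_stack;
    by rep;                                      /* data are already ordered by rep      */
    model y = x1-x100 / selection=none;          /* fit the full OLS model in each group */
    performance threads cpucount=actual details; /* request threading and a timing table */
run;
```

The DETAILS option asks the procedure to report where its time was spent, which helps when checking whether threading made any difference on a given machine.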

Note that the CPU time difference and pattern are quite consistent. Below are the mean CPU time and its 90% C.I. over 100 runs using REG/GLM/GLMSELECT on problems of different sizes.
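A summary of this kind could be computed along the following lines. The sketch assumes a hypothetical data set proc_compare_runs that stacks the parsed results of all 100 runs, with the same procedure, size, and cputime2 variables as proc_compare; the output names cpu_ci, mean_cpu, lower90, and upper90 are likewise made up for illustration.

```sas
/* ALPHA=0.10 gives 90% confidence limits for the mean */
proc means data=proc_compare_runs mean lclm uclm alpha=0.10 nway;
    where procedure in ("REG", "GLM", "GLMSELECT");
    class procedure size;
    var cputime2;
    output out=cpu_ci mean=mean_cpu lclm=lower90 uclm=upper90;
run;
```

The resulting cpu_ci data set has one row per procedure and problem size, which is a convenient shape for plotting the mean together with its confidence band.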

5 comments:

CHARLIE HUANG said...: Beautiful post to compare a number of GLM procedures in SAS.

The conclusion seems like: the older the procedure is, the more efficient it is. ^-^ Let’s go use PROC REG; 11:20 AM, August 19, 2011
Rick Wicklin said...: I don't think you want to be comparing the CPU times: multithreaded code uses more CPU, but has reduced REAL times because it uses multiple CPUs concurrently.; 11:38 AM, August 19, 2011
Liang Xie said...: Thanks for the comments.

Charlie and Rick, you are both right that in terms of real time, REG is still better, and outperforms GLMSELECT by about 35%.

REG is particularly handy when conducting Ridge Regression and various other analyses.

On the other hand, GLMSELECT will probably outperform REG when BY-processing is used, in which case GLMSELECT will utilize multiple cores in computing.

I have added information about real time.; 12:09 AM, August 20, 2011
wei said...: i think you have to separate those proc that rely on matrix inversion/multiplication (proc reg) and those use iterative methods to maximize likelihood (proc mixed).; 4:35 PM, August 21, 2011
Liang Xie said...: Wei, you are absolutely right, however, this post is simply an echo to Rick's post and only for illustration purpose.

Thank you for the clarification, though.; 12:09 PM, August 27, 2011


Thursday, August 18, 2011
