Friday, July 30, 2010

An Economic Approach for a Class of Dimensionality Reduction Techniques

Just back from KDD2010. Several papers at the conference interested me.

On the computation side, Liang Sun et al.'s paper [1], "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques", caught my eye. Sun proves that a class of dimensionality reduction techniques that rely on generalized eigenvalue decomposition, such as CCA, OPLS, and LDA, can be computed much more cheaply by decomposing the original computation into a least-squares problem and a much smaller-scale eigenvalue decomposition problem. The equivalence of their two-stage approach and direct eigenvalue decomposition is rigorously proved.
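The equivalence is easy to check numerically. Below is a minimal sketch in Python/NumPy (illustrative only, not the authors' code, and `canonical_corrs` is a helper written for this demo): the canonical correlations between the features X and the class-indicator matrix Y are unchanged when X is replaced by the least-squares fit of Y on X, which is a far smaller matrix to analyze.

```python
import numpy as np

def canonical_corrs(X, Y, tol=1e-10):
    """Canonical correlations between the columns of X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux, Uy = Ux[:, sx > tol], Uy[:, sy > tol]   # drop rank-deficient directions
    return np.linalg.svd(Ux.T @ Uy, compute_uv=False)

rng = np.random.default_rng(0)
n, p, k = 300, 10, 3
X = rng.standard_normal((n, p))
Y = np.eye(k)[rng.integers(0, k, n)]            # class-indicator matrix

# Stage 1: least-squares regression of the indicators on X (with intercept)
X1 = np.column_stack([np.ones(n), X])
B = np.linalg.lstsq(X1, Y, rcond=None)[0]
Yhat = X1 @ B                                   # n x k fitted values

# Direct canonical analysis of X vs. the two-stage version on Yhat
direct = canonical_corrs(X, Y)
two_stage = canonical_corrs(Yhat, Y)
print(np.allclose(direct, two_stage))           # -> True
```

This is exactly what the SAS code below does with PROC GLMMOD, PROC REG, and PROC CANDISC.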

This technique is of particular interest to people like me who have only limited computing resources, and I believe it is worthwhile to implement their algorithm in SAS. For example, a Canonical Discriminant Analysis using the above idea is demonstrated below. Note also that by specifying the RIDGE= option in PROC REG, the regularized version can be implemented as well; in addition, PROC REG is multi-threaded in SAS. Of course, the computational advantage is only appreciable when the number of features is very large.

The canonical analysis result from the reduced version of PROC CANDISC is the same as that from the full version.

In fact, this exercise is the answer to Exercise 4.3 of The Elements of Statistical Learning [2].

[1] Liang Sun, Betul Ceran, and Jieping Ye, "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques", KDD 2010, Washington, DC.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", 2nd Edition, Springer Series in Statistics.





   proc format; 
      value specname 
         1='Setosa    ' 
         2='Versicolor' 
         3='Virginica '; 
   run; 
 
   data iris; 
      title 'Fisher (1936) Iris Data'; 
      input SepalLength SepalWidth PetalLength PetalWidth 
            Species @@; 
      format Species specname.; 
      label SepalLength='Sepal Length in mm.' 
            SepalWidth ='Sepal Width in mm.' 
            PetalLength='Petal Length in mm.' 
            PetalWidth ='Petal Width in mm.'; 
      symbol = put(Species, specname10.); 
      datalines; 
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3 
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2 
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2 
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3 
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3 
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3 
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2 
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3 
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1 
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1 
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1 
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2 
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1 
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1 
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1 
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1 
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1 
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1 
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3 
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1 
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3 
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1 
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1 
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1 
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2 
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2 
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1 
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3 
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1 
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2 
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3 
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3 
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3 
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1 
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1 
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1 
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2 
   63 33 60 25 3 53 37 15 02 1 
   ; 
   proc candisc data=iris out=outcan distance anova; 
      class Species; 
      var SepalLength SepalWidth PetalLength PetalWidth; 
   run;
 
  ods select none;
  /* build the class-indicator (design) matrix for Species */
  proc glmmod data=iris  outdesign=H(keep=COL:);
           class  Species;
     model SepalLength=Species/noint;
  run;  

  data H;
          merge H   iris;
  run;

/**************************
For efficiency, a view can be used instead:
data H/view=H;
     set iris;
     array _S{*} Col1-Col3;
     do j=1 to dim(_S); _S[j]=0; end;
     _S[Species]=1;
     drop j;
run;
****************************/
  /* Stage 1: least-squares fit of the class indicators on the features */
  proc reg data=H  outest=beta;
          model Col1-Col3 = SepalLength SepalWidth PetalLength PetalWidth;
    output   out=P  p=yhat1-yhat3;
  run;quit;
  ods select all;


  /* Stage 2: canonical analysis on the (3-column) fitted values */
  proc candisc  data=P;
          class Species;
    var   yhat1-yhat3;
  run;

Monday, July 12, 2010

Implement Randomized SVD in SAS

In the 2010 SASware Ballot®, a dedicated PROC for Randomized SVD was among the options. While an official SAS PROC will not be available in the immediate future, nor in older SAS releases, it is fairly simple to implement this algorithm using existing SAS/STAT procedures.

Randomized SVD is useful for large-scale, high-dimensional data mining problems, for instance text mining. In SAS/Base and SAS/STAT, the lack of sparse matrix operations puts any serious text mining task, such as LSI or NMF, at the edge of infeasibility. Randomized SVD provides an economical alternative by sacrificing a little accuracy, with error bounds established for the three sampling schemes proposed by the authors [1]. The code below demonstrates sampling scheme 1, which samples rows with probability proportional to their squared Euclidean norms.
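To make the linear-algebra steps concrete before walking through the SAS pipeline, here is a hedged NumPy sketch of sampling scheme 1 (illustrative only; `sampled_svd` is a helper written for this demo, not the paper's or the SAS code's exact mechanics): draw s rows with probability proportional to their squared norms, rescale so the sample is unbiased, then solve the much smaller SVD.

```python
import numpy as np

def sampled_svd(A, s, k, rng):
    """Approximate the top-k singular values/right singular vectors of A by
    sampling s rows with probability proportional to their squared norms."""
    norms2 = np.einsum('ij,ij->i', A, A)        # squared row norms
    p = norms2 / norms2.sum()                   # sampling probabilities
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    S = A[idx] / np.sqrt(s * p[idx])[:, None]   # rescale so E[S'S] = A'A
    _, sing, Vt = np.linalg.svd(S, full_matrices=False)
    return sing[:k], Vt[:k]

rng = np.random.default_rng(1)
# a tall matrix with a strongly low-rank signal plus small noise
U = np.linalg.qr(rng.standard_normal((2000, 5)))[0]
V = np.linalg.qr(rng.standard_normal((300, 5)))[0]
A = (U * [100.0, 80.0, 60.0, 40.0, 20.0]) @ V.T \
    + 0.01 * rng.standard_normal((2000, 300))

approx, _ = sampled_svd(A, s=256, k=5, rng=rng)
exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.abs(approx - exact) / exact)   # leading values are recovered closely
```

The SAS code below carries out the same steps without any matrix language: PROC FASTCLUS computes row norms, PROC SURVEYSELECT does the PPS sampling, and PROC PRINCOMP solves the small eigenproblem.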



/* Randomized SVD with sampling schema 1. */
%let dim=2048;
%let nobs=1e4;
%let s=256;
data matrix;
     array _x{*} x1-x&dim;
     do id=1 to &nobs;
     do _j=1 to dim(_x); _x[_j]=sin(mod(id, _j))+rannor(id); end;
  output;
  drop _j;
  end;
run;

%let datetime_start = %sysfunc(TIME()) ;
%let time=%sysfunc(datetime(), datetime.); %put &time;
/* an all-zero "seed" point: the DISTANCE from PROC FASTCLUS below
   is then exactly the Euclidean norm of each row */
data seed;
     array _x{*} x1-x&dim;
     do _j=1 to dim(_x); _x[_j]=0; end;
  output;
  stop;
run;

proc fastclus data=matrix  seed=seed      out=norm(keep=ID DISTANCE)
              maxiter=0    maxclusters=1  noprint  replace=none;
     var x1-x&dim;
run;
data normv/ view=normv;
     set norm(keep=DISTANCE);
     DISTANCE2=DISTANCE**2;
     drop DISTANCE;
run;
proc means data=normv noprint;
     var DISTANCE2;
     output  out=matrixnorm  sum(DISTANCE2)=Frobenius_sqr;
run;
/* sampling probability for each row: squared row norm over the
   squared Frobenius norm of the whole matrix */
data prob;
     set matrixnorm;
  retain Frobenius_sqr;
  do until (eof);
     set norm  end=eof;
     _rate_=DISTANCE**2/Frobenius_sqr;
     keep ID _rate_;
     output;
  end;
run;

data matrixv/view=matrixv;
     merge matrix  prob(keep=_rate_);
run;

ods select none;
/* PPS sampling with replacement: draw &s rows according to _rate_ */
proc surveyselect data=matrixv  out=matrixsamp(drop=SamplingWeight  ExpectedHits  NumberHits _rate_)  
                   sampsize=&s  method=pps_wr   outhits  ;
     size _rate_;
run;
ods select all;

proc transpose data=matrixsamp  out=matrixsamp;
     var x1-x&dim;
run;

/* the much smaller eigenproblem on the &s x &s covariance of the sample */
proc princomp data=matrixsamp  outstat=testv(where=(_type_ in ("USCORE")))  
              noint  cov  noprint;
  var col1-col&s;
run;
data testV_t/view=testV_t;
     retain _TYPE_ 'PARMS';
  set testv(drop=_TYPE_);
run;     

proc score data=matrixsamp   score=testV_t  type=parms  
           out=SW(keep=ID Prin:);
    var   col1-col&s;
run;

data seed;
     array _s{*}  prin1-prin&s;
  do _j=1 to dim(_s); _s[_j]=0; end;
  drop _j; output; stop;
run;

proc fastclus data=SW   seed=seed  maxiter=0  maxc=1  replace=NONE   out=SW2(drop=CLUSTER)  noprint;
     var prin1-prin&s;
run;

data HHT;
     set SW2;
  array _x{*}  prin1-prin&s;
  do _j=1 to dim(_x); _x[_j]=(_x[_j]/distance)**2; end;
  drop _j  distance;
run;

proc transpose data=HHT  out=HHT2(drop=_LABEL_);      
run;
data HHT2; 
     _TYPE_='PARMS';
     set HHT2; 
  rename COL1-COL&dim=x1-x&dim;
run;

proc score data=matrix   score=HHT2  type=parms  out=P(drop=x:);
     var x:;
run;     

%let time=%sysfunc(datetime(), datetime.); %put &time;
%put PROCESSING TIME:  %sysfunc(putn(%sysevalf(%sysfunc(TIME())-&datetime_start.),mmss.)) (mm:ss) ;
options notes source;


Reference:
[1] P. Drineas and M. W. Mahoney, "Randomized Algorithms for Matrices and Massive Data Sets", Proc. of the 32nd Annual Conference on Very Large Data Bases (VLDB), p. 1269, 2006.