Wednesday, November 16, 2011

Using PROC CANCORR to solve large-scale PLS problems



Partial Least Squares (PLS) is a powerful tool for discriminant analysis with a large number of predictors [1].

PLS extracts latent factors that maximize the covariance between the independent variables and the dependent variables. This process is equivalent to solving the following generalized eigenvalue problem [2]:
$$X'HXw = \phi X'Xw$$
Here the rows of $$X$$ and $$Y$$ are observations, so $$H$$ is an n-by-n matrix. For PLS (in the orthonormalized form analyzed in [2]), $$H = YY'$$. Canonical Correlation Analysis (CCA) leads to the same generalized eigenvalue problem with $$H = Y(Y'Y)^{-1}Y'$$.

In SAS, PROC PLS implements two forms of PLS, namely the original NIPALS [3] and SIMPLS [4]. When there is only one dependent variable, the two algorithms generate the same output. PLS is computationally very demanding. When the dimension of the problem becomes very large, PROC PLS runs into issues such as insufficient memory and very long computing times. There is a remedy when only one dependent variable Y is present. In this case the inverse term in the CCA form of $$H$$ reduces to a scalar, so the CCA and PLS problems differ only by a fixed scale factor. Therefore we can use PROC CANCORR, which is very scalable and multithreaded, to solve the PLS problem. The obtained weights and loadings will not be identical to those from PROC PLS, but they differ only by a fixed scale factor.
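The single-response equivalence can be checked numerically on a small problem. The following is a minimal sketch in Python with NumPy, not SAS, and it does not call or reproduce PROC PLS or PROC CANCORR. It solves the generalized eigenvalue problem from [2] directly, assuming observations are stored in rows so that $$H$$ is $$yy'$$ for PLS and $$yy'/(y'y)$$ for CCA. The sizes, the random data, and the helper name `leading_weight` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 8                          # small illustrative sizes, not 100K x 5000
X = rng.standard_normal((n, p))
y = rng.standard_normal((n, 1))        # a single dependent variable

# Center both blocks so that cross-products act like covariances.
X -= X.mean(axis=0)
y -= y.mean(axis=0)


def leading_weight(X, H):
    """Return (phi, w) for the largest phi in X'HX w = phi X'X w (hypothetical helper)."""
    A = X.T @ H @ X
    B = X.T @ X
    # Whiten with the Cholesky factor B = L L' to get a symmetric problem.
    L_inv = np.linalg.inv(np.linalg.cholesky(B))
    vals, vecs = np.linalg.eigh(L_inv @ A @ L_inv.T)   # eigenvalues ascending
    w = L_inv.T @ vecs[:, -1]                          # map back to the weight vector
    return vals[-1], w / np.linalg.norm(w)


yty = (y.T @ y).item()                 # scalar because there is only one y
H_pls = y @ y.T                        # PLS form of H
H_cca = y @ y.T / yty                  # CCA form of H: y (y'y)^{-1} y'

phi_pls, w_pls = leading_weight(X, H_pls)
phi_cca, w_cca = leading_weight(X, H_cca)

# Closed-form direction for the rank-one case: (X'X)^{-1} X'y.
w_closed = np.linalg.solve(X.T @ X, X.T @ y).ravel()
w_closed /= np.linalg.norm(w_closed)

cos_sim = abs(w_pls @ w_cca)           # 1 means the same direction up to sign
ratio = phi_pls / phi_cca              # should equal y'y

print("cosine between PLS and CCA weights:", cos_sim)
print("eigenvalue ratio:", ratio, " y'y:", yty)
```

Under these assumptions both problems return the same weight direction, proportional to $$(X'X)^{-1}X'y$$, and the eigenvalues differ by the factor $$y'y$$. Any normalization applied by a particular procedure only changes the scale.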

In the following log, we demonstrate the behavior of PROC PLS and PROC CANCORR on a server with 4GB of accessible memory when the number of independent variables is 5000 and the sample size is 100K. PROC PLS reported insufficient memory and stopped after about 45 seconds, having exhausted all accessible memory, while PROC CANCORR ran to completion in slightly more than 7 minutes. Both procedures used all of the 3.92GB of memory available to SAS from the system. Also note that PROC CANCORR used more than 33 minutes of CPU time against about 7 minutes of real time, indicating that it scales well in a multi-core environment.
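For a sense of scale, the sizes involved can be worked out by simple arithmetic. The sketch below (in Python) only multiplies out the dimensions given above, assuming 8-byte numeric values as declared in the DATA step. Whether either procedure actually holds these structures in memory is an assumption and is not shown by the log.

```python
n_obs = 100_000          # sample size
n_x = 5_000              # number of independent variables
bytes_per_value = 8      # numeric variables are declared with length 8

# Full data matrix: y plus x1-x5000 for every observation.
data_bytes = n_obs * (n_x + 1) * bytes_per_value

# Square cross-product (or correlation) matrix over y and x1-x5000,
# stored in full; triangular storage would need about half of this.
xprod_bytes = (n_x + 1) ** 2 * bytes_per_value

GIB = 1024 ** 3
print(f"data matrix:          {data_bytes / GIB:.2f} GiB")
print(f"cross-product matrix: {xprod_bytes / GIB:.2f} GiB")
```

The full data matrix comes to roughly 3.7 GiB, close to the memory limit reported in the log, while a cross-product matrix over the same variables is under 0.2 GiB.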

References:
[1] Barker, M. and Rayens, W. (2003), "Partial Least Squares for Discrimination", Journal of Chemometrics, 17, 166-173.

[2] Sun, L; Ji, S; Yu, S; and Ye, J (2009), "On the Equivalence Between Canonical Correlation Analysis and Orthonormalized Partial Least Squares", In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI 2009).

[3] Wold, H. (1966), "Estimation of Principal Components and Related Models by Iterative Least Squares", in P. R. Krishnaiah, ed., Multivariate Analysis, New York: Academic Press.

[4] de Jong, S. (1993), "SIMPLS: An Alternative Approach to Partial Least Squares Regression", Chemometrics and Intelligent Laboratory Systems, 18, 251-263.



NOTE: PROCEDURE PRINTTO used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      Memory                            3920274k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:11:22 PM
      Page Faults                       0
      Page Reclaims                     9
      Page Swaps                        0
      Voluntary Context Switches        1
      Involuntary Context Switches      1
      Block Input Operations            0
      Block Output Operations           0
      

16         options fullstimer;
17         data x;
18              length id y x: 8;
19           array x{5000};
20              do id=1 to 1E5;
21              y=rannor(0);
22              do j=1 to dim(x);
23              x[j]=rannor(0);
24           end;
25           output;
26           drop j;
27           end;
28         run;

NOTE: The data set WORK.X has 100000 observations and 5002 variables.
NOTE: Compressing data set WORK.X increased size by 0.06 percent. 
      Compressed is 100066 pages; un-compressed would require 100009 pages.
NOTE: DATA statement used (Total process time):
      real time           1:07.58
      user cpu time       1:02.41
      system cpu time     5.16 seconds
      Memory                            3920274k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:12:29 PM
      Page Faults                       0
      Page Reclaims                     510
      Page Swaps                        0
      Voluntary Context Switches        63
      Involuntary Context Switches      107
      Block Input Operations            0
      Block Output Operations           0
      

29         
30         
31         proc pls data=x method=simpls noprint;
32              model y =x1-x5000;
33         run;

ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 100000 observations read from the data set WORK.X.
NOTE: PROCEDURE PLS used (Total process time):
      real time           45.89 seconds
      user cpu time       40.11 seconds
      system cpu time     5.77 seconds
      Memory                            3920284k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:13:15 PM
      Page Faults                       0
      Page Reclaims                     978075
      Page Swaps                        0
      Voluntary Context Switches        91
      Involuntary Context Switches      162
      Block Input Operations            0
      Block Output Operations           0
      
35         proc cancorr data=x noprint;
36              var y;
37           with x1-x5000;
38         run;

NOTE: PROCEDURE CANCORR used (Total process time):
      real time           7:02.85
      user cpu time       33:13.85
      system cpu time     4.40 seconds
      Memory                            3920284k
      OS Memory                         3926280k
      Timestamp            11/16/2011  2:20:18 PM
      Page Faults                       0
      Page Reclaims                     126339
      Page Swaps                        0
      Voluntary Context Switches        2096
      Involuntary Context Switches      83359
      Block Input Operations            0
      Block Output Operations           0
      

39    
40         proc printto; run;