Friday, June 25, 2010

"Entrywise" Norm calculation using PROC FASTCLUS

In some data mining applications, matrix norm has to be calculated, for instance [1]. You can find a detailed explanation of Matrix Norm on Wiki @ Here

Instead of user written routine in DATA STEP, we can obtain "Entrywise" norm via PROC FASTCLUS efficiently and accurately.


data matrix;
     input X1-X5;
datalines;
1 2 4 5 6
7 8 9 0 1
2 3 4 5 6
3 4 5 6 7
7 8 9 0 2
2 4 6 8 0
;
run;

data seed;
     input X1-X5;
datalines;
0 0 0 0 0
;
run;

options nosource;
proc export data=matrix  outfile='c:\matrix.csv'  dbms=csv replace; run;
options source;

proc fastclus data=matrix  seed=seed      out=norm(keep=DISTANCE)
              maxiter=0    maxclusters=1  noprint  ;
     var x1-x5;
run;

/* 
In output file NORM, variable DISTANCE is the square root of Frobenius norm. If LEAST=P option is specified, then p-norm is calculated. In PROC FASTCLUS, you can specify p in the range of  [1, \inf].

Now what you got is vector norm for each row, taking the sum of squares of DISTANCE, you obtain the Frobenius norm of the data matrix, which can be easily obtained through PROC MEANS on a data view: 
*/
data normv/ view=normv;
     set norm(keep=DISTANCE);
     DISTANCE2=DISTANCE**2;
     drop DISTANCE;
run;
proc means data=normv noprint;
     var DISTANCE2;
     output  out=matrixnorm  sum(DISTANCE2)=Frobenius_sqr;
run;

You can use the following R code to verify the results;

mat <- read.csv('c:/matrix.csv', header=T)
#verify vector norm
vnorm <- apply(mat, 1, function(x){sqrt(sum(x^2))});
#verify norm of the matrix
x<-as.matrix(mat)
sqrt(sum(diag(t(x)%*%x)))

PS:
1. Of course, above process is designed for implementing the randomized SVD in [1]. If only the matrix Frobenius norm is of interests, you can also use the following code snippet:


data matrixv/view=matrixv;
     set matrix;
     array _x{*}  x1-x5;
     array _y{*}  y1-y5;
     do j=1 to dim(_x);  _y[j]=_x[j]**2; end;
     keep y1-y5;
run;

proc means data=matrixv  noprint;
     var y1-y5;
     output  out=_var(drop=_TYPE_  _FREQ_)   sum()=/autoname;
run;

data _null_;
     set _var;  
     norm=sqrt(sum(of _numeric_));
     put norm=;
run;
/* --LOG WRITES:
norm=28.635642127
NOTE: There were 1 observations read from the data set WORK._VAR.
*/


2. Using its built-in computing engine for Eucleadian Distance, PROC FASTCLUS is also a powerful tool to search for the data point in main table that is CLOEST to the a record in lookup table. This technique is shown Here and [2].


Reference:
[1], P. Drineas and M. W. Mahoney, "Randomized Algorithms for Matrices and Massive Data Sets", Proc. of the 32nd Annual Conference on Very Large Data Bases (VLDB), p. 1269, 2006.

[2], Dorfman, Paul M.; Vyverman, Koen; Dorfman, Victor P., "Black Belt Hashigana", Proc. of the 2010 SAS Global Forum, Seattle, WA, 2010

0 comments: