Wednesday, January 10, 2007

Cartesian Product Self Join using SubQuery

Two datasets
1.
group cat count
1 2 10
1 3 11
2 3 29
2 4 33
3 1 2
3 4 4
3 5 1

Need to transform it to this:
group cat count
1 1
1 2 10
1 3 11
1 4
1 5
2 1
2 2
2 3 29
2 4 33
2 5
3 1 2
3 2
3 3
3 4 4
3 5 1

This is a Catesian Product Self Join. Solving it by Subquery.
data a;
input group cat count;
datalines;
1 2 10
1 3 11
2 3 29
2 4 33
3 1 2
3 4 4
3 5 1
;

proc sql;
create table a1 as select x.group, y.cat
from
(select distinct group from a ) as x,
(select distinct cat from a) as y
;

create table a2 as
select a1.*, a.count
from a right join a1
on a.group=a1.group
and a.cat=a1.cat
;
quit;

This program, however, assumes the all distinct item in the second column will appear. If it is not the case, follow these steps:

extract max(cat)
generate a data set with cat=1...max_cat,
then use the above code

these steps are unavoidable.

Therefore, this more complicated scenarior can be handled by SAS Data step in a more efficient and compact way.

Relative Efficiency of Cartesian Product Join in SAS

sasfile a close;
options fullstimer details notes msglevel=i;
data a;
input company $ zip x y;
datalines;
C1 12345 90 80
C2 13425 75 66
C3 34523 11 21
;
data b;
input zip x y;
datalines;
12345 90 80
13245 04 06
13254 67 09
13425 75 66
65234 85 12
34543 11 21
34542 19 41;

sasfile a load;

/*1: */
data c;
set b(rename=(zip=zip_b
x=x_b
y=y_b
))
;
do i=1 to num_b;
set a nobs=num_b point=i;
zip_pair=trim(zip)'<-->'trim(zip_b);
distance=sqrt((abs(x_b-x))**2+(abs(y_b-y))**2);
output;
end;
run;

/*2: */
proc sql;
create table d as
select a.*, b.*, sqrt((abs(x_b-x))**2+(abs(y_b-y))**2) as distance
from a as a,
b(rename=(zip=zip_b x=x_b y=y_b)) as b;
quit;

/*3: */
data c;
if _n_=1 then do;
declare hash h(hashexp:2);
declare hiter hit('h');
h.defineKey('company');
h.defineData('zip', 'x', 'y', 'company');
h.defineDone();
do until (eof);
set a end=eof;
h.add();
end;
end;
set b(rename=(zip=zip_b
x=x_b
y=y_b
));
do while(hit.next()=0);
ZipPair = catx('<-->', zip, zip_b);
distance=sqrt((abs(x_b-x))**2+(abs(y_b-y))**2);
output;
end;
run;

/*Log shows the relative efficiency of three methods. Notice their memory usage */
1:

NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
9926:23 9926:42
NOTE: There were 7 observations read from the data set WORK.B.
NOTE: The data set WORK.C has 21 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
Memory 214k


2:

NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.D created, with 21 rows and 8 columns.
9936 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.09 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
Memory 322k


3:

NOTE: There were 3 observations read from the data set WORK.A.
NOTE: There were 7 observations read from the data set WORK.B.
NOTE: The data set WORK.C has 21 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.53 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
Memory 535k