Saturday, September 22, 2007

Table lookup, all techniques

Table lookup is a fundamental task SAS programmers are performing everyday. Here I demonstrate all possible cases in Table Lookup and corresponding efficient SAS code.

1. The simplest case is that lookup table table has unique ID variable while master table is pre-sorted. In such case, a MERGE BY statement is more efficient than enough. Alternatively, if lookup table has unique ID variable but master table is too large that pre-sort is not a good idea (such as a billion-record master table), then if you have SAS v9, hash table is a good choice, or we can use KEY= option on the lookup table in a nested SET statement, which is the best solution in both versions because hash table is slower than setting an indexed data.

Using hash table, we have:
The data:

uniq_data

Using DataStep Component Object:

uniq_component_obj

Using SET KEY=:

uniq_set_key


(to be continued...)

不要总是用宏来解决问题

发信人: raze (calm+down), 信区: Statistics
标 题: 请问谁能帮我简化下sas的code?多谢。
发信站: BBS 未名空间站 (Fri Sep 21 20:38:41 2007), 转信

我有一组数据如下,data one。我已经算出了class1=1的所有id相同的体重和身高的和
,并且存入了名字为class1_sum的文件中。然后我用同样的方法算了class2=1,class3=
1,class4=1的所有id相同的人的体重和身高的和,分别生成了另外的三个文件。
怎么样能有用一次sql或者其他命令,达到以上目的呢?多谢。
data one;
input id obs weight height class1 class2 class3 class4;
datalines;
1 1 2 3 1 0 1 1
1 2 3 4 1 1 0 0
2 1 4 5 0 1 1 1
2 2 5 6 1 1 1 1
3 1 1 2 1 1 0 1
4 1 3 5 1 0 1 0
;run;

我的解决方案见图,自认为比用宏的方法好,效率更高。


single-step


SQL solution
--------------------------
data data1;
input id1 $ id2 $ month year ;
datalines;
1 a 3 1978
2 a 4 1977
3 a 2 1978
4 b 3 1977
5 b 2 1977
6 c 3 1977
7 c 9 1991
8 b 7 1974
9 a 6 1980
;


proc sql;
create table new as
select a.id2, a.id1, a.month, a.year,
sum((b.id2=a.id2)&
((b.month<=a.month & b.year=a.year)|(b.month>a.month & b.year=a.year-1)))-1
as Freq
from data1 as a, data1 as b
group by a.id2, a.id1, a.month, a.year;
quit;

Sunday, September 16, 2007

Tracing Events, question from Statistics board @mitbbs.com

Update:
This problem turns out to be a Deep First Search (DFS) problem for graph structure. For a general solution of DFS, see my blog post @ Here .

**********************************************************************************
Tracing events.
The objective of this work is to identify total length of streams identified by ID that are linked by nodes, reversely from TONODE to FROMNODE, all the way to the final node that are not connected with any other streams. If STARTFLAG or BRANCH is 1, no further reverse travel is necessary.

data:

ID LENGTHKM FROMNODE TONODE STARTFLAG BRANCH

T 0.43 773501327 773500764 0 0
kk 1.19 773501293 773501327 0 1
S 0.82 773500791 773501327 1 0
R 4.53 773500796 773500769 0 0
P 3.47 773500809 773500796 0 0
O 1.02 773500821 773500809 0 0
nn 0.86 773500836 773500821 0 1
N 1.08 773500836 773500821 0 0
M 1.14 773500218 773500836 1 0
hh 0.89 773500807 773500809 0 1
Q 0.29 773500778 773500796 0 0
ll 0.18 773501344 773500778 0 1
L 0.62 773501344 773500778 0 0
K 0.09 773501293 773501344 0 0
J 1.78 773500800 773501293 0 0
H 0.43 773500807 773500800 0 0
ff 0.40 773500830 773500807 0 1
I 0.31 773500814 773500800 0 0
gg 0.42 773500813 773500814 0 1
G 0.56 773500813 773500814 0 0
F 0.17 773500830 773500813 0 0
E 2.10 773500833 773500830 0 0
dd 0.14 773500845 773500833 0 1
D 0.44 773500845 773500833 0 0
C 1.36 773500879 773500845 0 0
B 3.42 773500403 773500879 1 0
A 3.79 773500885 773500879 1 0


sample of expected results:

Total T = T + kk + S;
Total R = sum (R ---- A);
Total O = O + N + nn + M;
Total P = P + hh + total O;
Total H = H + ff;
Total I = G + gg + F + E --- A;

My solution using hash object is below:




data



program



result


After all streams that are from the same source are identified and flagged, we can simply calculate the sum of length by ID group.