Friday, October 29, 2010

Summarize Numerical Data in a Rolling Window

Obtain summary statistics over a rolling window for a given data, usually on a time dimension, is not quit easy in SAS, especially the rolling window may contain different number of records and the maximum number is unknown without pass the data once. For example, given a transaction data over several days, a business analyst wants to summarize the data for each 24 hour period. This is actually a recent question on SAS-L.

There are several approaches. Typically people use an array that is large enough to handle a reasonable guess of maximum number of records within the given interval. Or use a hash table so to manage the data cells dynamically. The coder basically needs to build a stack which is first in first out. Both of these methods are not easy to code and error prone for a beginner.

This rolling window problem can be efficiently solved by using MultiLabel Format, a feature tend to be ignored. Here are two example. We want to summarize for a rolling window of size 5. Pay attention to the second example.

data fmt;
      retain fmtname 'rollwindow'  type 'n'  hlo 'M';
      do start=1 to 10;
         end=start+5; 
        label=cats('time', start);
        output;
      end; 
      hlo='O'; label='Out-Of-Bound';
      output;
run;

data dsn;
      do time=1 to 20;
         x=rannor(0);
         y=ranuni(0);
         output;
      end;
run;

proc format cntlin=fmt; run;

proc means data=dsn noprint;
      class time /preloadfmt  mlf; 
      format time rollwindow.;
      var x y;
      output  out=summary_roll mean(x y)=  std(x y)= /autoname;
run;
    

data dsn2;
      do time=1 to 20;
         k=ranpoi(10, 10);
         do j=1 to k;
            time=time+j/(k+1);
            x=rannor(0); y=ranuni(0);
            output;
         end;
      end;
      drop k j;
run;
proc means data=dsn2  noprint;
      class time/preloadfmt mlf  exclusive;
      format time rollwindow.;
      var x y;
      output  out=summary_roll2  mean()=  std()=/autoname;
run;


/**************************************************** 
          non-rolling but shrinking time window, 
          similar for growing time window 
*****************************************************/
data fmt2;
      retain fmtname 'winx'  type 'n'  hlo 'M';
      do start=1 to 10;
         end=18; 
        label=cats('time', start);
        output;
      end; 
      hlo='O'; label='Out-Of-Bound';
      output;
run;      

proc format cntlin=fmt2  cntlout=fmt_all;
run;

proc means data=dsn2  noprint;
      class time/preloadfmt mlf  exclusive;
      format time winx.;
      var x y;
      output  out=summary_roll3  mean()=  std()=/autoname;
run;

**************** An example **************;
data TradeDate;
input TradeDate yymmdd10.;
format TradeDate yymmdd10.;
cards;
2007-01-04
2007-01-05
2007-01-08
2007-01-09
2007-01-10
2007-01-11
2007-01-12
2007-01-15
2007-01-16
2007-01-17
2007-01-18
2007-01-19
2007-01-22
2007-01-23
2007-01-24
2007-01-25
2007-01-26
2007-01-29
2007-01-30
2007-01-31
 ;
run;
data raw;
input id $ Date_S yymmdd10. +1 Date_e yymmdd10. Buy;
format Date_S Date_E yymmdd10.;
cards;
A001 2007-01-09 2007-01-24 24.5
A001 2007-01-12 2007-01-16 56.6
 ;
run;

/*------------- Desired Output ------------*
id      Date_S       Date_E        Buy    Hold_Days
A001 2007-01-09 2007-01-24 24.5    12
A001 2007-01-12 2007-01-30 56.6     3
-------------------------------------------*/
data fmt;
     set raw;
     retain fmtname 'tdate' type 'n'  hlo 'M';
     start=Date_S; end=Date_e;
     label=cats(ID, _n_);
run;
proc format cntlin=fmt  out=fmt_ref;
run;

proc means data=tradeDate noprint  nway;
     class TradeDate/mlf exclusive preloadfmt ;
     format TradeDate tdate.;
     var TradeDate;
     output  out=_test  n()=_freq_;
run;


When the data is getting large, there will be some computing difficulty and insufficient resource. To solve this problem, simply sort and divide the original data into smaller pieces with an overlap equals to the size of rolling window. When combine the summarized pieces, you should discard the overlap part from the rest pieces.

The multilabel approach has an overhead that is smaller overall to the other methods and can be easily changed to accommandate other sizes of rolling windows.

No comments: