Sunday, March 28, 2010

Implement Boost Algorithm in SAS

Boosting algorithms have proven to be very effective data mining tools, whether used stand-alone or as building blocks to handle nonlinearity and similar problems. Implementations of boosting in SAS are not easy to find, although it is not difficult to write one yourself. I completely rewrote the %Boost macro from the book "Pharmaceutical Statistics Using SAS: A Practical Guide" [1], which covers the AdaBoost, RealBoost, GentleBoost and LogitBoost algorithms. The related %Predict macro is very straightforward to rewrite, too. Note that the book's macro uses the Gini index as its impurity measure, whereas in most standard implementations a weak classifier directly minimizes the weighted error rate [2].
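
To make the contrast with the Gini-based macro concrete, here is a minimal sketch of discrete AdaBoost in which each decision stump directly minimizes the weighted error rate, in the spirit of [2]. It is written in Python rather than SAS purely for illustration and is not a translation of %Boost. The names fit_stump, adaboost and predict, the single-feature setup, the -1/+1 labels and the toy data are all assumptions of this sketch.

```python
import math

def stump_predict(x, threshold, polarity):
    # Hypothetical helper: predicts `polarity` above the threshold, its negation otherwise.
    return polarity if x > threshold else -polarity

def fit_stump(xs, ys, ws):
    """Brute-force search for the stump with the smallest weighted error.

    Labels are assumed to be -1 or +1. Returns (threshold, polarity, error).
    """
    values = sorted(set(xs))
    # Candidate cuts: one below the minimum, then midpoints between distinct values.
    cands = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = (None, 1, float("inf"))
    for t in cands:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if stump_predict(x, t, pol) != y)
            if err < best[2]:
                best = (t, pol, err)
    return best

def adaboost(xs, ys, rounds):
    """Discrete AdaBoost on one feature; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    ws = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        t, pol, err = fit_stump(xs, ys, ws)
        err = min(max(err, 1e-12), 1 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, pol))
        # Up-weight misclassified records, down-weight correct ones, then renormalize.
        ws = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
              for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]
    return model

def predict(model, x):
    score = sum(alpha * stump_predict(x, t, pol) for alpha, t, pol in model)
    return 1 if score >= 0 else -1

# Toy data (made up for this sketch): no single stump separates the classes.
toy_x = list(range(10))
toy_y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
errors_1 = sum(predict(adaboost(toy_x, toy_y, 1), x) != y for x, y in zip(toy_x, toy_y))
errors_3 = sum(predict(adaboost(toy_x, toy_y, 3), x) != y for x, y in zip(toy_x, toy_y))
```

On this toy series a single stump misclassifies three of the ten points, while three boosting rounds classify all ten correctly. The stump search here is deliberately brute force, so it says nothing about the speed of either macro.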

The original macro (found here) uses SAS/IML, and the way it handles computation makes it impractical for data sets of even moderate size. For example, on a table with 4000+ observations, the original macro uses more than 850 MB of memory and may crash the program. Many rare-event studies need a larger sample. For example, a fraud detection program with a 0.01% fraud rate may well require more than 100K total records for a reliable analysis. A much improved SAS/IML version can be found here.

The new macro, which uses the DATA step, copes much better with very large data sets and is much faster on them (when your data has fewer than 3000 observations, the original macro is much faster). Basically, the computing time and space requirements of the new macro are linear in the number of observations, whereas those of the original macro are quadratic. The new macro relies on the computing modules %gini, %stump_gini, %csss and %stump_css from this post (here).
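
One way a stump search can stay linear in the number of observations is to sort the records once and then sweep through them while carrying running totals, which is the kind of sequential pass a DATA step handles well. The Python sketch below illustrates that idea for a weighted Gini split. It is an assumption about the general technique, not the actual code of %gini or %stump_gini, and the name best_gini_split, the 0/1 labels and the strictly positive weights are choices made for this example. The sort itself costs O(n log n); only the sweep is linear.

```python
def best_gini_split(xs, ys, ws):
    """Find the cut on one feature that minimizes weighted Gini impurity.

    Assumes labels are 0 or 1 and all weights are strictly positive.
    Returns (threshold, impurity), or (None, inf) if no cut is possible.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total_w = sum(ws)
    total_pos = sum(w for y, w in zip(ys, ws) if y == 1)

    left_w = 0.0     # running weight of records at or below the current position
    left_pos = 0.0   # running weight of positive records on the left
    best_t, best_imp = None, float("inf")

    for k in range(len(order) - 1):
        i, j = order[k], order[k + 1]
        left_w += ws[i]
        left_pos += ws[i] * ys[i]
        if xs[i] == xs[j]:
            continue  # no cut can fall between two equal values
        right_w = total_w - left_w
        right_pos = total_pos - left_pos
        p_left = left_pos / left_w
        p_right = right_pos / right_w
        # Weighted average of the two-class Gini index 2p(1-p) on each side.
        imp = (left_w * 2 * p_left * (1 - p_left)
               + right_w * 2 * p_right * (1 - p_right)) / total_w
        if imp < best_imp:
            best_t, best_imp = (xs[i] + xs[j]) / 2, imp
    return best_t, best_imp
```

Each candidate cut is scored in constant time from the running totals, so no pairwise comparison of records and no n-by-n matrix is ever built.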

Note that when duplicate values exist, the two macros will not produce exactly the same result if different response categories appear among the records that share a value, because of the order in which those responses are arranged. Both programs still work, in the sense that they reduce the error rate as the iterations go on.
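
The effect of tied values can be reproduced with a small sketch. The assumption here, which is mine and not something confirmed for either macro, is that a scan places a candidate cut after every sorted record, including between two records with the same value. In that case the best impurity depends on how the responses happen to be ordered within the tie. The Python below is illustrative only, and best_impurity and the four-record data sets are made up for the example.

```python
def gini(pos, n):
    # Two-class Gini index for a node with `pos` positives out of `n` records.
    p = pos / n
    return 2 * p * (1 - p)

def best_impurity(records, skip_ties):
    """Smallest split impurity over candidate cuts on (x, y) records with y in {0, 1}.

    With skip_ties=False a cut is tried after every sorted position,
    even between two records that share the same x.
    """
    recs = sorted(records, key=lambda r: r[0])  # stable sort: ties keep input order
    n = len(recs)
    total_pos = sum(y for _, y in recs)
    left_pos = 0
    best = float("inf")
    for k in range(n - 1):
        left_pos += recs[k][1]
        if skip_ties and recs[k][0] == recs[k + 1][0]:
            continue
        n_left = k + 1
        n_right = n - n_left
        imp = (n_left * gini(left_pos, n_left)
               + n_right * gini(total_pos - left_pos, n_right)) / n
        best = min(best, imp)
    return best

# Same four records; only the order of the two records with x = 2 differs.
order_a = [(1, 0), (2, 0), (2, 1), (3, 1)]
order_b = [(1, 0), (2, 1), (2, 0), (3, 1)]

naive_a = best_impurity(order_a, skip_ties=False)
naive_b = best_impurity(order_b, skip_ties=False)
aware_a = best_impurity(order_a, skip_ties=True)
aware_b = best_impurity(order_b, skip_ties=True)
```

With a cut allowed inside the tie, order_a reaches an impurity of zero while order_b does not, so the two orderings give different stumps. When cuts are only allowed between distinct values, both orderings give the same result.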

Future development includes a weak classifier that directly minimizes the error rate, and a more efficient way to handle small data sets.

References:
[1] Alex Dmitrienko, Christy Chuang-Stein, Ralph B. D'Agostino, "Pharmaceutical Statistics Using SAS: A Practical Guide", SAS Institute, 2007.
[2] Yoav Freund and Robert E. Schapire, "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.

3 comments:

Diego said...

Hi. I found this post but I can't seem to find the macro you mention. Where can I get it? Thanks.

Liang Xie said...

Check this post:

http://www.sas-programming.com/2010/04/improve-boost-macro-from-rayens-w-and.html

Diego said...

Thanks! If I make use of it I'll let you know how it went.