Tuesday, February 26, 2013

Poor man's HPQLIM?

Tobit model is a type of censored regression and is one of the most important regression models you will encounter in business. Amemiya 1984 classified Tobit models into 5 categories and interested reader can refer to SAS online doc for details. In SAS, PROC QLIM is used to estimate various types of Tobit models as well as related class of models, such as Heckman's selection model, etc.



Jason (Blog @ Analytics-in-writing) in his recent post showed that PROC QLIM is not designed for large data, meaning for data set with millions or more records, and SAS developed HPQLIM, the High Performance version of QLIM, to deal with large scale modeling within Tobit model framework. While I celebrate the roll out of HPQLIM, I think the example Jason used is poorly chosen. He used standard Tobit model (type 1), while he really should use a more complicated one to demonstrate the full power of HPQLIM because there already is a solution in SAS to fit standard Tobit model on very large data set very fast.

Before we dive into the example Jason mentioned, it is useful to review the concept of censoring and truncation in regression analysis. Suppose there is a variable Y following a continuous distribution D and a constant K. After a value y is sampled from Y:

if y < K (y > K) and discard the whole record, then it is left (right) truncation;

if y < K (y > K) and report y=K, then it is left (right) censoring.

Jason showed that, when facing millions of records, PROC QLIM is very slow in estimating even the Type 1 Tobit model, the simplest left censoring at 0, and suggested to use HPQLIM which shows significant improvement in computing time. Unfortunately, not every business can afford High Performance (HPx) solution from SAS but Tobit model is so prevailing in observational study, people may wonder is there a way to get around this within the SAS environment but with much improved speed?

Yes, at least for Type 1 Tobit model. SAS Manual provide an exampling using PROC LIFEREG for Type 1 Tobit model since it is left censoring and the log likelihood function is no difference from a parametric survival model with corresponding distribution:

$$ l = \sum_{i \in (y=0)} \left [ 1- \Phi(X_i\beta/\sigma)\right ] + \sum_{i \in (y>0)} \left [ \Phi(X_i\beta/\sigma)/\sigma \right ] $$

PROC LIFEREG is significantly faster comparing to PROC QLIM as it is optimized for this particular type of model. In order to demonstrate the speed gain, I used a similar data as Jason used, the training sample from KDD CUP 1998. This is the original data where the sample Jason used is from, with 95K observations and 400+ variables. In consistent with Jason's test, we will use 10 randomly selected numerical variables in the model fitting, and stack the data 100 times so that the final data feed to PROC QLIM & PROC LIFEREG has about 9.5 million observations.

The SAS log below shows that, on a laptop with i5 2520M CPU clocked at 2.5Ghz and with 4G memory, PROC QLIM used almost 39 minutes while PROC LIFEREG only used less than 37 sec, more than 60 times difference. In comparison, HPQLIM in Jason's example uses 1/3 of the time of QLIM. Based on the real time data, Jason's OC (old client) machine runs about the same speed as my laptop using QLIM, so I expect the same results from HPQLIM if I could have run one, namely, 12 minutes real time, 27 minutes CPU time. Still not comparable to LIFEREG's less than 1 minute time usage.

6092

6093 %let keepvars=target_: age wealth1 hit CARDPROM numprom cardpm12 numprm12

6094 RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT

6095 ;

6096 data pva;

6097 set train(keep=&keepvars);

6098 do i=1 to 100;

6099 output;

6100 end;

6101 run;

NOTE: There were 95412 observations read from the data set WORK.TRAIN.

NOTE: The data set WORK.PVA has 9541200 observations and 15 variables.

NOTE: DATA statement used (Total process time):

real time 1:11.55

cpu time 5.26 seconds
6102

6103 proc qlim data=pva;

6104 model target_d = age wealth1 hit CARDPROM numprom cardpm12 numprm12

6105 RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT ;

6106 endogenous target_d~censored(lb=0);

6107 run;
NOTE: Convergence criterion (FCONV=2.220446E-16) satisfied.

NOTE: At least one element of the gradient is greater than 1e-3.

NOTE: PROCEDURE QLIM used (Total process time):

real time 39:51.24

cpu time 39:33.01

6108

6109 /* make it suitable for parametric survival model */

6110 data pva;

6111 set pva;

6112 if target_d=. then y=0; else y=target_d;

6113 run;

NOTE: There were 9541200 observations read from the data set WORK.PVA.

NOTE: The data set WORK.PVA has 9541200 observations and 16 variables.

NOTE: DATA statement used (Total process time):

real time 1:57.75

cpu time 11.82 seconds

6114

6115 proc lifereg data=pva;

6116 model (target_d, y) = age wealth1 hit CARDPROM numprom cardpm12 numprm12

6117 RAMNTALL NGIFTALL CARDGIFT MINRAMNT LASTGIFT ;

6118 run;

NOTE: Algorithm converged.

NOTE: PROCEDURE LIFEREG used (Total process time):

real time 36.53 seconds

cpu time 13.06 seconds

One additional advantage of PROC LIFEREG is that it also can very easily handle observation-by-observation censoring, which is interval censoring in a survival analysis.

The disadvantage of PROC LIFEREG comparing to PROC QLIM in the Tobit model framework is that PROC LIFEREG only computes predicted value for uncensored observations, but in Tobit model, the expected value of censored observations is of particular interests. To do this, analyst will have to use DATA STEP to manually compute this value for each of censored observations. But this is easy.

For example, standard Tobit model usually assumes normal distribution for latent variable, and in this case, the expected observed value can be calculated as: $$ E(Y_i) = \Phi(X_i\beta/\sigma)(X_i\beta + \sigma \lambda)$$

where  $$ \sigma=SCALE $$ is from estimation and  $$\lambda=\phi(X_i\beta/\sigma)/\Phi(X_i\beta/\sigma)$$, where $$\phi(), \Phi()$$ indicate density function and cumulative probability function from standard normal distribution, respectively.

Alternatively, if a logistic regression is assumed, then $$ E(Y_i) = \sigma \log(1 + exp(X_i\beta/\sigma))$$

Even if with this extra step to calculate necessary quantities, the total time it would take is still significantly smaller than using PROC QLIM. As seen from the log, go through the whole data set computing these quantities with 9.5 million records took less than 2 minutes, so the total time using PROC LIFEREG is less than 3 minutes, that is still only 8% of what it takes by PROC QLIM. Recall that in Jason's poset, HPQLIM runs on a Greenplum Application with 35 working nods still takes 1 minutes 24 seconds to finish the fitting and scoring process on a data set with 7.7 million records.

Lastly but not the least, for this particular example, a Heckman's selection model is more appropriate because the intention to donate (the latent variable) is correlated with the amount of donation if the subject decides to donate, while a standard Tobit model literally assumes independence. And applying a Full Information Maximum Likelihood (FIML) estimation of Heckman's selection model on such a large data set is where HPQLIM really shine and there is no immediate replacement available in current SAS solution.

Reference:
SAS Institute, Inc., SAS/ETS User's Guide for v9.2, Cary, NC
SAS Institute, Inc., SAS/STAT User's Guide for v9.2, Cary, NC.

Amemiya, T. (1984), "Tobit Models: A Survey," Journal of Econometrics, 24, 3–61.


2 comments:

Arun said...

A gentle query: Asides the computing speed, are the results from Qlim and HPQlim identical? With big data, estimated parameters could 'always' be statistically significant either p < .05 or p < .01. Kindly advice.

"Asymptotically we are all dead." - Keynes , Novick

Liang Xie said...

I don't have HPQLIM licensed on the machines I use. You can, however, direct your question to Jason (This post is a response to a post on his blog which is cited in my article).