Friday, October 23, 2009

AUC calculation using Wilcoxon Rank Sum Test

Accurately Calculate AUC (Area Under the Curve) in SAS for a binary classifier rank ordered data

In order to calculate AUC for a given SAS data set that is already rank ordered by a binary classifier (such as linear logistic regression), where we have the binary outcome Y and rank order measurement P_0 or P_1 (for class 0 and 1 respectively), we can use PROC NPAR1WAY to obtain Wilcoxon Rank Sum statistics and from there we are able to obtain accurate measurement of AUC for this given data.

The relationship between AUC and Wilcoxon Rank Sum test statistics is: AUC = (W-W0)/(N1*N0)+0.5 where N1 and N0 are the frequency of class 1 and 0, and W0 is the Expected Sum of Ranks under H0: Randomly ordered, and W is the Wilcoxon Rank Sums.

In one application example shown below, PROC LOGISTIC reports c=0.911960, while this method calculates it as AUC=0.9119491555



%macro AUC( dsn, Target, score);
ods select none;
ods output WilcoxonScores=WilcoxonScore;
proc npar1way wilcoxon data=&dsn ;
     where &Target^=.;
     class &Target;
     var  &score; 
run;
ods select all;

data AUC;
    set WilcoxonScore end=eof;
    retain v1 v2 1;
    if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
    v2=N*v2;
    if eof then do;
       d=v1/v2;
       Gini=d * 2;    AUC=d+0.5;    
       put AUC=  GINI=;
       keep AUC Gini;
     output;
   end;
run;
%mend;

data test;
  do i = 1 to 10000;
     x = ranuni(1);
     y=(x + rannor(2315)*0.2 > 0.35 ) ; 
     output;
  end;
run;

ods select none;
ods output Association=Asso;
proc logistic data = test desc;
    model y = x;
    score data = test out = predicted ; 
run;
ods select all;

data _null_;
     set Asso;
     if Label2='c' then put 'c-stat=' nValue2;
run;
%AUC( predicted, y, p_0);

NPAR1WAY gets        AUC = 0.91766634744;
LOGISTIC reports c-statistic = 0.917659

So, which one is more accurate? I would say, NPAR1WAY. The reason is that we can also use yet another procedure, PROC FREQ to verify the gini value which is 2*(AUC-0.5). Gini index is called Somers'D in PROC FREQ. Here, from NPAR1WAY, gini value is calculated as 0.8353269487, the same as reported Somer's D C|R (since the column variable is predictor)from PROC FREQ:



proc freq data=test noprint;
     tables y*x/ measures;
     output out=_measures measures;
run;

data _null_;
     set _measures;
     put _SMDCR_=;
run; 

Then why not just use PROC FREQ since the coding is so simple? Well, the answer is really about the SPEED! Check the log below for a data with only 100000 observations, 37.63sec vs. 0.15 sec in real time:


3546
3547  data one;
3548       call streaminit(98676876);
3549       do id=1 to 1e5;
3550          score=ranuni(0)*1000;
3551          if score+rannor(0)>0 then y=1;
3552          else y=0;
3553          output;
3554          drop id;
3555       end;
3556  run;

NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds


3557
3558  proc freq data=one noprint;
3559       tables score*y/measures noprint;
3560       output  out=_freq_out  measures;
3561  run;

NOTE: There were 100000 observations read from the data set WORK.ONE.
NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           37.63 seconds
      cpu time            37.56 seconds


3562
3563  data _null_;
3564       set _freq_out;
3565       AUC=_smdrc_/2 + 0.5;
3566       put "AUC = " AUC "   SOMER'S D R|C = " _smdrc_;
3567  run;

AUC = 0.9995285252    SOMER'S D R|C = 0.9990570504
NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


3568
3569  %AUC(one, y, score);

NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.
NOTE: There were 100000 observations read from the data set WORK.ONE.
      WHERE y not = .;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds



AUC=0.9995285252 Gini=0.9990570504
NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
NOTE: The data set WORK.AUC has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

6 comments:

Charlie Shipp Family said...

Looks great .!.

Thanks for your work in sasCommunity.

Charlie Shipp

eskimokitty said...

It is awesome. Thank you so much!!

raspcompote said...

Thank you for this useful post.

Luis Gustavo said...

I'd like to know where did you find this relationship between AUC and Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!

Thanks

Liang Xie said...

The relationship is well explained at the Wikipedia page below:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U

Jon Dickens said...

You are confusing the Gini with the accuracy ratio but you are not alone several people at SAS make the same mistake.

if you are interested in discussing this issue, then contact me via linkedin.

Jon Dickens