In order to calculate AUC for a given SAS data set that is already rank ordered by a binary classifier (such as linear logistic regression), where we have the binary outcome Y and rank order measurement P_0 or P_1 (for class 0 and 1 respectively), we can use PROC NPAR1WAY to obtain Wilcoxon Rank Sum statistics and from there we are able to obtain accurate measurement of AUC for this given data.
The relationship between AUC and Wilcoxon Rank Sum test statistics is: AUC = (W-W0)/(N1*N0)+0.5 where N1 and N0 are the frequency of class 1 and 0, and W0 is the Expected Sum of Ranks under H0: Randomly ordered, and W is the Wilcoxon Rank Sums.
In one application example shown below, PROC LOGISTIC reports c=0.911960, while this method calculates it as AUC=0.9119491555
%macro AUC( dsn, Target, score);
ods select none;
ods output WilcoxonScores=WilcoxonScore;
proc npar1way wilcoxon data=&dsn ;
where &Target^=.;
class &Target;
var &score;
run;
ods select all;
data AUC;
set WilcoxonScore end=eof;
retain v1 v2 1;
if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
v2=N*v2;
if eof then do;
d=v1/v2;
Gini=d * 2; AUC=d+0.5;
put AUC= GINI=;
keep AUC Gini;
output;
end;
run;
%mend;
data test;
do i = 1 to 10000;
x = ranuni(1);
y=(x + rannor(2315)*0.2 > 0.35 ) ;
output;
end;
run;
ods select none;
ods output Association=Asso;
proc logistic data = test desc;
model y = x;
score data = test out = predicted ;
run;
ods select all;
data _null_;
set Asso;
if Label2='c' then put 'c-stat=' nValue2;
run;
%AUC( predicted, y, p_0);
NPAR1WAY gets AUC = 0.91766634744;LOGISTIC reports c-statistic = 0.917659
So, which one is more accurate? I would say, NPAR1WAY. The reason is that we can also use yet another procedure, PROC FREQ to verify the gini value which is 2*(AUC-0.5). Gini index is called Somers'D in PROC FREQ. Here, from NPAR1WAY, gini value is calculated as 0.8353269487, the same as reported Somer's D C|R (since the column variable is predictor)from PROC FREQ:
proc freq data=test noprint;
tables y*x/ measures;
output out=_measures measures;
run;
data _null_;
set _measures;
put _SMDCR_=;
run;
Then why not just use PROC FREQ since the coding is so simple? Well, the answer is really about the SPEED! Check the log below for a data with only 100000 observations, 37.63sec vs. 0.15 sec in real time:
3546
3547 data one;
3548 call streaminit(98676876);
3549 do id=1 to 1e5;
3550 score=ranuni(0)*1000;
3551 if score+rannor(0)>0 then y=1;
3552 else y=0;
3553 output;
3554 drop id;
3555 end;
3556 run;
NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
3557
3558 proc freq data=one noprint;
3559 tables score*y/measures noprint;
3560 output out=_freq_out measures;
3561 run;
NOTE: There were 100000 observations read from the data set WORK.ONE.
NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
NOTE: PROCEDURE FREQ used (Total process time):
real time 37.63 seconds
cpu time 37.56 seconds
3562
3563 data _null_;
3564 set _freq_out;
3565 AUC=_smdrc_/2 + 0.5;
3566 put "AUC = " AUC " SOMER'S D R|C = " _smdrc_;
3567 run;
AUC = 0.9995285252 SOMER'S D R|C = 0.9990570504
NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
3568
3569 %AUC(one, y, score);
NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.
NOTE: There were 100000 observations read from the data set WORK.ONE.
WHERE y not = .;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
real time 0.10 seconds
cpu time 0.09 seconds
AUC=0.9995285252 Gini=0.9990570504
NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
NOTE: The data set WORK.AUC has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
6 comments:
Looks great .!.
Thanks for your work in sasCommunity.
Charlie Shipp
It is awesome. Thank you so much!!
Thank you for this useful post.
I'd like to know where did you find this relationship between AUC and Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!
Thanks
The relationship is well explained at the Wikipedia page below:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U
You are confusing the Gini with the accuracy ratio but you are not alone several people at SAS make the same mistake.
if you are interested in discussing this issue, then contact me via linkedin.
Jon Dickens
Post a Comment