数据挖掘 - 在 R 中生成 KS 统计量的好方法是什么？ - 吾爱随笔录

Kolmogorov-Smirnov (KS) 统计是评估营销或信用风险模型预测能力的常用指标之一。

KS 统计量通常针对逻辑回归问题发布，以指示模型的质量。

维基百科页面（https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test）给出了这样的解释：

下面的 KS 统计计算来自https://cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf上的第 5 页

require(ROCR)
set.seed(7)
prd=runif(1000)
act=round(prd)
prd[sample(1000,500)]=runif(500) #noise

pred<-prediction(prd,act)
perf <- performance(pred,"tpr","fpr")
#this code builds on ROCR library by taking the max delt
#between cumulative bad and good rates being plotted by
#ROCR see https://cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf
ks=max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
plot(perf,main=paste0(' KS=',round(ks*100,1),'%'))
lines(x = c(0,1),y=c(0,1))
print(ks);
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
print(auc)

它给出了 0.4939511 的 KS 和 0.7398465 的 AUC。可以看出，曲线下的面积显然是正确的。

KS 统计量的 R 代码是否正确？如果不是，应该是什么？

更新

警告：从一个非常块状的样本决策树输出中，Sharma 方法给出了更好的 KS 分数估计，而 ks.test 给出了一个非常糟糕的估计。

     auc         ks.score.D^+ sharma.ks
    0.6153846    0.7045455    0.2307692

ks.test 函数给出了这个错误：1: In ks.test(prd, act, alternative = "greater") : cannot compute exact p-value with ties

act=c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
  0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 
  0, 0, 1)
prd=c(0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.352941176470588, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.914893617021277, 0.914893617021277, 0.914893617021277, 
  0.914893617021277, 0.352941176470588, 0.352941176470588, 0.914893617021277, 
  0.914893617021277) ##this only has two values
ks.test(prd, act, alternative='greater')$statistic
#gives 0.7045455 which clearly isn't correct