
Randomly generating scores that resemble those of a classification model

Cross Validated · Tags: machine-learning, classification, roc, random-generation

2022-03-15 11:03:43

Hello, number crunchers,

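I want to generate n random scores (each with a class label) as if they had been produced by a binary classification model. Specifically, the following properties are required:

Every score lies between 0 and 1.
Every score is associated with a binary label of "0" or "1" (the latter being the positive class).
The overall precision of the scores, i.e. the share of scores labeled "1", should be e.g. 0.1 (<- a parameter of the generator).
The share of scores labeled "1" should be higher than the overall precision at the top of the ranking and lower at the bottom (<- this "model quality" should also be a parameter of the generator).
The scores should be such that the resulting ROC curve is smooth (and not, for example, a bunch of scores labeled "1" at the top and the remaining scores labeled "1" at the bottom of the list).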

Does anyone know how to approach this? Perhaps by generating a ROC curve and then generating points from that curve? Thanks in advance!

1 Answer

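Some time has passed and I think I may have a solution at hand. I will briefly describe my approach to give you the general idea; the code should be enough to work out the details. I would like to attach the code here, but it is long and stackexchange does not make that easy. I am of course happy to answer any comments, and I appreciate any criticism.

The code can be found below.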

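Strategy:

Approximate a smooth ROC curve with the logistic function on the interval [0,6].
Adding a parameter k makes it possible to shape the curve to fit the desired model quality, measured by the AUC (area under the curve). The resulting function is $f_k(x)=\frac{1}{1+\exp(-kx)}$. As k -> 0 the AUC approaches 0.5 (no better than random); as k -> Inf the AUC approaches 1 (optimal model). As a practical convention, k should lie in the interval [0.0001,100]. With some basic calculus you can create a function that maps k to the AUC and vice versa.
Now, given a ROC curve that matches the desired AUC, determine a score by sampling uniformly from [0,1]. This sample represents the fpr (false positive rate) on the ROC curve. For simplicity, the score is computed as 1-fpr.
The label is then determined by sampling from a Bernoulli distribution whose p is computed from the slope of the ROC curve at this fpr and the desired overall precision of the scores. In detail: weight(label="1") := slope(fpr) times overallPrecision, weight(label="0") := 1 times (1-overallPrecision). Normalize the weights so that they sum to 1 to obtain p and 1-p.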
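
The "basic calculus" in the strategy above can be written out as follows. This is a sketch reconstructed from the description and from the R functions in this answer, not a formula given by the author; the symbol $\pi$ for the overall precision and the name $\mathrm{ROC}_k$ are notation assumed here.

```latex
\begin{aligned}
f_k(x) &= \frac{1}{1+\exp(-kx)} \\[4pt]
% ROC curve: logistic on [0,6], rescaled so that it runs from (0,0) to (1,1)
\mathrm{ROC}_k(x) &= \frac{f_k(6x) - \tfrac12}{f_k(6) - \tfrac12}, \qquad x \in [0,1] \\[4pt]
% slope of the ROC curve (chain rule, inner derivative 6)
\mathrm{ROC}_k'(x) &= \frac{6\,k\,\exp(-6kx)}{\bigl(1+\exp(-6kx)\bigr)^{2}\,\bigl(f_k(6) - \tfrac12\bigr)} \\[4pt]
% AUC, using \int f_k(u)\,du = \tfrac1k \ln(1+\exp(ku))
\mathrm{AUC}(k) &= \int_0^1 \mathrm{ROC}_k(x)\,dx
  = \frac{\tfrac1k \ln\!\frac{1+\exp(6k)}{2} - 3}{6\,\bigl(f_k(6) - \tfrac12\bigr)} \\[4pt]
% probability of label "1" at a sampled fpr x, after normalizing the two weights
p(x) &= \frac{\pi\,\mathrm{ROC}_k'(x)}{\pi\,\mathrm{ROC}_k'(x) + (1-\pi)}
\end{aligned}
```

Under these definitions $\mathrm{AUC}(k) \to \tfrac12$ as $k \to 0$ and $\mathrm{AUC}(k) \to 1$ as $k \to \infty$, which matches the limits stated above. Since $\mathrm{AUC}(k)$ is increasing in $k$, the inverse mapping from AUC to k can be obtained numerically with a root finder.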

Here is an example ROC curve for AUC = 0.6 and overall precision = 0.1 (it is also produced by the code below).

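Notes:

The resulting AUC is not exactly equal to the input AUC; there is a small error (around 0.02). This error stems from the way the label of a score is determined. One improvement might be to add a parameter that controls the size of the error.
The score is set to 1-fpr. This is arbitrary, because the ROC curve does not care what the scores look like as long as they can be ranked.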

Code:

# This function creates a set of random scores together with a binary label
# n = sampleSize
# basePrecision = ratio of positives in the sample (also called overall Precision on stats.stackexchange)
# auc = Area Under Curve i.e. the quality of the simulated model. Must be in [0.5,1].
# 
binaryModelScores <- function(n,basePrecision=0.1,auc=0.6){
  # determine parameter of logistic function
  k <- calculateK(auc)

  res <- data.frame("score"=rep(-1,n),"label"=rep(-1,n))
  randUniform <- runif(n,0,1)
  runIndex <- 1
  for(fpRate in randUniform){
    tpRate <- roc(fpRate,k)

    # slope
    slope <- derivRoc(fpRate,k)

    labSampleWeights <- c((1-basePrecision)*1,basePrecision*slope)
    labSampleWeights <- labSampleWeights/sum(labSampleWeights)

    res[runIndex,1] <- 1-fpRate # score
    res[runIndex,2] <- sample(c(0,1),1,prob=labSampleWeights) # label

    runIndex<-runIndex+1
  }
  res
} 

# min-max-normalization of x (fpr): [0,6] -> [0,1]
transformX <- function(x){
  (x-0)/(6-0) * (1-0)+0
}

# inverse min-max-normalization of x (fpr): [0,1] -> [0,6]
invTransformX <- function(invx){
  (invx-0)/(1-0) *(6-0) + 0
}

#  min-max-normalization of y (tpr): [0.5,logistic(6,k)] -> [0,1]
transformY <- function(y,k){
 (y-0.5)/(logistic(6,k)-0.5)*(1-0)+0
}

# logistic function
logistic <- function(x,k){
  1/(1+exp(-k*x))
}

# integral of logistic function
intLogistic <- function(x,k){
  1/k*log(1+exp(k*x))
}

# derivative of logistic function
derivLogistic <- function(x,k){
  numerator <- k*exp(-k*x)
  denominator <- (1+exp(-k*x))^2
  numerator/denominator
}

# roc-function, mapping fpr to tpr
roc <- function(x,k){
  transformY(logistic(invTransformX(x),k),k)
}

# derivative of the roc-function
derivRoc <- function(x,k){
    scalFactor <- 6 / (logistic(6,k)-0.5)
    derivLogistic(invTransformX(x),k) * scalFactor
}

# calculate the AUC for a given k 
calculateAUC <- function(k){
  ((intLogistic(6,k)-intLogistic(0,k))-(0.5*6))/((logistic(6,k)-0.5)*6)
}

# calculate k for a given auc
# (uniroot searches k in [0.0001,100], so auc must stay below calculateAUC(100), roughly 0.998)
calculateK <- function(auc){
  f <- function(k){
      return(calculateAUC(k)-auc)
  }  
  if(f(0.0001) > 0){
     return(0.0001)
  }else{  
    return(uniroot(f,c(0.0001,100))$root)
  }
}

# Example
library(ROCR) # library() stops with an error if ROCR is not installed

x <- seq(0,1,by=0.01)
k <- calculateK(0.6)
plot(x,roc(x,k),type="l",xlab="fpr",ylab="tpr",main=paste("ROC-Curve for AUC=",0.6," <=> k=",k))

dat <- binaryModelScores(1000,basePrecision=0.1,auc=0.6)

pred <- prediction(dat$score,as.factor(dat$label))
performance(pred,measure="auc")@y.values[[1]]
perf <- performance(pred, measure = "tpr", x.measure = "fpr") 
plot(perf,main="approximated ROC-Curve (random generated scores)")
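
The properties asked for in the question, and the note about the AUC error, can be checked without ROCR. The following is a minimal, self-contained sketch under a few assumptions: the function names (`roc_slope`, `auc_of_k`, `k_of_auc`, `simulate_scores`, `rank_auc`) are hypothetical and not part of the answer's code, the generator is a vectorized restatement of the same sampling scheme that uses `rbinom` instead of a `sample` call inside a loop, and the AUC is computed with the rank-based (Mann-Whitney) formula.

```r
# Logistic function with steepness k, as in the answer.
logistic <- function(x, k) 1 / (1 + exp(-k * x))

# Slope of the rescaled ROC curve at a given fpr in [0, 1].
roc_slope <- function(fpr, k) {
  u <- 6 * fpr
  k * exp(-k * u) / (1 + exp(-k * u))^2 * 6 / (logistic(6, k) - 0.5)
}

# Closed-form AUC of the rescaled ROC curve for a given k.
auc_of_k <- function(k) {
  area <- (log(1 + exp(6 * k)) - log(2)) / k  # integral of the logistic over [0, 6]
  (area - 3) / (6 * (logistic(6, k) - 0.5))
}

# Numerical inverse: the k whose curve has the requested AUC.
k_of_auc <- function(auc) {
  uniroot(function(k) auc_of_k(k) - auc, c(1e-4, 100), tol = 1e-9)$root
}

# Vectorized version of the sampling scheme (hypothetical helper).
simulate_scores <- function(n, base_rate = 0.1, auc = 0.6) {
  k <- k_of_auc(auc)
  fpr <- runif(n)                              # uniform sample of fpr values
  w1 <- base_rate * roc_slope(fpr, k)          # weight of label "1"
  p <- w1 / (w1 + (1 - base_rate))             # normalized Bernoulli parameter
  data.frame(score = 1 - fpr, label = rbinom(n, 1, p))
}

# Rank-based AUC (Mann-Whitney U statistic divided by n1 * n0).
rank_auc <- function(score, label) {
  r <- rank(score)
  n1 <- as.numeric(sum(label == 1))
  n0 <- as.numeric(sum(label == 0))
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
dat <- simulate_scores(20000, base_rate = 0.1, auc = 0.6)

empirical_auc <- rank_auc(dat$score, dat$label)
base_rate_hat <- mean(dat$label)                 # share of label "1" overall
top <- mean(dat$label[dat$score >= 0.9])         # share of label "1" among top scores
bottom <- mean(dat$label[dat$score <= 0.1])      # share of label "1" among bottom scores

print(c(auc = empirical_auc, base_rate = base_rate_hat, top = top, bottom = bottom))
```

A larger n than in the example above is used here only to reduce sampling noise, so that a systematic gap between the requested and the empirical AUC is easier to see.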
