Cross Validated - Contribution of each covariate to a single prediction in a logistic regression model

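Suppose, for example, that we have a logistic regression model that outputs the probability that a patient will develop a particular disease, based on many covariates.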

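By examining the model's coefficients and the corresponding change in the odds ratio, we can get a general idea of the magnitude and direction of each covariate's effect.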
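As a minimal sketch of that model-level view: exponentiating a fitted coefficient gives the odds ratio for a one-unit increase in the covariate. The data below are made up, and the names `age`, `smoker`, and `disease` are hypothetical, chosen only for illustration.

```r
# Made-up data: a binary outcome and two hypothetical covariates
set.seed(1)
n <- 500
age <- rnorm(n, 50, 10)
smoker <- rbinom(n, 1, 0.3)
disease <- rbinom(n, 1, plogis(-5 + 0.08 * age + 0.9 * smoker))

fit <- glm(disease ~ age + smoker, family = binomial)

# exp(coefficient) is the odds ratio for a one-unit increase in that covariate
odds.ratios <- exp(coef(fit))
# Wald 95% intervals, transformed to the odds-ratio scale
or.ci <- exp(confint.default(fit))
print(cbind(OR = odds.ratios, or.ci))
```

An odds ratio above 1 means the covariate raises the odds of the outcome, below 1 that it lowers them. This describes the model as a whole, not any single patient.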

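But what if, for a single patient, we want to know his or her greatest risk factors and the factors most in his or her favor? I am particularly interested in the ones the patient can actually do something about.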

What is the best way to do this?

The approach I am currently considering is captured in the following R code (taken from this thread):

#Derived from Collett 'Modelling Binary Data' 2nd Edition p.98-99
#Need reproducible "random" numbers.
seed <- 67

num.students <- 1000
which.student <- 1

#Generate data frame with made-up data from students:
set.seed(seed) #reset seed
v1 <- rbinom(num.students,1,0.7)
v2 <- rnorm(length(v1),0.7,0.3)
v3 <- rpois(length(v1),1)

#Create df representing students
students <- data.frame(
    intercept = rep(1,length(v1)),
    outcome = v1,
    score1 = v2,
    score2 = v3
)
print(head(students))

predict.and.append <- function(input){
    #Create a vanilla logistic model as a function of score1 and score2
    data.model <- glm(outcome ~ score1 + score2, data=input, family=binomial)

    #Calculate predictions and SE.fit with the R package's internal method
    # These are in logits.
    predictions <- as.data.frame(predict(data.model, se.fit=TRUE, type='link'))

    predictions$actual <- input$outcome
    predictions$lower <- plogis(predictions$fit - 1.96 * predictions$se.fit)
    predictions$prediction <- plogis(predictions$fit)
    predictions$upper <- plogis(predictions$fit + 1.96 * predictions$se.fit)


    return (list(data.model, predictions))
}

output <- predict.and.append(students)

data.model <- output[[1]]

#summary(data.model)

#Export vcov matrix 
model.vcov <- vcov(data.model)

# Now our goal is to reproduce 'predictions' and the se.fit manually using the vcov matrix
this.student.predictors <- as.matrix(students[which.student,c(1,3,4)])

#Prediction:
this.student.prediction <- sum(this.student.predictors * coef(data.model))
square.student <- t(this.student.predictors) %*% this.student.predictors
se.student <- sqrt(sum(model.vcov * square.student))

manual.prediction <- data.frame(lower = plogis(this.student.prediction - 1.96*se.student),
    prediction = plogis(this.student.prediction), 
    upper = plogis(this.student.prediction + 1.96*se.student))

print("Data preview:")
print(head(students))
print(paste("Point estimate of the outcome probability for student", which.student, "(2.5%, point prediction, 97.5%) by Collett's procedure:"))
print(manual.prediction)  # explicit print so the result also shows when the script is source()d
print(paste("Point estimate of the outcome probability for student", which.student, "(2.5%, point prediction, 97.5%) by R's predict.glm:"))
print(output[[2]][which.student,c('lower','prediction','upper')])

I am also thinking about looking at

this.student.prediction.list <- this.student.predictors * coef(data.model)

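and trying to extract information from the individual summands of the sum behind the probability estimate (strictly, the sum is the log-odds, which plogis converts to the probability), but I don't know how to go about it.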
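To make the summands concrete, here is a small self-contained sketch that rebuilds the same made-up student data and shows that the per-coefficient terms add up to the logit, not to the probability itself. The name `contributions` is introduced here only for illustration.

```r
# Same made-up data as above, rebuilt so this snippet runs on its own
set.seed(67)
n <- 1000
students <- data.frame(
    outcome = rbinom(n, 1, 0.7),
    score1 = rnorm(n, 0.7, 0.3),
    score2 = rpois(n, 1)
)
fit <- glm(outcome ~ score1 + score2, data = students, family = binomial)

which.student <- 1
# Design-matrix row for this student: (Intercept), score1, score2
x <- model.matrix(fit)[which.student, ]
# One summand per coefficient, on the log-odds scale
contributions <- x * coef(fit)
print(contributions)

# The summands add up to the linear predictor (the logit) ...
logit <- sum(contributions)
stopifnot(isTRUE(all.equal(logit, unname(predict(fit, type = "link")[which.student]))))
# ... and plogis() maps the logit to the predicted probability
print(plogis(logit))
```

Because `plogis` is nonlinear, the contributions are additive only on the log-odds scale. The predicted probability does not split into per-covariate pieces in the same way.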

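I could look at:

- which variables make the largest absolute contribution to the probability estimate, and take those as the greatest risk factors;
- which variables differ most from their average proportion, i.e. work out the proportion each variable contributes to the probability estimate on average and see which variables differ most from that proportion in this particular observation;
- a combination of the two: weight the absolute difference between the average proportion and the observed proportion by the average proportion, and take the variables with the largest weighted values.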

Which of these makes the most sense? Is any of these approaches a reasonable way to answer the question?
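One possible reading of the "difference from the average" idea, sketched under the assumption that contributions are compared on the log-odds scale: `predict(..., type = "terms")` returns each term's contribution centered at its average over the data, so ranking by absolute value shows which covariates move this student furthest from the average student. This is an illustration of one option, not a claim that it is the best of the three.

```r
# Same made-up data as above, rebuilt so this snippet runs on its own
set.seed(67)
n <- 1000
students <- data.frame(
    outcome = rbinom(n, 1, 0.7),
    score1 = rnorm(n, 0.7, 0.3),
    score2 = rpois(n, 1)
)
fit <- glm(outcome ~ score1 + score2, data = students, family = binomial)
which.student <- 1

# Per-term contributions on the log-odds scale, centered at each term's
# average over the data: (x_j - mean(x_j)) * beta_j
terms.all <- predict(fit, type = "terms")
centered <- terms.all[which.student, ]

# Rank by absolute size; positive values push this student's log-odds
# above the average student's, negative values pull it below
ranked <- centered[order(abs(centered), decreasing = TRUE)]
print(ranked)

# Centered terms plus the stored constant recover the student's logit
logit <- sum(centered) + attr(terms.all, "constant")
print(plogis(logit))
```

Ranking raw contributions `x * coef(fit)` instead would favor covariates whose values sit far from zero, which depends on how each variable happens to be coded. Centering removes that dependence.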

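In addition, I would like to know how to obtain confidence intervals for the additive contributions of the individual covariates to the probability estimate.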
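A rough sketch of one way to get such intervals, assuming Wald intervals on the log-odds scale are acceptable: the covariate value is fixed for a given student, so the standard error of a single summand is the absolute covariate value times the coefficient's standard error. The names `contrib`, `se.contrib`, and `ci` are introduced only for illustration.

```r
# Same made-up data as above, rebuilt so this snippet runs on its own
set.seed(67)
n <- 1000
students <- data.frame(
    outcome = rbinom(n, 1, 0.7),
    score1 = rnorm(n, 0.7, 0.3),
    score2 = rpois(n, 1)
)
fit <- glm(outcome ~ score1 + score2, data = students, family = binomial)
which.student <- 1

x <- model.matrix(fit)[which.student, ]
se.beta <- sqrt(diag(vcov(fit)))

# x_j is a known constant, so se(x_j * beta_j) = |x_j| * se(beta_j)
contrib <- x * coef(fit)
se.contrib <- abs(x) * se.beta
ci <- cbind(lower = contrib - 1.96 * se.contrib,
            estimate = contrib,
            upper = contrib + 1.96 * se.contrib)
print(ci)

# For the centered contributions, predict() can report standard errors per term
centered <- predict(fit, type = "terms", se.fit = TRUE)
print(centered$fit[which.student, ])
print(centered$se.fit[which.student, ])
```

These intervals treat each summand separately and ignore the correlation between coefficient estimates, which only matters once summands are added together (as in the `vcov`-based standard error of the full prediction above). They are also intervals for log-odds contributions; there is no matching additive split on the probability scale.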