机器算法验证 - 如何估计 LC50 的置信区间 - 吾爱随笔录

如何估计 LC50 的置信区间

机器算法验证 r 回归物流广义线性模型剂量反应

2022-04-18 04:25:25

这是我的第一个问题，所以我希望问题得到妥善处理（如果不是，我很抱歉......）

我正在使用二项式 GLM 模型 (logit) 来获取一些毒理学数据，以研究农药对某些生物体的影响。相同数量的生物体暴露于不同浓度的农药，并在实验结束时评估存活率。数据和模型如下：

## data:
Concentration <- as.numeric(c(0, 100, 200, 300, 400, 500, 600, 700, 800))
Survival <- as.integer(c(20, 20, 20, 18, 13, 8, 3, 0, 0))
Total <- as.integer(c(20, 20, 20, 20, 20, 20, 20, 20, 20))
Data <- data.frame(Concentration, Survival, Total)
Data$Dead <- Data$Total-Data$Survival
Data$Proport_Surv <- (Data$Total - Data$Dead) / Data$Total

mod <- glm(cbind(Survival, Dead) ~ Concentration, Data, family=binomial(link = "logit"))

这是摘要输出：

    Call:
    glm(formula = cbind(Survival, Dead) ~ Concentration, family = binomial(link = "logit"), data = Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0156  -0.4775   0.1917   0.4356   0.8721  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    6.992127   1.121066   6.237 4.46e-10 ***
Concentration -0.015196   0.002366  -6.423 1.34e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 163.5934  on 8  degrees of freedom
Residual deviance:   3.2464  on 7  degrees of freedom
AIC: 19.401

Number of Fisher Scoring iterations: 5

毒理学中推断任何化学物质毒性的一个重要输出是估计 LC50 值，这是模型估计的导致 50% 死亡率的浓度。我使用了 MASS 的 dose.p() 函数。对于这个数据集：

library(MASS)
dose.p(mod, p=0.5)
             Dose       SE
p = 0.5: 460.1353 18.16643

该农药的 LC50 为 460.1353，但该值应附有置信区间。我的疑问是：如何估计 X 值的置信区间？绘制模型有助于理解我的输出应该是什么：

计算模型的置信区间：

NewData1 <- expand.grid(Conc  = seq(0, 800, length = 9))

P_logit <- predict(mod, newdata = NewData1, se = TRUE, type = "link")
NewData1$P_logit <- exp(P_logit$fit) / (1 + exp(P_logit$fit))
NewData1$SeUp <- exp(P_logit$fit + 1.96*P_logit$se.fit) / (1 + exp(P_logit$fit + 1.96*P_logit$se.fit))
NewData1$SeLo <- exp(P_logit$fit - 1.96*P_logit$se.fit) / (1 + exp(P_logit$fit - 1.96*P_logit$se.fit))

阴谋：

library(ggplot2)

plot <- ggplot()
plot <- plot + geom_point(data = Data, aes(y = Proport_Surv, x = Concentration), shape = 1, size = 2.5)
plot <- plot + xlab("Concentration") + ylab("Proportion of surviving organisms")
plot <- plot + theme(text = element_text(size=15)) + theme_bw()
plot <- plot + geom_line(data = NewData1, aes(x = Concentration, y = P_logit), colour = "black")
plot <- plot + geom_ribbon(data = NewData1, aes(x = Concentration, ymax = SeUp, ymin = SeLo), alpha = 0.2)
plot <- plot + geom_hline(yintercept = 0.5, linetype = "dashed")
plot <- plot + annotate("point", x = 460.1353, y = 0.5, size = 3.25, colour = "blue")
plot

蓝点标注是模型估计的LC50。如何估计虚线与模型的上下置信区间相交的值？

2个回答

最实用的方法是使用可以在 Venables, WN 和 Ripley, BD (2002) Modern Applied Statistics with S. Springer 中找到的调整 dose.p 函数。

下面的代码使用 95% CI 的 Wald 统计量

ec = 0.5
library(VGAM)
eta <- logit(ec) 
beta <- coef(mod)[1:2] 
ecx <- (eta - beta[1])/beta[2] 
pd <- -cbind(1, ecx)/beta[2] 
ff = as.matrix(vcov(mod)[1:2,1:2])
se <- sqrt(((pd %*% ff )* pd) %*% c(1, 1))
upper = (ecx+se*1.96)
lower = (ecx-se*1.96)
df1 = data.frame(ecx, lower, upper)


plot <- plot + geom_vline(xintercept = df1$ecx, linetype = "dashed")
plot <- plot + geom_vline(xintercept = df1$lower, linetype = "dashed")
plot <- plot + geom_vline(xintercept = df1$upper, linetype = "dashed")
plot

让我们回到这里的首要原则，看看你想要估计什么。具有二项式族和 logit 链接函数的 GLM 只是逻辑回归模型，因此我们可以使用该模型的回归方程。让 $p \equiv \mathbb{P}(Y=1)$ 是积极响应结果的概率（这里是生物体的死亡），您的模型基于模型方程：

\log (\frac{p}{1 - p}) = β_{0} + β_{1} x .

$\log \Big( \frac{p}{1-p} \Big) = \beta_0 + \beta_1 x.$

环境 $p = \tfrac{1}{2}$ 给出相应的解释值（此处为 LC50 浓度）：

x_{*} \equiv \frac{\log (\frac{1}{2} / \frac{1}{2}) - β_{0}}{β_{1}} = \frac{\log (1) - β_{0}}{β_{1}} = - \frac{β_{0}}{β_{1}} .

$x_* \equiv \frac{\log(\tfrac{1}{2}/\tfrac{1}{2}) - \beta_0}{\beta_1} = \frac{\log(1) - \beta_0}{\beta_1} = - \frac{\beta_0}{\beta_1}.$

因此，您实际上是在尝试找到逻辑回归中真实系数比率的置信区间。通常我们通过正态分布来近似估计系数的联合分布（通过求助于中心极限定理 $n$ )，因此我们试图根据观察到的估计量找到真实比率的置信区间，这些估计量被假定为围绕真实值联合正态分布（具有非零相关性）。（可以使用vcov模型上的命令找到估计量的估计方差 - 协方差矩阵。）

此处的另一个答案已经向您展示了在这种情况下如何使用自举来获得置信区间，但是如果您愿意，可以使用一系列分析近似值；有关此类置信区间的分析形式的一些示例，请参见例如Malley (1982)、Shanmugalingam (1982)和Dunlap 和 Silver (1986)。

其它你可能感兴趣的问题

上一篇什么是因子分析中的*共同方差*，它是如何估计的？下一篇如何改进这个时间序列模型？