Cross Validated - Can weights and an offset produce similar results in Poisson regression?

Tags: generalized-linear-model, modeling, poisson-regression, weights, offset

2022-03-10 19:18:02

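"In the particular case of a Poisson multiplicative GLM it can be shown that modelling claim counts with an offset term equal to the log of the exposure produces identical results to modelling claim frequencies with prior weights set equal to the exposure of each observation."

I could not find any further reference for this result, so I ran some empirical tests, and in those tests I found no evidence that the statement is correct. Can anyone offer some insight into why this result might be right or wrong?

For reference, I used the following R code to test the claim. With it I was unable to obtain similar results for the two cases mentioned: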

n=1000
m=10

# Generate random data
X = matrix(data = rnorm(n*m)+1, ncol = m, nrow = n)

intercept = 2
coefs = runif(m)
offset = runif(n)
## DGP: exp of Intercept + linear combination X variables + log(offset)
mu = exp(intercept + X%*%coefs + log(offset))
y = rpois(n=n, lambda=mu)

df = data.frame('y'=y, 'X'=X, 'offset' = offset)
formula = paste("y ~",paste(colnames(df)[grepl("X", colnames(df))], collapse = "+"))

#First model using log(offset) as offset
fit1  = glm(formula, family = "poisson", df, offset = log(offset))
#Second model using offset as weights for individual observations
fit2 = glm(formula, family = "poisson", df, weights = offset) 
#Third model using poisson model on y/offset as reference
dfNew = df
dfNew$y = dfNew$y/offset
fit3 = glm(formula, family = "poisson", dfNew)

#Combine coefficients with the true coefficients
rbind(fit1$coefficients, fit2$coefficients, fit3$coefficients, c(intercept,coefs))

Running this code produces the following coefficient estimates:

    (Intercept)       X.1       X.2       X.3        X.4       X.5       X.6
[1,]    1.998277 0.2923091 0.4586666 0.1802960 0.11688860 0.7997154 0.4786655
[2,]    1.588620 0.2708272 0.4540180 0.1901753 0.07284985 0.7928951 0.5100480
[3,]    1.983903 0.2942196 0.4593369 0.1782187 0.11846876 0.8018315 0.4807802
[4,]    2.000000 0.2909240 0.4576965 0.1807591 0.11658183 0.8005451 0.4780123
              X.7       X.8       X.9      X.10
[1,]  0.005772078 0.9154808 0.9078758 0.3512824
[2,] -0.003705015 0.9117014 0.9063845 0.4155601
[3,]  0.007595660 0.9181014 0.9076908 0.3505173
[4,]  0.005881960 0.9150350 0.9084375 0.3511749

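We can see that the coefficients are not identical. (Rows 1 to 3 are fit1, fit2 and fit3; row 4 holds the true coefficients.)

2 Answers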

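(With your R code, you can replace "poisson" with "quasipoisson" to avoid all the warnings it generates. Nothing else of importance will change. See (*) below.) Your reference uses the term "multiplicative glm", which I think just means a glm with a log link, since a log link can be thought of as a multiplicative model. Your own example shows that the claim is false, at least as we interpret it, because the estimated parameters are not equal. You could write to the authors and ask what they mean. Below I argue why the claim is false.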

Let $\lambda_i$ be the Poisson parameter and $\omega_i$ the weights. Let $\eta_i$ be the linear predictor without the offset, so that $\eta_i+\log(\omega_i)$ is the linear predictor with the offset. The Poisson probability function is

$$f(y_i) = e^{-\lambda_i} \lambda_i^{y_i}/y_i !$$

Then the log-likelihood function of the model with the offset becomes

$$\ell = -\sum_i \omega_i e^{\eta_i} + \sum_i y_i \eta_i +\sum_i y_i\log \omega_i - \sum_i \log y_i!$$

while the log-likelihood function of the model with weights becomes

$$\ell^w = -\sum_i \omega_i e^{\eta_i}+\sum_i y_i \omega_i \eta_i -\sum_i \omega_i \log y_i!$$

These are clearly not the same, so it is not clear to me what these authors mean.
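To make the difference concrete, one can differentiate the two log-likelihoods $\ell$ and $\ell^w$ with respect to the coefficients. The sketch below assumes $\eta_i = x_i^\top\beta$; the symbols $x_i$ and $\beta$ are introduced here for illustration and do not appear in the answer itself.

```latex
% Score equations, assuming \eta_i = x_i^\top \beta (notation introduced for this sketch)
\begin{align*}
\frac{\partial \ell}{\partial \beta}
  &= \sum_i \bigl(y_i - \omega_i e^{\eta_i}\bigr)\, x_i = 0
  && \text{(offset model)} \\
\frac{\partial \ell^w}{\partial \beta}
  &= \sum_i \omega_i \bigl(y_i - e^{\eta_i}\bigr)\, x_i = 0
  && \text{(weighted model)}
\end{align*}
```

The first equation matches the counts to $\omega_i e^{\eta_i}$, while the second matches them to $e^{\eta_i}$ and only reweights the residuals, so the solutions generally differ. If the weights are independent of the covariates and the offset model is the true one, a rough calculation suggests that the weighted fit mainly shifts the intercept by about $\log(E[\omega^2]/E[\omega])$. For weights uniform on (0, 1) that is $\log(2/3) \approx -0.41$, which is in line with the intercept of about 1.59 against a true value of 2 in the question's output.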

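(*) A note from the help page of R's glm function:

Non-'NULL' 'weights' can be used to indicate that different observations have different dispersions (with the values in 'weights' being inversely proportional to the dispersions); or equivalently, when the elements of 'weights' are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.

Looking into what the weights argument means explains this. It makes little sense with the poisson family function, which assumes a constant scale parameter $\phi=1$, while the weights argument modifies $\phi$. It does make more sense with the quasipoisson family function. See the answer to "Input for 'weights' in the glm and lm functions in R"; the answer given there also helps to see why the likelihood in the weighted case takes the form given above.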
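The remark about "poisson" versus "quasipoisson" can be checked with a minimal sketch on simulated data. All variable names and parameter values below are illustrative assumptions, not taken from the question.

```r
# Minimal sketch: poisson vs quasipoisson with prior weights.
# Data and names (exposure, freq, ...) are hypothetical.
set.seed(1)
n <- 500
x <- rnorm(n)
exposure <- runif(n, 0.5, 2)                      # assumed exposure per observation
y <- rpois(n, lambda = exposure * exp(0.5 + 0.3 * x))
freq <- y / exposure                              # non-integer response

# poisson family: the fit works but R warns about non-integer responses
fit_pois <- suppressWarnings(
  glm(freq ~ x, family = poisson, weights = exposure)
)

# quasipoisson family: same estimating equations, no such warning
fit_quasi <- glm(freq ~ x, family = quasipoisson, weights = exposure)

# The point estimates agree; only the dispersion estimate
# (and therefore the standard errors) differs.
print(all.equal(coef(fit_pois), coef(fit_quasi)))
print(summary(fit_quasi)$dispersion)
```

Under these assumptions the two families return the same coefficients, which is why switching to "quasipoisson" only silences the warnings and rescales the standard errors.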

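The answer given here may also be relevant, and it is very interesting: "How is a Poisson rate regression equal to a Poisson regression with corresponding offset term?"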

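Sorry for not simply adding a comment above, but I don't have enough reputation.

The original claim, slightly modified, is actually correct.

The following two models give exactly the same answer in R using a poisson glm with a log link:

model y, using an offset of log(offset)
model y / offset, using weights equal to offset

Slightly adapting the original code shows the same answers: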

n=1000
m=10

# Generate random data
X = matrix(data = rnorm(n*m)+1, ncol = m, nrow = n)

intercept = 2
coefs = runif(m)
offset = runif(n)
## DGP: exp of Intercept + linear combination X variables + log(offset)
mu = exp(intercept + X%*%coefs + log(offset))
y = rpois(n=n, lambda=mu)

df = data.frame('y' = y,
                'y_over_offset' = y/offset,
                'X' = X,
                'offset' = offset)

formula_offset = paste("y ~",paste(colnames(df)[grepl("X", colnames(df))], collapse = "+"))
formula_weights = paste("y_over_offset ~",paste(colnames(df)[grepl("X", colnames(df))], collapse = "+"))

#First model using log(offset) as offset
fit1  = glm(formula_offset, family = "poisson", df, offset = log(offset))
#Second model using offset as weights for individual observations
fit2 = glm(formula_weights, family = "poisson", df, weights = offset) 


#Combine coefficients with the true coefficients
rbind(fit1$coefficients, fit2$coefficients, c(intercept,coefs))

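This should give the same coefficients for fit1 and fit2, up to numerical tolerance. (R warns about the non-integer responses in fit2.)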
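For a compact check of all three variants side by side, here is a self-contained sketch with a single covariate. The seed, sample size, coefficients and variable names are assumptions made for illustration.

```r
# Self-contained sketch: offset model vs weighted frequency model
# vs weights on raw counts. All names and values are hypothetical.
set.seed(42)
n <- 1000
x <- rnorm(n)
exposure <- runif(n, 0.1, 1)
y <- rpois(n, lambda = exposure * exp(2 + 0.5 * x))
d <- data.frame(y = y, x = x, exposure = exposure, freq = y / exposure)

# Counts with log(exposure) as offset
fit_offset <- glm(y ~ x, family = poisson, data = d, offset = log(exposure))

# Frequencies (y / exposure) with exposure as prior weights
fit_weight <- suppressWarnings(
  glm(freq ~ x, family = poisson, data = d, weights = exposure)
)

# Raw counts with exposure as prior weights (the question's fit2)
fit_raw_wt <- glm(y ~ x, family = poisson, data = d, weights = exposure)

print(rbind(offset         = coef(fit_offset),
            weighted_freq  = coef(fit_weight),
            weighted_count = coef(fit_raw_wt)))

# The first two agree up to the convergence tolerance of glm
print(all.equal(coef(fit_offset), coef(fit_weight), tolerance = 1e-6))
```

The third row is included only to show that putting the weights on the raw counts, as in the question, is a different model from the other two.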

It can be shown that the two models are statistically equivalent (there is a CAS paper somewhere that shows this; I will post a link if I have time).
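One way to see the equivalence is sketched below. It assumes a log link, $\eta_i = x_i^\top\beta$, and exposures $\omega_i > 0$. This is only an outline of the argument, not the derivation from the paper mentioned above, and the labels $\ell^{\text{freq}}$ and $\ell^{\text{offset}}$ are introduced here. Weight the Poisson log-likelihood of the frequency $y_i/\omega_i$ by $\omega_i$ and compare it with the offset model:

```latex
% "const" collects every term that does not involve \beta
\begin{align*}
\ell^{\text{freq}}
  &= \sum_i \omega_i \Bigl( -e^{\eta_i} + \tfrac{y_i}{\omega_i}\,\eta_i \Bigr) + \text{const}
   = -\sum_i \omega_i e^{\eta_i} + \sum_i y_i \eta_i + \text{const}, \\
\ell^{\text{offset}}
  &= -\sum_i \omega_i e^{\eta_i} + \sum_i y_i \eta_i + \text{const}.
\end{align*}
```

Both expressions depend on $\beta$ through the same two terms, so they share the score equation $\sum_i (y_i - \omega_i e^{\eta_i})\, x_i = 0$ and therefore the same maximiser. The second derivatives coincide as well, so with the dispersion fixed at 1 the standard errors should agree too.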

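Incidentally, if you are doing penalized regression, the way different packages (such as glmnet and H2o) measure deviance for the two different ways of defining the model may lead to different results.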
