机器算法验证 - 泊松回归的 Log Link 与 Identity Link 的优缺点 - 吾爱随笔录

泊松回归的 Log Link 与 Identity Link 的优缺点

机器算法验证物流泊松回归链接功能

2022-02-05 14:30:07

我正在执行泊松回归，最终目标是比较（并获取差异）我模型中两个因子水平之间的预测平均数： $\hat{\mu}_1-\hat{\mu}_2$ ，同时保持其他模型协变量（它们都是二进制的）不变。我想知道是否有人可以就何时使用日志链接和身份链接提供一些实用建议。考虑到我比较差异的目标，泊松回归中这两个不同的链接函数的优缺点是什么？

对于逻辑/二项式回归（使用 logit 链接或身份链接）来比较两个因子水平之间的比例差异并需要类似的建议，我也有相同的目标。我在这里阅读了一些涉及此问题的帖子，但似乎没有人解释为什么或何时可能会选择一个链接而不是另一个链接以及利弊可能是什么。在此先感谢您的帮助！

更新：

我还意识到使用某些链接函数的主要目的是将可能预测值的范围限制在平均响应的范围内（例如，对于逻辑，范围限制在 0 和 1 之间，对于日志链接，预测被限制为正数）。所以，我想我要问的是，如果我使用身份链接来表示逻辑/二项式回归，并且我的结果在范围 (0,1) 内，是否真的需要使用逻辑链接函数或我可以让使用身份链接的想法更简单吗？

2个回答

在泊松回归的情况下，身份链接的缺点是：

正如您所提到的，它可以产生超出范围的预测。
尝试拟合模型时，您可能会收到奇怪的错误和警告，因为该链接允许 lambda 小于 0，但未为此类值定义泊松分布。
由于泊松回归假设均值和方差相同，因此当您更改链接时，您也更改了关于方差的假设。我的经验是，最后一点最能说明问题。

但是，归根结底，这是一个经验问题。适合两种型号。执行您喜欢的任何检查。如果身份链接的 AIC 较低，并且在所有其他检查中表现良好或更好，则使用身份链接运行。

在 logit 模型与线性概率模型（即您所说的身份链接）的情况下，情况要简单得多。除了计量经济学中一些非常奇特的案例（如果您进行搜索，您会发现），logit 模型更好：它做出的假设更少，并且是大多数人使用的。使用线性概率模型来代替它几乎是不正当的。

至于解释模型，如果你使用 R，有两个很棒的包可以完成所有繁重的工作：效果，超级容易使用，以及zelig，更难使用，但如果你想进行预测，那就太好了.

在泊松模型的情况下，我还要说，应用程序通常会决定您的协变量是加法（这意味着身份链接）还是线性尺度上的乘法（这意味着对数链接）。但是具有恒等链接的泊松模型通常也只有在对拟合系数施加非负性约束时才有意义并且只能稳定拟合 - 这可以使用nnpoisRaddreg包中的nnlm函数或使用NNLM包裹。因此，我不同意将泊松模型与身份和日志链接同时拟合，并查看哪个最终具有最佳 AIC 并基于纯统计基础推断出最佳模型——相反，在大多数情况下，它是由试图解决的问题或手头数据的底层结构。

例如，在色谱法（GC/MS 分析）中，人们通常会测量几个近似高斯形峰的叠加信号，而这个叠加信号是用电子倍增器测量的，这意味着测量的信号是离子计数，因此是泊松分布的。由于根据定义，每个峰都有一个正的高度并且具有加性作用，并且噪声是泊松，因此在这里使用具有恒等链接的非负泊松模型是合适的，而对数链接泊松模型将是完全错误的。在工程中，Kullback-Leibler 损失通常用作此类模型的损失函数，最小化这种损失相当于优化非负身份链接泊松模型的可能性（还有其他散度/损失度量，如alpha 或 beta 散度以泊松为特例）。

下面是一个数值示例，包括一个常规无约束身份链接 Poisson GLM 不适合的演示（因为缺少非负约束）以及有关如何使用非负身份链接 Poisson 模型拟合的一些细节nnpois，这里是在使用包含单个峰的测量形状的移动副本的带状协变量矩阵对色谱峰的测量叠加与泊松噪声进行去卷积的背景下。这里的非负性很重要，原因如下：（1）它是手头数据的唯一现实模型（这里的峰不能有负高度），（2）它是稳定地拟合具有恒等链接的泊松模型的唯一方法（如否则，某些协变量值的预测可能会变为负值，这没有意义，并且在尝试评估可能性时会出现数值问题），（3）非负性起到规范回归问题的作用，并且极大地有助于获得稳定的估计（例如您通常不会像普通的无约束回归那样遇到过度拟合问题，非负性约束导致更稀疏的估计，通常更接近基本事实；对于下面的反卷积问题，例如性能与 LASSO 正则化一样好，但不需要调整任何正则化参数。（L0-pseudonorm 惩罚回归的性能仍然稍好，但计算成本更高）

# we first simulate some data
require(Matrix)
n = 200
x = 1:n
npeaks = 20
set.seed(123)
u = sample(x, npeaks, replace=FALSE) # unkown peak locations
peakhrange = c(10,1E3) # peak height range
h = 10^runif(npeaks, min=log10(min(peakhrange)), max=log10(max(peakhrange))) # unknown peak heights
a = rep(0, n) # locations of spikes of simulated spike train, which are assumed to be unknown here, and which needs to be estimated from the measured total signal
a[u] = h
gauspeak = function(x, u, w, h=1) h*exp(((x-u)^2)/(-2*(w^2))) # peak shape function
bM = do.call(cbind, lapply(1:n, function (u) gauspeak(x, u=u, w=5, h=1) )) # banded matrix with peak shape measured beforehand
y_nonoise = as.vector(bM %*% a) # noiseless simulated signal = linear convolution of spike train with peak shape function
y = rpois(n, y_nonoise) # simulated signal with random poisson noise on it - this is the actual signal as it is recorded
par(mfrow=c(1,1))
plot(y, type="l", ylab="Signal", xlab="x", main="Simulated spike train (red) to be estimated given known blur kernel & with Poisson noise")
lines(a, type="h", col="red")

# let's now deconvolute the measured signal y with the banded covariate matrix containing shifted copied of the known blur kernel/peak shape bM

# first observe that regular OLS regression without nonnegativity constraints would return very bad nonsensical estimates
weights <- 1/(y+1) # let's use 1/variance = 1/(y+eps) observation weights to take into heteroscedasticity caused by Poisson noise
a_ols <- lm.fit(x=bM*sqrt(weights), y=y*sqrt(weights))$coefficients # weighted OLS
plot(x, y, type="l", main="Ground truth (red), unconstrained OLS estimate (blue)", ylab="Peak shape", xlab="x", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_ols, type="h", col="blue", lwd=2)

# now we use weighted nonnegative least squares with 1/variance obs weights as an approximation of nonnegative Poisson regression
# this gives very good estimates & is very fast
library(nnls)
library(microbenchmark)
microbenchmark(a_wnnls <- nnls(A=bM*sqrt(weights),b=y*sqrt(weights))$x) # 7 ms
plot(x, y, type="l", main="Ground truth (red), weighted nnls estimate (blue)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_wnnls, type="h", col="blue", lwd=2)
# note that this weighted least square estimate in almost identical to  the nonnegative Poisson estimate below and that it fits way faster!!!

# an unconstrained identity-link Poisson GLM will not fit:
glmfit = glm.fit(x=as.matrix(bM), y=y, family=poisson(link=identity), intercept=FALSE)
# returns Error: no valid set of coefficients has been found: please supply starting values

# so let's try a nonnegativity constrained identity-link Poisson GLM, fit using bbmle (using port algo, ie Quasi Newton BFGS):
library(bbmle)
XM=as.matrix(bM)
colnames(XM)=paste0("v",as.character(1:n))
yv=as.vector(y)
LL_poisidlink <- function(beta, X=XM, y=yv){ # neg log-likelihood function
  -sum(stats::dpois(y, lambda = X %*% beta, log = TRUE)) # PS regular log-link Poisson would have exp(X %*% beta)
}
parnames(LL_poisidlink) <- colnames(XM)
system.time(fit <- mle2(
  minuslogl = LL_poisidlink ,
  start = setNames(a_wnnls+1E-10, colnames(XM)), # we initialise with weighted nnls estimates, with approx 1/variance obs weights
  lower = rep(0,n),
  vecpar = TRUE,
  optimizer = "nlminb"
)) # very slow though - takes 145s 
summary(fit)
a_nnpoisbbmle = coef(fit)
plot(x, y, type="l", main="Ground truth (red), nonnegative Poisson bbmle ML estimate (blue)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_nnpoisbbmle, type="h", col="blue", lwd=2)

# much faster is to fit nonnegative Poisson regression using nnpois using an accelerated EM algorithm:
library(addreg)
microbenchmark(a_nnpois <- nnpois(y=y,
                                  x=as.matrix(bM),
                                  standard=rep(1,n),
                                  offset=0,
                                  start=a_wnnls+1.1E-4, # we start from weighted nnls estimates 
                                  control = addreg.control(bound.tol = 1e-04, epsilon = 1e-5),
                                  accelerate="squarem")$coefficients) # 100 ms
plot(x, y, type="l", main="Ground truth (red), nonnegative Poisson nnpois estimate (blue)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_nnpois, type="h", col="blue", lwd=2)

# or to fit nonnegative Poisson regression using nnlm with Kullback-Leibler loss using a coordinate descent algorithm:
library(NNLM)
system.time(a_nnpoisnnlm <- nnlm(x=as.matrix(rbind(bM)),
                                 y=as.matrix(y, ncol=1),
                                 loss="mkl", method="scd",
                                 init=as.matrix(a_wnnls, ncol=1),
                                 check.x=FALSE, rel.tol=1E-4)$coefficients) # 3s
plot(x, y, type="l", main="Ground truth (red), nonnegative Poisson nnlm estimate (blue)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_nnpoisnnlm, type="h", col="blue", lwd=2)

其它你可能感兴趣的问题

上一篇检测录音中的峰值数量下一篇量化两个数据集之间的相似性