Recomputing the log-likelihood from a simple R lm model

r · generalized-linear-model · likelihood · lm
2022-03-18 05:40:02

I am simply trying to recompute, with dnorm(), the log-likelihood that the logLik function returns for an lm model (in R).

It works (almost perfectly) with a large amount of data (e.g. n = 1000):

> n <- 1000
> x <- 1:n
> set.seed(1)
> y <- 10 + 2*x + rnorm(n, 0, 2)
> mod <- glm(y ~ x, family = gaussian)
> logLik(mod)
'log Lik.' -2145.562 (df=3)
> sigma <- sqrt(summary(mod)$dispersion)
> sum(log(dnorm(x = y, mean = predict(mod), sd = sigma)))
[1] -2145.563
> sum(log(dnorm(x = resid(mod), mean = 0, sd = sigma)))
[1] -2145.563

But with a small dataset there is a clear discrepancy:

> n <- 5
> x <- 1:n
> set.seed(1)
> y <- 10 + 2*x + rnorm(n, 0, 2)
> 
> mod <- glm(y ~ x, family = gaussian)
> logLik(mod)
'log Lik.' -8.915768 (df=3)
> sigma <- sqrt(summary(mod)$dispersion)
> sum(log(dnorm(x = y, mean = predict(mod), sd = sigma)))
[1] -9.192832
> sum(log(dnorm(x = resid(mod), mean = 0, sd = sigma)))
[1] -9.192832

Since the effect only shows up for small datasets, I thought it might come from a difference in the residual estimates between lm and glm, but using lm gives the same result as glm:

> modlm <- lm(y ~ x)
> logLik(modlm)
'log Lik.' -8.915768 (df=3)
> 
> sigma <- summary(modlm)$sigma
> sum(log(dnorm(x = y, mean = predict(modlm), sd = sigma)))
[1] -9.192832
> sum(log(dnorm(x = resid(modlm), mean = 0, sd = sigma)))
[1] -9.192832

Where am I going wrong?

1 Answer

The logLik() function evaluates the log-likelihood by plugging the ML estimates of the parameters in for the unknown parameter values. Now, the maximum likelihood estimates of the regression coefficients (the $\beta_j$'s in $X\beta$) coincide with the least-squares estimates, but the ML estimate of $\sigma$ is $\hat\sigma_{\mathrm{ML}} = \sqrt{\tfrac{1}{n}\sum_i \hat\epsilon_i^2}$, whereas you are using the square root of the unbiased estimate of $\sigma^2$, $\tfrac{1}{n-2}\sum_i \hat\epsilon_i^2$.
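
Written out, with $p$ the number of columns of the model matrix (here $p = 2$), the two estimators are related by

$$
\hat\sigma^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n \hat\epsilon_i^2,
\qquad
s^2 = \frac{1}{n-p}\sum_{i=1}^n \hat\epsilon_i^2,
\qquad
\hat\sigma_{\mathrm{ML}} = s\sqrt{\frac{n-p}{n}},
$$

so the difference fades as $n$ grows, which is why your check looked fine for $n = 1000$ but not for $n = 5$. This rescaling is what sigma.ML applies below: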

>  n <- 5
>  x <- 1:n
>  set.seed(1)
>  y <- 10 + 2*x + rnorm(n, 0, 2)
>  modlm <- lm(y ~ x)
>  sigma <- summary(modlm)$sigma
> 
>  # value of the log-likelihood with the "classical" (unbiased) sigma hat
>  sum(log(dnorm(x = y, mean = predict(modlm), sd = sigma)))
[1] -9.192832
> 
>  # value of the log-likelihood with the ML sigma hat
>  sigma.ML <- sigma*sqrt((n-dim(model.matrix(modlm))[2])/n) 
>  sum(log(dnorm(x = y, mean = predict(modlm), sd = sigma.ML)))
[1] -8.915768
>  logLik(modlm)
'log Lik.' -8.915768 (df=3)
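
As an additional cross-check (a minimal sketch reusing the modlm object fitted above), the same value can be obtained from the closed-form Gaussian log-likelihood evaluated at the ML estimates, without calling dnorm() at all:

# Closed-form Gaussian log-likelihood at the ML estimates:
#   l = -n/2 * (log(2*pi) + log(RSS/n) + 1)
n.obs <- nobs(modlm)                               # number of observations (here 5)
rss   <- sum(resid(modlm)^2)                       # residual sum of squares
all.equal(sigma.ML, sqrt(rss / n.obs))             # the ML sigma hat computed directly
-n.obs / 2 * (log(2 * pi) + log(rss / n.obs) + 1)  # should reproduce logLik(modlm)

For an unweighted Gaussian fit this closed form matches what logLik() reports, so it agrees with the dnorm() sum once the ML sigma hat is used.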