只有 1 个观察值的随机效应将如何影响广义线性混合模型?

机器算法验证 r 混合模式 广义线性模型 咕噜咕噜 lme4-nlme
2022-02-12 18:11:15

我有一个数据集,其中我想用作随机效应的变量对于某些级别只有一个观察值。根据对先前问题的回答,我认为原则上这没问题。

我可以用只有 1 个观察值的受试者拟合混合模型吗?

随机截距模型 - 每个受试者一次测量

但是,在第二个链接中,第一个答案指出:

“...假设您没有使用广义线性混合模型GLMM,在这种情况下,过度分散的问题就会发挥作用”

我正在考虑使用 GLMM,但我真的不明白单一观察的随机效应水平将如何影响模型。


这是我正在尝试拟合的模型之一的示例。我正在研究鸟类,我想模拟人口和季节对迁徙期间停留次数的影响。我想使用个人作为随机效应,因为对于某些个人,我有长达 5 年的数据。

library(dplyr)
library(lme4)
pop <- as.character(c("BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA"))
id <- "2 2 4 4 7 7 9 9 10 10 84367 84367 84367 84368 84368 84368 84368 84368 84368 84369 84369 33073 33073 33073 33073 33073 33073 33073 33073 33073 80149 80149 80149 80150 80150 80150 57140 57141 126674 126677 126678 126680 137152 137152 137157 115925 115925 115925 115925 115925 115925 115925 115925 115926 115926 115926 115926 115926 115926 115927 115928 115929 115929 115929 115930 115930 115930 115930 115931 115931 115931 115932 115932 115932"
id <- strsplit(id, " ")
id <- as.numeric(unlist(id))
year <- "2014 2015 2014 2015 2014 2015 2014 2015 2014 2015 2009 2010 2010 2009 2010 2010 2011 2011 2012 2009 2010 2009 2009 2010 2010 2011 2011 2012 2012 2013 2008 2008 2009 2008 2008 2009 2008 2008 2013 2013 2013 2013 2014 2015 2014 2012 2013 2013 2014 2014 2015 2015 2016 2012 2013 2013 2014 2014 2015 2013 2012 2012 2013 2013 2012 2013 2013 2014 2013 2014 2014 2013 2014 2014"
year <- strsplit(year, " ")
year <- as.numeric(unlist(year))
season <- as.character(c("fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "fall", "spring", "fall", "fall", "spring", "fall", "spring", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "spring", "fall", "spring", "spring", "fall", "spring", "spring", "fall", "fall", "fall", "fall", "fall", "fall", "fall", "spring", "fall", "fall", "fall", "spring", "fall", "spring", "fall", "spring", "spring", "fall", "fall", "spring", "fall", "spring", "spring", "fall", "fall", "fall", "fall", "spring", "fall", "fall", "spring", "spring","fall", "fall", "spring", "fall", "fall", "spring"))
stops <- "0 0 0 0 0 0 1 0 2 1 1 0 0 3 2 0 1 1 0 1 1 2 0 1 0 2 0 4 0 0 2 1 1 2 5 2 1 0 9 6 2 3 4 7 2 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0"
stops <- strsplit(stops, " ")
stops <- as.numeric(unlist(stops))

stopdata <- data.frame(pop = pop, id = id, year = year, season = season, stops = stops, stringsAsFactors = FALSE)


stopdata <- group_by(stopdata, pop, id)
summary1 <- summarise(stopdata, n.years = length(year))
table(summary1$n.years)

有27个人。9个人有一个单一的观察。18 个人有 2-9 次观察。

如果 1/3 的随机效应水平只有一个观察值,应该关注什么?


我一直在考虑:

选项 1:如上所述的 GLMM

stops.glmm <- glmer(stops ~ pop + season + (1|id), data=stopdata, family = poisson)

选项 2:加权广义线性模型GLM,使用具有多个观察值的个体的平均值

aggfun <- function(data, idvars=c("pop", "season", "id"), response){
#select id variables, response variable, and year
sub1 <- na.omit(data[,c(idvars, "year", response)])
#aggregate for mean response by year
agg1 <- aggregate(sub1[names(sub1) == response],by=sub1[idvars],FUN=mean)
#sample size for each aggregated group
aggn <- aggregate(sub1[response],by=sub1[idvars],FUN=length)
#rename sample size column
names(aggn)[4] <- "n"
agg2 <- merge(agg1, aggn)
agg2}


#Create weighted dataset
stops.weight <- aggfun(data = stopdata, response = "stops")
stops.weight$stops <- round(stops.weight$stops)

#Weighted GLM
stops.glm <- glm(stops~pop + season, data=stops.weight, family = poisson, weights = n)
2个回答

一般来说,您有可识别性的问题。将随机效应分配给只有一次测量的参数的线性模型无法区分随机效应和残差。

典型的线性混合效应方程如下所示:

E=β+ηi+ϵj

在哪里β是固定效应,ηi是水平的随机效应i, 和ϵj是剩余变异性j测量。当您对具有随机效应的水平只有一次观察时,很难区分ηϵ. 您将(通常)将方差或标准差拟合到ηϵ,所以每个人只有一个测量值,你将不能确定你有一个准确的估计SD(η)SD(ϵ), 但方差总和的估计值 (var(η)+var(ϵ)) 应该是相对稳健的。

关于实际答案:如果您有大约 1/3 的观察结果,每个人只有一次观察,那么总体上您可能没问题。其余人口应提供合理的估计SD(η)SD(ϵ),并且这些人总体上应该是次要的贡献者。另一方面,如果您让所有个体都具有特定的固定效应和随机效应,并且采用单一测量(例如,对于您的示例,可能是整个人口 - 也许这对您来说意味着物种),那么您会不太相信结果.

这个例子使用了泊松分布,默认是链接日志,对吧?我希望随机效应,即使所有组都有一个观察值,来估计对数中的方差,而残差将测量预测中的误差,包括随机效应。

示例,10 次观察,第一次没有随机效应:

 summary(pp <- glmmTMB(nvn~offset(log(vnhr)),data=foc10,family="poisson"))
 Family: poisson  ( log )
 Formula:          nvn ~ offset(log(vnhr))
 Data: foc10
      AIC      BIC   logLik deviance df.resid
    111.5    111.8    -54.7    109.5        9
 Conditional model:
              Estimate Std. Error z value Pr(>|z|)   
  (Intercept) -1.14680    0.09129  -12.56   <2e-16 ***

现在有随机效应

> summary(pp <- glmmTMB(nvn~offset(log(vnhr))+(1|focal),data=foc10,family="poisson"))
 Family: poisson  ( log )
Formula:          nvn ~ offset(log(vnhr)) + (1 | focal)
Data: foc10
     AIC      BIC   logLik deviance df.resid 
    54.1     54.7    -25.0     50.1        8 

[注意更低的AIC]

 Random effects:
 Conditional model:
 Groups Name        Variance Std.Dev.
 focal  (Intercept) 5.088    2.256   
Number of obs: 10, groups:  focal, 10
Conditional model:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.692      1.166  -2.309   0.0209 *

无论如何,没有收敛投诉。
而如果我省略泊松的话。