我有一个数据集,其中我想用作随机效应的变量对于某些级别只有一个观察值。根据对先前问题的回答,我认为原则上这没问题。
但是,在第二个链接中,第一个答案指出:
“...假设您没有使用广义线性混合模型GLMM,在这种情况下,过度分散的问题就会发挥作用”
我正在考虑使用 GLMM,但我真的不明白单一观察的随机效应水平将如何影响模型。
这是我正在尝试拟合的模型之一的示例。我正在研究鸟类,我想模拟人口和季节对迁徙期间停留次数的影响。我想使用个人作为随机效应,因为对于某些个人,我有长达 5 年的数据。
library(dplyr)
library(lme4)
pop <- as.character(c("BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA"))
id <- "2 2 4 4 7 7 9 9 10 10 84367 84367 84367 84368 84368 84368 84368 84368 84368 84369 84369 33073 33073 33073 33073 33073 33073 33073 33073 33073 80149 80149 80149 80150 80150 80150 57140 57141 126674 126677 126678 126680 137152 137152 137157 115925 115925 115925 115925 115925 115925 115925 115925 115926 115926 115926 115926 115926 115926 115927 115928 115929 115929 115929 115930 115930 115930 115930 115931 115931 115931 115932 115932 115932"
id <- strsplit(id, " ")
id <- as.numeric(unlist(id))
year <- "2014 2015 2014 2015 2014 2015 2014 2015 2014 2015 2009 2010 2010 2009 2010 2010 2011 2011 2012 2009 2010 2009 2009 2010 2010 2011 2011 2012 2012 2013 2008 2008 2009 2008 2008 2009 2008 2008 2013 2013 2013 2013 2014 2015 2014 2012 2013 2013 2014 2014 2015 2015 2016 2012 2013 2013 2014 2014 2015 2013 2012 2012 2013 2013 2012 2013 2013 2014 2013 2014 2014 2013 2014 2014"
year <- strsplit(year, " ")
year <- as.numeric(unlist(year))
season <- as.character(c("fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "fall", "spring", "fall", "fall", "spring", "fall", "spring", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "spring", "fall", "spring", "spring", "fall", "spring", "spring", "fall", "fall", "fall", "fall", "fall", "fall", "fall", "spring", "fall", "fall", "fall", "spring", "fall", "spring", "fall", "spring", "spring", "fall", "fall", "spring", "fall", "spring", "spring", "fall", "fall", "fall", "fall", "spring", "fall", "fall", "spring", "spring","fall", "fall", "spring", "fall", "fall", "spring"))
stops <- "0 0 0 0 0 0 1 0 2 1 1 0 0 3 2 0 1 1 0 1 1 2 0 1 0 2 0 4 0 0 2 1 1 2 5 2 1 0 9 6 2 3 4 7 2 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0"
stops <- strsplit(stops, " ")
stops <- as.numeric(unlist(stops))
stopdata <- data.frame(pop = pop, id = id, year = year, season = season, stops = stops, stringsAsFactors = FALSE)
stopdata <- group_by(stopdata, pop, id)
summary1 <- summarise(stopdata, n.years = length(year))
table(summary1$n.years)
有27个人。9个人有一个单一的观察。18 个人有 2-9 次观察。
如果 1/3 的随机效应水平只有一个观察值,应该关注什么?
我一直在考虑:
选项 1:如上所述的 GLMM
stops.glmm <- glmer(stops ~ pop + season + (1|id), data=stopdata, family = poisson)
选项 2:加权广义线性模型GLM,使用具有多个观察值的个体的平均值
aggfun <- function(data, idvars=c("pop", "season", "id"), response){
#select id variables, response variable, and year
sub1 <- na.omit(data[,c(idvars, "year", response)])
#aggregate for mean response by year
agg1 <- aggregate(sub1[names(sub1) == response],by=sub1[idvars],FUN=mean)
#sample size for each aggregated group
aggn <- aggregate(sub1[response],by=sub1[idvars],FUN=length)
#rename sample size column
names(aggn)[4] <- "n"
agg2 <- merge(agg1, aggn)
agg2}
#Create weighted dataset
stops.weight <- aggfun(data = stopdata, response = "stops")
stops.weight$stops <- round(stops.weight$stops)
#Weighted GLM
stops.glm <- glm(stops~pop + season, data=stops.weight, family = poisson, weights = n)