我正在做一些对数线性模型来测试多路列联表中的交互/关联(基于此处的教程,http://ww2.coastal.edu/kingw/statistics/R-tutorials/loglin.html)。我在观察到的频率上使用泊松glm
以及使用MASS
's来执行此操作loglm
。我只是想知道哪种类型的假设检验在这里最有意义,使用顺序类型 I anova()
(不好,因为那里的 p 值取决于模型中因子的顺序),使用Anova()
in的类型 III 测试car
(独立于模型中的因素)还是drop1
从最复杂的模型开始?
例如使用泰坦尼克号乘客生存数据
library(COUNT)
data(titanic)
titanic=droplevels(titanic)
head(titanic)
mytable=xtabs(~class+age+sex+survived, data=titanic)
ftable(mytable)
survived no yes
class age sex
1st class child women 0 1
man 0 5
adults women 4 140
man 118 57
2nd class child women 0 13
man 0 11
adults women 13 80
man 154 14
3rd class child women 17 14
man 35 13
adults women 89 76
man 387 75
freqdata=data.frame(mytable)
fullmodel=glm(Freq~SITE*SEX*MORTALITY,family=poisson,data=freqdata)
那么对于不同分类因素之间的相互作用的最明智的测试是否会由 I 型 SS 给出,如
anova(fullmodel, test="Chisq")
Analysis of Deviance Table
Model: poisson, link: log
Response: Freq
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 23 2173.33
class 2 231.18 21 1942.15 < 2.2e-16 ***
age 1 1072.61 20 869.54 < 2.2e-16 ***
sex 1 137.74 19 731.80 < 2.2e-16 ***
survived 1 77.61 18 654.19 < 2.2e-16 ***
class:age 2 32.41 16 621.78 9.178e-08 ***
class:sex 2 29.61 14 592.17 3.719e-07 ***
age:sex 1 6.09 13 586.09 0.01363 *
class:survived 2 132.69 11 453.40 < 2.2e-16 ***
age:survived 1 25.58 10 427.81 4.237e-07 ***
sex:survived 1 312.93 9 114.89 < 2.2e-16 ***
class:age:sex 2 4.04 7 110.84 0.13250
class:age:survived 2 35.45 5 75.39 2.002e-08 ***
class:sex:survived 2 73.71 3 1.69 < 2.2e-16 ***
age:sex:survived 1 1.69 2 0.00 0.19421
class:age:sex:survived 2 0.00 0 0.00 1.00000
或使用 III 型 SS 使用car
's Anova
:
library(car)
library(afex)
set_sum_contrasts()
Anova(fullmodel, test="LR", type="III")
Analysis of Deviance Table (Type III tests)
Response: Freq
LR Chisq Df Pr(>Chisq)
class 37.353 2 7.744e-09 ***
age 5.545 1 0.0185317 *
sex 0.000 1 0.9999999
survived 1.386 1 0.2390319
class:age 5.476 2 0.0646851 .
class:sex 0.000 2 1.0000000
age:sex 0.000 1 0.9999888
class:survived 16.983 2 0.0002052 ***
age:survived 0.056 1 0.8126973
sex:survived 0.000 1 0.9999953
class:age:sex 0.000 2 1.0000000
class:age:survived 3.461 2 0.1771673
class:sex:survived 0.000 2 1.0000000
age:sex:survived 0.000 1 0.9999905
class:age:sex:survived 0.000 2 1.0000000
或使用单项删除和 LRT drop1
:
fullmodel=glm(Freq~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, family=poisson, data=freqdata)
drop1(fullmodel,test="Chisq")
Single term deletions
Model:
Freq ~ class + age + sex + survived + class:age + class:sex +
class:survived + age:sex + age:survived + sex:survived
Df Deviance AIC LRT Pr(>Chi)
<none> 114.89 249.01
class:age 2 162.76 292.89 47.877 4.016e-11 ***
class:sex 2 115.74 245.86 0.850 0.6537
class:survived 2 230.95 361.08 116.067 < 2.2e-16 ***
age:sex 1 114.89 247.02 0.008 0.9294
age:survived 1 134.39 266.52 19.505 1.003e-05 ***
sex:survived 1 427.81 559.94 312.927 < 2.2e-16 ***
?
[最后一个结果似乎与MASS
's相匹配,loglm
情况应该如此:
fullmodel=loglm(~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, mytable)
stepAIC(fullmodel)
drop1(fullmodel,test="Chisq")
Single term deletions
Model:
~class + age + sex + survived + class:age + class:sex + class:survived +
age:sex + age:survived + sex:survived
Df AIC LRT Pr(>Chi)
<none> 144.89
class:age 2 188.76 47.877 4.016e-11 ***
class:sex 2 141.74 0.850 0.6537
class:survived 2 256.95 116.067 < 2.2e-16 ***
age:sex 1 142.89 0.008 0.9294
age:survived 1 162.39 19.505 1.003e-05 ***
sex:survived 1 455.81 312.927 < 2.2e-16 ***
]
(顺便说一句,还有其他更优雅的方法来指定具有主效应+所有一阶交互效应的模型吗?)
有什么想法是分析这种多路列联表并充分测试不平衡数据集的关联的最佳方法吗?
编辑:根据下面的答案,我找到了drop1
解决方案:
fullmodel=glm(Freq~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, family=poisson, data=freqdata)
drop1(fullmodel,test="Chisq")
这相当于 中的对数线性模型MASS
:
fullmodel=loglm(~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, mytable)
stepAIC(fullmodel)
drop1(fullmodel,test="Chisq")