机器算法验证 - 在列联表频率计数上与对数线性泊松 glm 一起使用的推理类型 - 吾爱随笔录

在列联表频率计数上与对数线性泊松 glm 一起使用的推理类型

机器算法验证 r 广义线性模型泊松分布对数线性

2022-03-09 05:11:07

我正在做一些对数线性模型来测试多路列联表中的交互/关联（基于此处的教程，http://ww2.coastal.edu/kingw/statistics/R-tutorials/loglin.html）。我在观察到的频率上使用泊松glm以及使用MASS's来执行此操作loglm。我只是想知道哪种类型的假设检验在这里最有意义，使用顺序类型 I anova()（不好，因为那里的 p 值取决于模型中因子的顺序），使用Anova()in的类型 III 测试car（独立于模型中的因素）还是drop1从最复杂的模型开始？

例如使用泰坦尼克号乘客生存数据

library(COUNT)
data(titanic)
titanic=droplevels(titanic)
head(titanic)
mytable=xtabs(~class+age+sex+survived, data=titanic)
ftable(mytable)
                       survived  no yes
class     age    sex                   
1st class child  women            0   1
                 man              0   5
          adults women            4 140
                 man            118  57
2nd class child  women            0  13
                 man              0  11
          adults women           13  80
                 man            154  14
3rd class child  women           17  14
                 man             35  13
          adults women           89  76
                 man            387  75
freqdata=data.frame(mytable)
fullmodel=glm(Freq~SITE*SEX*MORTALITY,family=poisson,data=freqdata)

那么对于不同分类因素之间的相互作用的最明智的测试是否会由 I 型 SS 给出，如

anova(fullmodel, test="Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: Freq

Terms added sequentially (first to last)


                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                      23    2173.33              
class                   2   231.18        21    1942.15 < 2.2e-16 ***
age                     1  1072.61        20     869.54 < 2.2e-16 ***
sex                     1   137.74        19     731.80 < 2.2e-16 ***
survived                1    77.61        18     654.19 < 2.2e-16 ***
class:age               2    32.41        16     621.78 9.178e-08 ***
class:sex               2    29.61        14     592.17 3.719e-07 ***
age:sex                 1     6.09        13     586.09   0.01363 *  
class:survived          2   132.69        11     453.40 < 2.2e-16 ***
age:survived            1    25.58        10     427.81 4.237e-07 ***
sex:survived            1   312.93         9     114.89 < 2.2e-16 ***
class:age:sex           2     4.04         7     110.84   0.13250    
class:age:survived      2    35.45         5      75.39 2.002e-08 ***
class:sex:survived      2    73.71         3       1.69 < 2.2e-16 ***
age:sex:survived        1     1.69         2       0.00   0.19421    
class:age:sex:survived  2     0.00         0       0.00   1.00000

或使用 III 型 SS 使用car's Anova：

library(car)
library(afex)
set_sum_contrasts()
Anova(fullmodel, test="LR", type="III")
Analysis of Deviance Table (Type III tests)

Response: Freq
                       LR Chisq Df Pr(>Chisq)    
class                    37.353  2  7.744e-09 ***
age                       5.545  1  0.0185317 *  
sex                       0.000  1  0.9999999    
survived                  1.386  1  0.2390319    
class:age                 5.476  2  0.0646851 .  
class:sex                 0.000  2  1.0000000    
age:sex                   0.000  1  0.9999888    
class:survived           16.983  2  0.0002052 ***
age:survived              0.056  1  0.8126973    
sex:survived              0.000  1  0.9999953    
class:age:sex             0.000  2  1.0000000    
class:age:survived        3.461  2  0.1771673    
class:sex:survived        0.000  2  1.0000000    
age:sex:survived          0.000  1  0.9999905    
class:age:sex:survived    0.000  2  1.0000000

或使用单项删除和 LRT drop1：

fullmodel=glm(Freq~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, family=poisson, data=freqdata)
drop1(fullmodel,test="Chisq")
Single term deletions

Model:
Freq ~ class + age + sex + survived + class:age + class:sex + 
    class:survived + age:sex + age:survived + sex:survived
               Df Deviance    AIC     LRT  Pr(>Chi)    
<none>              114.89 249.01                      
class:age       2   162.76 292.89  47.877 4.016e-11 ***
class:sex       2   115.74 245.86   0.850    0.6537    
class:survived  2   230.95 361.08 116.067 < 2.2e-16 ***
age:sex         1   114.89 247.02   0.008    0.9294    
age:survived    1   134.39 266.52  19.505 1.003e-05 ***
sex:survived    1   427.81 559.94 312.927 < 2.2e-16 ***

[最后一个结果似乎与MASS's相匹配，loglm情况应该如此：

fullmodel=loglm(~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, mytable)
stepAIC(fullmodel) 
drop1(fullmodel,test="Chisq")
Single term deletions

Model:
~class + age + sex + survived + class:age + class:sex + class:survived + 
    age:sex + age:survived + sex:survived
               Df    AIC     LRT  Pr(>Chi)    
<none>            144.89                      
class:age       2 188.76  47.877 4.016e-11 ***
class:sex       2 141.74   0.850    0.6537    
class:survived  2 256.95 116.067 < 2.2e-16 ***
age:sex         1 142.89   0.008    0.9294    
age:survived    1 162.39  19.505 1.003e-05 ***
sex:survived    1 455.81 312.927 < 2.2e-16 ***

]

（顺便说一句，还有其他更优雅的方法来指定具有主效应+所有一阶交互效应的模型吗？）

有什么想法是分析这种多路列联表并充分测试不平衡数据集的关联的最佳方法吗？

编辑：根据下面的答案，我找到了drop1解决方案：

fullmodel=glm(Freq~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, family=poisson, data=freqdata)
drop1(fullmodel,test="Chisq")

这相当于中的对数线性模型MASS：

fullmodel=loglm(~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, mytable)
stepAIC(fullmodel) 
drop1(fullmodel,test="Chisq")

2个回答

如果我必须根据您的设置方式进行选择，我想我会选择Anova()，但两者都没有多大意义。R 将变量输入模型的顺序是标准化的和任意的。我不会用它来定义我要运行的测试。

相反，在 MASS 库中使用?loglin或?loglm，然后删除您对测试感兴趣的特定变量/组合。这里loglm有一个使用泰坦尼克数据集的 R教程。

正如@DWin 所指出的，顺序与非区分对应于有意义的不同假设。所以这个问题只能由研究人员来回答。R 世界中这一点的标准版本是 Venables 的论文，Exegesis on linear models ( pdf )。鉴于您说您只是想知道“哪些因素相互关联”，这似乎不太像条件推断，而更像是从完整模型中删除指定的关联并对其进行测试，或者可能是测试从剥离模型中删除的关联其他变量根本不包括在内。

CAR 推荐用于此类分析的方法是使用符合边际原则的 Anova Type II 测试。

拟合饱和模型：

library(COUNT)
data(titanic)
titanic=droplevels(titanic)
mytable=xtabs(~class+age+sex+survived, data=titanic)
freqdata=data.frame(mytable)
fullmodel=glm(Freq~class*age*sex*survived, family=poisson, data=freqdata)

然后是 II 型方差分析：

> Anova(fullmodel)
Analysis of Deviance Table (Type II tests)

Response: Freq
                       LR Chisq Df Pr(>Chisq)    
class                    231.18  2  < 2.2e-16 ***
age                     1072.61  1  < 2.2e-16 ***
sex                      137.74  1  < 2.2e-16 ***
survived                  77.61  1  < 2.2e-16 ***
class:age                 41.24  2  1.107e-09 ***
class:sex                  2.30  2   0.316761    
age:sex                    0.27  1   0.604214    
class:survived           114.88  2  < 2.2e-16 ***
age:survived              20.34  1  6.486e-06 ***
sex:survived             318.53  1  < 2.2e-16 ***
class:age:sex              9.78  2   0.007509 ** 
class:age:survived        37.26  2  8.101e-09 ***
class:sex:survived        64.07  2  1.220e-14 ***
age:sex:survived           1.69  1   0.194209    
class:age:sex:survived     0.00  2   1.000000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

你从下到上围绕这张桌子工作。很明显，这些术语class:age:sex:survived和age:sex:survived具有较大的 p 值，因此它们可能可以忽略不计。我们在没有这些条件的情况下重新拟合模型：

secondmodel=glm(Freq~class*age*sex*survived - age:sex:survived - class:age:sex:survived, 
                family=poisson, data=freqdata)
Anova(secondmodel)

产量：

> Anova(secondmodel)
Analysis of Deviance Table (Type II tests)

Response: Freq
                   LR Chisq Df Pr(>Chisq)    
class                231.18  2  < 2.2e-16 ***
age                 1072.61  1  < 2.2e-16 ***
sex                  137.74  1  < 2.2e-16 ***
survived              77.61  1  < 2.2e-16 ***
class:age             48.00  2  3.767e-11 ***
class:sex              0.85  2     0.6530    
age:sex                0.27  1     0.6042    
class:survived       115.42  2  < 2.2e-16 ***
age:survived          20.34  1  6.486e-06 ***
sex:survived         318.53  1  < 2.2e-16 ***
class:age:sex         20.27  2  3.971e-05 ***
class:age:survived    44.21  2  2.507e-10 ***
class:sex:survived    73.71  2  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

我们可以检查对数线性拟合本身：

> secondmodel

Call:  glm(formula = Freq ~ class * age * sex * survived - age:sex:survived - 
    class:age:sex:survived, family = poisson, data = freqdata)

Coefficients:
                         (Intercept)                        class2nd class  
                           -25.43983                               1.08137  
                      class3rd class                             ageadults  
                            28.11861                              26.82613  
                              sexman                           survivedyes  
                             5.89242                              25.43983  
            class2nd class:ageadults              class3rd class:ageadults  
                             0.09728                             -24.98930  
               class2nd class:sexman                 class3rd class:sexman  
                            -1.84450                              -4.94864  
                    ageadults:sexman            class2nd class:survivedyes  
                            -2.50803                               1.48358  
          class3rd class:survivedyes                 ageadults:survivedyes  
                           -25.31933                             -21.88449  
                  sexman:survivedyes       class2nd class:ageadults:sexman  
                            -4.28298                               0.93211  
     class3rd class:ageadults:sexman  class2nd class:ageadults:survivedyes  
                             3.00077                              -3.22185  
class3rd class:ageadults:survivedyes     class2nd class:sexman:survivedyes  
                            21.54657                               0.06801  
   class3rd class:sexman:survivedyes  
                             2.89768  

Degrees of Freedom: 23 Total (i.e. Null);  3 Residual
Null Deviance:      2173 
Residual Deviance: 1.685    AIC: 147.8
> 1 - pchisq(1.685, 3)
[1] 0.6402738

我们不拒绝失拟检验（拟合模型与饱和模型）的空值，因此我们得出结论，该模型非常适合数据。然后，您可以检查来自的系数summary(secondmodel)。

其它你可能感兴趣的问题

上一篇检验分位数回归线斜率的统计差异下一篇连续变量违反 Cox 比例危害