机器算法验证 - III 型 SS 缺失细胞 - 吾爱随笔录

我有一个 y ~ x + z + x:z 形式的线性模型。

我有不平衡的数据，并且还有一些丢失的单元格（即使没有那些丢失的单元格，数据也会不平衡）。

我对不同类型 SS 的理解主要来自于如何使用 R 中的不同模型比较手动获得 SS。例如，要获得与 z 相关的 III 型 SS，您可以将包含 x 和 x:z 的模型与包括 x、z 和 x:z 的模型。有关此过程的更多详细信息，请参阅此答案。

我将这种模型比较方法（对于 III 型 SS）理解为丢弃不能明确分配给一个术语的任何错误。我可以对缺少单元格的数据执行此方法，并且 R 中 car 包中的 Anova 函数将计算缺失单元格的方差分析表，但我已经阅读了一些地方（例如这里）不应该使用 III 型 SS细胞数据。

使用缺少单元格的 III 型 SS 真的是错误的，还是通常不合适，为什么？
使用缺少单元格的 III 型 SS 与没有缺少单元格（但仍有不平衡数据）的方差分析表有何不同？

编辑

我在 R 中进行了以下模拟，比较了合成数据中不同模式缺失观察的 I 类错误率（平衡，观察完全随机缺失，细胞完全随机缺失，细胞和观察都随机缺失，都具有相同的数字观察）。从结果来看，我认为在缺少单元格的数据上使用 III 型 SS 没有问题。任何数据集似乎都没有违反 I 类错误率。

对于一次运行，我得到了以下观察到的 I 类错误率：

                       intercept       effect A        effect B 
balanced               .056            .058            .039
missing obs            .052            .052            .039 
missing cells          .050            .055            .044   
missing obs and cells  .056            .050            .049

在不同的设置中，我为 B 添加了主效应，并且功率和观察到的 alpha 水平似乎都不受缺失细胞的影响。

我在这里不理解重要的事情吗？

这是我使用的代码：

library(car) # for Type III SS

# Data properties (simulation may not work if changing these)
nLevelsA <- 5
nLevelsB <- 5
nreps <- 5
factA <- rep(1:5, times = nLevelsA*nreps)
factB <- rep(1:5, each = nLevelsB*nreps)
cell <- factor(paste(factA, factB, sep = ""))

nSims <- 1000 # Number of repetitions of the simulation
# Empty output objects
mod1P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
mod2P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
mod3P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
mod4P <- matrix(rep(NA, times = nSims * 3), ncol = 3)

# Create four datasets, all based on the same values, but different patterns of cell counts
for(i in 1:nSims){
    myData0 <- data.frame(y = rnorm(nLevelsA*nLevelsB*nreps, 0, 1), A = factA, B = factB, cell = cell)
    # Use this to add a main effect for B
    # for(m in 1:nLevelsB){
    #       myData0[which(myData0$B == m), 1] <- myData0[which(myData0$B == m), 1] + m
    # }

    # Randomly remove one observation from each cell
    myData1 <- myData0
    for(j in 1:length(levels(cell))){
        # randomly pick one row from each cell
        rowNums <- 1:nrow(myData1)
        rowToDrop <- sample(rowNums[which(myData1$cell == levels(cell)[j])], 1)
        myData1 <- myData1[-rowToDrop, ]
    }

    # Randomly remove observations without regard to cell
    myData2 <- myData1[sample(1:nrow(myData0), nrow(myData1), replace = FALSE), ] 

    # Randomly empty cells from balanced data
    myData3 <- myData0
    cellToDrop <- sample(levels(myData0$cell), 5)
	for(k in 1:length(cellToDrop)){
		myData3 <- myData3[which(myData3$cell != cellToDrop[k]), ]
    }

    # Randomly empty cells from unbalanced data
    myData4 <- myData0
    cellToDrop <- sample(levels(myData0$cell), 3)
	for(l in 1:length(cellToDrop)){
		myData4 <- myData4[which(myData4$cell != cellToDrop[l]), ]
    }
    myData4 <- myData4[sample(1:nrow(myData4), nrow(myData1), replace = FALSE), ] 


    # nrow(myData0  # basis to start with to get the same number observations in each other dataset
    # nrow(myData1) # balanced
    # nrow(myData2) # observations missing at random
    # nrow(myData3) # cells missing at random
    # nrow(myData4) # cells and observations missing at random
    # 
    # 
    # xtabs(~ A + B, data = myData0)
    # xtabs(~ A + B, data = myData1)
    # xtabs(~ A + B, data = myData2)
    # xtabs(~ A + B, data = myData3)
    # xtabs(~ A + B, data = myData4)


    mod1 <- lm(y ~ A + B, data = myData1)
    mod2 <- lm(y ~ A + B, data = myData2)
    mod3 <- lm(y ~ A + B, data = myData3)
    mod4 <- lm(y ~ A + B, data = myData4)

    # P-values for intercept, factor A, and factor B
    mod1P[i, ] <- Anova(mod1, type = 3)$'Pr(>F)'[1:3]
	mod2P[i, ] <- Anova(mod2, type = 3)$'Pr(>F)'[1:3]
    mod3P[i, ] <- Anova(mod3, type = 3)$'Pr(>F)'[1:3]
	mod4P[i, ] <- Anova(mod4, type = 3)$'Pr(>F)'[1:3]

}

# Count how many times a significant test result is found
p1 <- mod1P <= .05
p2 <- mod2P <= .05
p3 <- mod3P <= .05
p4 <- mod4P <= .05

# Get proportion of times a significant test result is found
pobs1 <- apply(p1, MARGIN = 2, FUN = sum) / nSims
pobs2 <- apply(p2, MARGIN = 2, FUN = sum) / nSims
pobs3 <- apply(p3, MARGIN = 2, FUN = sum) / nSims
pobs4 <- apply(p4, MARGIN = 2, FUN = sum) / nSims

# Proportion of experiments observed as significant at 0.05 level.
# Intercept, then factor A, then factor B
pobs1 # balanced
pobs2 # observations missing at random
pobs3 # cells missing at random
pobs4 # cells and observations missing at random