机器算法验证 - 为什么 shapiro.test 在 lm 上的学生化残差上的 I 型错误（功率？！）为 10%，而常规残差仅为 5%？ - 吾爱随笔录

我知道回归模型的残差不是 iid 因此，检查它们是否正常（即使我们知道是这种情况）应该是一个问题，因为它们是依赖的。残差不是独立的，因为它们需要保持两个条件，即它们的和为 0，并且它们与 x 正交。此外，它们没有相同的分布，因为误差取决于特定的 x。

也就是说，当我对残差运行 shapiro 测试时，我发现常规残差的误差为 5%，学生化残差的误差几乎为 10%（而在这两种情况下，iid 都没有保留）。所以有人可以说，由于这些值不是独立同分布的，那么 shapiro 测试在两种情况下都应该拒绝，这意味着第一种情况的功效是 5%，学生化情况的功效是 10%。

任何洞察这是为什么？（如果有一种方法可以解释依赖关系，而不是更大的样本量）

可重现的代码（见输出注释）：

get_lm_resid_normality_p_value <- function(fit, resid_function = MASS::stdres) {
  # library(MASS)
  library(dplyr)
  library(broom)
  # https://en.wikipedia.org/wiki/Studentized_residual
  resids <- fit %>% resid_function(.) 
  if(length(resids) < 5000 & length(resids) > 3) {
    resids %>% shapiro.test %>% glance %>% pull(p.value) %>% return
  } else {
    return(NA)
  }
}

n <- 30
R <- 10000
x <- 1:n

# stdres
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
  y <- x + rnorm(n)
  fit <- lm(y ~ x)
  pv[i] <- get_lm_resid_normality_p_value(fit)
}
mean(pv < 0.05)
# 0.0503 # this is o.k.

# resid
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
  y <- x + rnorm(n)
  fit <- lm(y ~ x)
  pv[i] <- get_lm_resid_normality_p_value(fit, resid_function = resid)
}
mean(pv < 0.05)
# 0.05  # this is o.k.


# MASS::studres
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
  y <- x + rnorm(n)
  fit <- lm(y ~ x)
  pv[i] <- get_lm_resid_normality_p_value(fit, resid_function = MASS::studres)
}
mean(pv < 0.05)
# 0.0992  # this is NOT o.k. (!)

# a way to fix the problem is to only use half the data for building the model. But is there no more efficient way to undo the dependency structure? 
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
   y <- x + rnorm(n)
   xx <- data.frame(x, y)
   fit <- lm(y ~ x, data = xx[1:15,])

   pv[i] <- shapiro.test(xx[-c(1:15),"y"] - predict(fit, newdata = xx[-c(1:15),]))$p.value
}
mean(pv < 0.05)
# 0.0487  # this works since the residuals of the predicted ys are i.i.d