Using lm for a 2-sample proportion test

Cross Validated: r, hypothesis-testing, generalized-linear-model, proportion

2022-02-28 08:04:22

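I have been using linear models to perform 2-sample proportion tests for a while, but I have realized that this might not be completely correct. It appears that a generalized linear model with a binomial family and an identity link gives exactly the unpooled 2-sample proportion test result. However, a linear model (or a glm with a gaussian family) gives a slightly different result. I am rationalizing that this may be due to how R solves the glm for the binomial versus the gaussian family, but could there be another cause?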

## prop.test gives pooled 2-sample proportion result
## glm w/ binomial family gives unpooled 2-sample proportion result
## lm and glm w/ gaussian family give unknown result

library(dplyr)
library(broom)
set.seed(12345)

## set up dataframe -------------------------
n_A <- 5000
n_B <- 5000

outcome <- rbinom(
  n = n_A + n_B,
  size = 1,
  prob = 0.5
)
treatment <- c(
  rep("A", n_A),
  rep("B", n_B)
)

## tbl_df() is deprecated in current dplyr; tibble() is re-exported by dplyr
df <- tibble(outcome = outcome, treatment = treatment)


## by hand, 2-sample prop tests ---------------------------------------------
p_A <- sum(df$outcome[df$treatment == "A"])/n_A
p_B <- sum(df$outcome[df$treatment == "B"])/n_B

p_pooled <- sum(df$outcome)/(n_A + n_B)
z_pooled <- (p_B - p_A) / sqrt( p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B) )
pvalue_pooled <- 2*(1-pnorm(abs(z_pooled)))

z_unpooled <- (p_B - p_A) / sqrt( (p_A * (1 - p_A))/n_A + (p_B * (1 - p_B))/n_B )
pvalue_unpooled <- 2*(1-pnorm(abs(z_unpooled)))


## using prop.test --------------------------------------
res_prop_test <- tidy(prop.test(
  x = c(sum(df$outcome[df$treatment == "A"]), 
        sum(df$outcome[df$treatment == "B"])),
  n = c(n_A, n_B),
  correct = FALSE
))
res_prop_test # same as pvalue_pooled
all.equal(res_prop_test$p.value, pvalue_pooled)
# [1] TRUE


# using glm with identity link -----------------------------------
res_glm_binomial <- df %>%
  do(tidy(glm(outcome ~ treatment, data = ., family = binomial(link = "identity")))) %>%
  filter(term == "treatmentB")
res_glm_binomial # same as pvalue_unpooled
all.equal(res_glm_binomial$p.value, pvalue_unpooled)
# [1] TRUE


## glm and lm gaussian --------------------------------

res_glm <- df %>%
  do(tidy(glm(outcome ~ treatment, data = .))) %>%
  filter(term == "treatmentB")
res_glm 
all.equal(res_glm$p.value, pvalue_unpooled)
all.equal(res_glm$p.value, pvalue_pooled)

res_lm <- df %>%
  do(tidy(lm(outcome ~ treatment, data = .))) %>%
  filter(term == "treatmentB")
res_lm
all.equal(res_lm$p.value, pvalue_unpooled)
all.equal(res_lm$p.value, pvalue_pooled)

all.equal(res_lm$p.value, res_glm$p.value)
# [1] TRUE

2 Answers

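This has nothing to do with how they solve the optimization problems that correspond to fitting the models. It has to do with the actual optimization problems the models pose.

Specifically, in large samples, you can effectively think of it as comparing two weighted least squares problems.

The linear model (lm) assumes (when unweighted) that the variance of the proportions is constant. The glm assumes that the variance of the proportions comes from the binomial assumption $\text{Var}(\hat{p})=\text{Var}(X/n) = p(1-p)/n$. This weights the data points differently, and so leads to somewhat different estimates* and a different variance for the difference.

* At least in some situations, though not necessarily in a straight comparison of proportions.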
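
The two variance assumptions can be made concrete with a small sketch in base R. The simulated data, the unequal group sizes, and the names `se_constant` and `se_binomial` are illustrative assumptions, not part of the question. The sketch computes the standard error of the difference once under a single shared residual variance and once under group-specific binomial variances, then compares each with the fitted models.

```r
set.seed(1)

# Illustrative data (assumed): unequal group sizes and success probabilities
n_A <- 300
n_B <- 700
treatment <- factor(rep(c("A", "B"), times = c(n_A, n_B)))
outcome <- rbinom(n_A + n_B, size = 1,
                  prob = ifelse(treatment == "A", 0.2, 0.5))

p_A <- mean(outcome[treatment == "A"])
p_B <- mean(outcome[treatment == "B"])

# Constant-variance assumption: one residual variance shared by both groups
rss <- n_A * p_A * (1 - p_A) + n_B * p_B * (1 - p_B)
se_constant <- sqrt(rss / (n_A + n_B - 2) * (1 / n_A + 1 / n_B))

# Binomial assumption: each group has its own variance p(1 - p)/n
se_binomial <- sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)

fit_lm  <- lm(outcome ~ treatment)
fit_glm <- glm(outcome ~ treatment, family = binomial(link = "identity"))

se_lm  <- coef(summary(fit_lm))["treatmentB", "Std. Error"]
se_glm <- coef(summary(fit_glm))["treatmentB", "Std. Error"]

c(se_lm = se_lm, se_constant = se_constant,
  se_glm = se_glm, se_binomial = se_binomial)
```

In this two-group setup both models estimate the same difference `p_B - p_A`. Only the standard error, and therefore the test statistic, changes.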

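In terms of the calculation, compare the standard error of the treatmentB coefficient for lm with that for the binomial glm. You already have the formula for the standard error of the treatmentB coefficient in the binomial glm (the denominator of z_unpooled). The standard error of the treatmentB coefficient in the standard lm is (SE_lm):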

    test = lm(outcome ~ treatment, data = df)
    treat_B =  as.numeric(df$treatment == "B")
    SE_lm = sqrt( sum(test$residuals^2)/(n_A+n_B-2) / 
              sum((treat_B - mean(treat_B))^2))
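
As a rough check of this formula, the sketch below recomputes `SE_lm` on freshly simulated data (the data and group sizes are assumptions for illustration) and compares it with the standard error reported by `summary()`. It also compares the lm p-value with the classical equal-variance two-sample t-test from `t.test(..., var.equal = TRUE)`, which is the test the lm coefficient corresponds to when the only predictor is a two-level factor.

```r
set.seed(2)

# Illustrative data (assumed), deliberately with unequal group sizes
n_A <- 400
n_B <- 600
treatment <- factor(rep(c("A", "B"), times = c(n_A, n_B)))
outcome <- rbinom(n_A + n_B, size = 1, prob = 0.5)

fit <- lm(outcome ~ treatment)

# Same formula as above, written against this sketch's variables
treat_B <- as.numeric(treatment == "B")
SE_lm <- sqrt(sum(residuals(fit)^2) / (n_A + n_B - 2) /
                sum((treat_B - mean(treat_B))^2))

# Standard error and p-value as reported by lm
se_summary <- coef(summary(fit))["treatmentB", "Std. Error"]
p_lm <- coef(summary(fit))["treatmentB", "Pr(>|t|)"]

# Equal-variance (pooled) two-sample t-test on the same data
p_ttest <- t.test(outcome ~ treatment, var.equal = TRUE)$p.value

all.equal(SE_lm, se_summary)
all.equal(p_lm, p_ttest)
```

So the lm result differs from both proportion tests in two ways: it pools the residual variance with an `n_A + n_B - 2` divisor, and it refers the statistic to a t distribution rather than a normal one.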

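See this post for the derivation. The only difference is that here the sample error is used in place of $\sigma^2$ (that is, 2 is subtracted from $n_A+n_B$ for the lost degrees of freedom). Without that $-2$, the lm and binomial glm standard errors actually seem to match when $n_A = n_B$.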
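
The role of the $-2$ can be checked numerically. This is a minimal sketch on simulated data with equal group sizes. The data and the names `se_with_df` and `se_without_df` are assumptions for illustration. Dropping the degrees-of-freedom correction reproduces the unpooled standard error, while keeping it inflates the standard error by a factor of $\sqrt{n/(n-1)}$, where $n$ is the common group size.

```r
set.seed(3)

# Illustrative data (assumed): equal group sizes, different success rates
n <- 500
treatment <- factor(rep(c("A", "B"), each = n))
outcome <- rbinom(2 * n, size = 1,
                  prob = ifelse(treatment == "A", 0.3, 0.6))

fit <- lm(outcome ~ treatment)
treat_B <- as.numeric(treatment == "B")
sxx <- sum((treat_B - mean(treat_B))^2)  # equals n/2 when groups are equal
rss <- sum(residuals(fit)^2)

se_with_df    <- sqrt(rss / (2 * n - 2) / sxx)  # what lm reports
se_without_df <- sqrt(rss / (2 * n) / sxx)      # no degrees-of-freedom correction

# Unpooled standard error, i.e. the denominator of z_unpooled
p_A <- mean(outcome[treatment == "A"])
p_B <- mean(outcome[treatment == "B"])
se_unpooled <- sqrt(p_A * (1 - p_A) / n + p_B * (1 - p_B) / n)

all.equal(se_without_df, se_unpooled)
all.equal(se_with_df / se_unpooled, sqrt(n / (n - 1)))
```

With 5000 observations per group, as in the question, that factor is very close to 1, which is consistent with the lm result being only slightly different from the unpooled test.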
