@llewmills:本周,我遇到了一个项目,你询问的偏差编码派上了用场,所以我想在这里分享我在这个主题上学到的东西。
首先,我认为如果我们从一个更简单的模型开始会更容易:
m0.dev <- lm(outcome ~ group, df,
contrasts = list(group = c(-1,1)))
summary(m0.dev)
其输出由下式给出:
Call:
lm(formula = outcome ~ group, data = df,
contrasts = list(group = c(-1, 1)))
Residuals:
Min 1Q Median 3Q Max
-3.3119 -0.6728 0.1027 0.6748 2.8539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.07010 0.07019 29.49 <2e-16 ***
group1 1.98435 0.07019 28.27 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9926 on 198 degrees of freedom
Multiple R-squared: 0.8015, Adjusted R-squared: 0.8005
F-statistic: 799.3 on 1 and 198 DF, p-value: < 2.2e-16
在该模型中,对组使用偏差编码,截距表示 a 组和 b 组结果值 y 的总体平均值,而 group1 的系数表示该总体平均值与结果值 y 的平均值之间的差异a组。总体平均值就是这两个平均值的平均值:(i) 组 a 的 y 平均值和 (ii) 组 b 的 y 平均值。
换句话说:
2.07010 = a 和 b 组 y 的总体平均值
1.98435 =(a 组和 b 组 y 的总体平均值)- a 组 y 的平均值
回想一下,对于您的模拟数据,a 组的 y 平均值为 0.08574619,b 组的 y 平均值为 4.05445164。的确:
means <- tapply(df$outcome, df$group, mean)
means
> means
a b
0.08574619 4.05445164
以下是我们如何从上面报告的模型摘要中恢复这些方法:
# group a:
mean of group a = overall mean - (overall mean - mean of group a)
= intercept in deviation coded model m0.dev - coef of group a (aka, group 1) in model m0.dev
= 2.07010 - 1.98435 = 0.08575
# group b:
mean of group b = overall mean - (overall mean - mean of group b)
= overall mean + (overall mean - mean of group a)
= intercept in deviation coded model m0.dev + coef of group a (aka, group 1) in model m0.dev
= 2.07010 + 1.98435 = 4.05445
在上面,我们使用了每个组的平均值与整体平均值的偏差之和为零的事实:
(overall mean - mean of group a) + (overall mean - mean of group b) = 0.
这意味着:
(overall mean - mean of group b) = -(overall mean - mean of group a).
现在,我们可以使用以下代码绘制特定组的均值和从模型 m0.dev 的摘要中得出的总体均值:
plot(1:2, as.numeric(means), type="n", xlim=c(0.5,2.5), ylim=c(0,5), xlab="group", ylab="y", xaxt="n", las=1)
segments(x0=1-0.2, y0=means[1], x1 = 1+0.2, y1=means[1], lwd=3, col="dodgerblue") # group a mean
segments(x0=2-0.2, y0=means[2], x1 = 2+0.2, y1=means[2], lwd=3, col="orange")# group b mean
segments(x0=1.5-0.2, y0=coef(m0.dev)[1], x1 = 1.5+0.2, y1=coef(m0.dev)[1], lwd=3, col="red3", lty=2) # overall mean
text(1, means[1]+0.3, "Mean value of y \nfor group a\n(0.0857)", cex=0.8)
text(2, means[2]+0.3, "Mean value of y \nfor group b\n(4.0545)", cex=0.8)
text(1.5, coef(m0.dev)[1]+0.3, "Overall mean value of y \n across groups a and b\n(2.0701)", cex=0.8)
axis(1, at = c(1,2), labels=c("a","b"))

我们现在准备转向下面更复杂的模型:
m1.dev <- lm(outcome ~ pred*group, df, contrasts = list(group = c(-1,1)))
summary(m1.dev)
其输出由下式给出:
Call:
lm(formula = outcome ~ pred * group, data = df, contrasts = list(group = c(-1,
1)))
Residuals:
Min 1Q Median 3Q Max
-2.78611 -0.60587 -0.05853 0.58578 2.42940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.03308 0.06125 33.191 < 2e-16 ***
pred 0.52114 0.06557 7.947 1.47e-13 ***
group1 2.02386 0.06125 33.041 < 2e-16 ***
pred:group1 -0.14406 0.06557 -2.197 0.0292 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8628 on 196 degrees of freedom
Multiple R-squared: 0.8515, Adjusted R-squared: 0.8493
F-statistic: 374.7 on 3 and 196 DF, p-value: < 2.2e-16
该模型本质上拟合了两条不同的线——a 组和 b 组各一条——它们描述了结果值 y 如何随预测变量 pred 变化。两条线的截距是相对于两组整体线的截距来表示的。此外,两条线的斜率相对于两组的整体回归线的斜率表示。该模型报告以下数量:
2.03308 = a 和 b 两组整体回归线的估计截距(即模型 m1.dev 的摘要中报告的截距)
0.52114 = 整体回归线的估计斜率(即模型 m1.dev 的摘要中报告的 pred 系数)
2.02386 = 总体回归线的截距与 a 组(又名第 1 组)的回归线截距之间的估计差异(即模型 m1.dev 的摘要中报告的 group1 的系数)
-0.14406 = 总体回归线斜率与 a 组(又名第 1 组)的回归线斜率之间的估计差异(即模型 m1.dev 的摘要中报告的 pred:group1 的系数)
这是 R 代码,它允许您恢复特定于组的回归线的截距和斜率:
# Note: group a is re-labelled as group 1
intercept of group a line = intercept of overall line - (intercept of overall line - intercept of group a line)
= intercept in deviation coded model m1.dev - coef of group 1 in model m1.dev
= 2.03308 - (2.02386) = 0.00922
slope of group a line = slope of overall line - (slope of overall line - slope of group a line) =
= slope of pred in deviation coded model m1.dev - slope of pred:group1 in model m1.dev
= 0.52114 - (-0.14406) = 0.6652
# Now, use that:
# (intercept of overall line - intercept of group a line) + (intercept of overall line - intercept of group b line) = 0
# which means that:
# (intercept of overall line - intercept of group b line) = - (intercept of overall line - intercept of group a line)
# Similarly:
# (slope of overall line - slope of group a line) + (slope of overall line - slope of group b line) = 0
# which means that:
# (slope of overall line - slope of group b line) = - (slope of overall line - slope of group a line)
intercept of group b line = intercept of overall line - (intercept of overall line - intercept of group b line)
= intercept of overall line + (intercept of overall line - intercept of group a line)
= intercept in deviation coded model m1.dev + coef of group 1 in model m1.dev
= 2.03308 + (2.02386) = 4.05694
slope of group b line = slope of overall line - (slope of overall line - slope of group b line) =
= slope of overall line + (slope of overall line - slope of group a line) =
= slope of pred in deviation coded model m1.dev + slope of pred:group1 in model m1.dev
= 0.52114 + (-0.14406) = 0.37708
使用此信息,您可以使用以下代码绘制特定于组的回归线和整体回归线:
plot(outcome ~ pred, data = df, type="n")
# overall regression line
abline(a = coef(m1.dev)["(Intercept)"],
b = coef(m1.dev)["pred"],
col="red3", lwd=2, lty=2)
# regression line for group a
abline(a = coef(m1.dev)["(Intercept)"] - coef(m1.dev)["group1"],
b = coef(m1.dev)["pred"] - coef(m1.dev)["pred:group1"],
col="dodgerblue", lwd=2)
# regression line for group b
abline(a = coef(m1.dev)["(Intercept)"] + coef(m1.dev)["group1"],
b = coef(m1.dev)["pred"] + coef(m1.dev)["pred:group1"], col="orange", lwd=2)
points(outcome ~ pred, data = subset(df, group=="a"), col="dodgerblue")
points(outcome ~ pred, data = subset(df, group=="b"), col="orange")

当然,模型 m1.dev 的摘要允许您通过判断输出的 Intercept 和 pred 部分报告的 p 值的显着性来测试整个回归线的截距和斜率的显着性。它还允许您测试以下各项的重要性:
- 组的截距偏离总体截距的回归线(通过组 1 报告的 p 值);
- b 组回归线的斜率与整体斜率的偏差(通过 pred:group1 报告的 p 值)。
您何时想使用模型 m1.dev 捕获的偏差编码情况?如果:
- 结果变量 y 代表一些水质参数;
- 预测变量 pred 代表一个(居中的)年份变量;
- 组因子代表一个季节变量(水平 a = 冬季, b = 夏季);
那么您可以设想不仅要报告水质参数值随时间变化的冬季特定和夏季特定(线性)趋势,还要报告两个季节的总体趋势。但是,为了使整体趋势具有可解释性,您可能希望两个季节的数据有一些重叠。(在您生成的数据中,您基本上没有重叠 - 所以总体趋势可能没有提供信息。)
我希望这能回答你的问题,这很有趣。