To add to AdamO's answer, I was taught to base my decisions about model assumptions more on whether failing to correct for a violated assumption in some way would lead me to misrepresent my data. As a concrete example of what I mean, I simulated some data in R, created some plots, and ran a few diagnostics on those data.
# lmSupport contains the lm.modelAssumptions function that I use below
require(lmSupport)
set.seed(12234)
# Create some data with a strong quadratic component
x <- rnorm(200, sd = 1)
y <- x + .75 * x^2 + rnorm(200, sd = 1)
# There is a significant linear trend
mod <- lm(y ~ x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.7972 -0.9511 -0.1312 0.6659 5.8659
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.77981 0.10463 7.453 2.77e-12 ***
x 1.19417 0.09795 12.191 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.477 on 198 degrees of freedom
Multiple R-squared: 0.4288, Adjusted R-squared: 0.4259
F-statistic: 148.6 on 1 and 198 DF, p-value: < 2.2e-16
However, when we plot the data it is obvious that the curved component is an important aspect of the relationship between x and y.
pX <- seq(min(x), max(x), by = .1)
pY <- predict(mod, data.frame(x = pX))
plot(x, y, frame = F)
lines(pX, pY, col = "red")

A diagnostic test of linearity also supports the argument that the quadratic component is an important aspect of the relationship between x and y in these data.
lm.modelAssumptions(mod, "linear")
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.7798 1.1942
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 180.04567 0.000e+00 Assumptions NOT satisfied!
Skewness 32.67166 1.091e-08 Assumptions NOT satisfied!
Kurtosis 23.99022 9.683e-07 Assumptions NOT satisfied!
Link Function 123.35831 0.000e+00 Assumptions NOT satisfied!
Heteroscedasticity 0.02547 8.732e-01 Assumptions acceptable.
# We should probably add the quadratic component to this model
mod <- lm(y ~ x + I(x^2))
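As a quick sketch of my own (not part of the original output), we could overlay the fit from this quadratic model on the same scatterplot to confirm that it now tracks the curvature, reusing the pX values created above.

# Illustrative sketch: overlay the quadratic fit on the earlier scatterplot
pY2 <- predict(mod, data.frame(x = pX))  # mod now includes the I(x^2) term
plot(x, y, frame = F)
lines(pX, pY2, col = "blue")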
Let's see what happens when we simulate data with a smaller (but still significant) nonlinear trend.
y <- x + .25 * x^2 + rnorm(200, sd = 1)
mod <- lm(y ~ x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.59701 -0.77446 0.03546 0.80261 2.75938
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.30500 0.07907 3.858 0.000155 ***
x 0.99934 0.07402 13.500 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.116 on 198 degrees of freedom
Multiple R-squared: 0.4793, Adjusted R-squared: 0.4767
F-statistic: 182.3 on 1 and 198 DF, p-value: < 2.2e-16
If we examine a plot of these new data, it is clear that they are represented quite well by just a linear trend.
pX <- seq(min(x), max(x), by = .1)
pY <- predict(mod, data.frame(x = pX))
plot(x, y, frame = F)
lines(pX, pY, col = "red")

This is the case even though the model fails the diagnostic test of linearity.
lm.modelAssumptions(mod, "linear")
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.3050 0.9993
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 34.6428 5.500e-07 Assumptions NOT satisfied!
Skewness 0.3355 5.624e-01 Assumptions acceptable.
Kurtosis 2.0094 1.563e-01 Assumptions acceptable.
Link Function 32.1379 1.436e-08 Assumptions NOT satisfied!
Heteroscedasticity 0.1600 6.892e-01 Assumptions acceptable.
My point is that diagnostic tests should not replace thinking on the part of the analyst. They are tools that help you understand whether your substantive conclusions actually follow from your analysis. For this reason, when I make these decisions I prefer to examine different types of plots rather than rely on a global test.
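As a sketch of the kind of plot I have in mind (my own illustration, not from the analysis above), a residuals-versus-fitted plot with a lowess smoother makes any leftover curvature easy to see without leaning on a single global p-value.

# Residuals vs. fitted values; the smoother highlights any remaining curvature
plot(fitted(mod), resid(mod), frame = F)
lines(lowess(fitted(mod), resid(mod)), col = "red")
abline(h = 0, lty = 2)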