机器算法验证 - rlm 稳健回归中系数的 P 值 - 吾爱随笔录

我在修改后的 iris 数据集上使用 MASS 包的 rlm 稳健线性回归，如下所示：

> myiris = iris
> myiris$Species = as.numeric(myiris$Species)
> head(myiris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2       1
2          4.9         3.0          1.4         0.2       1
3          4.7         3.2          1.3         0.2       1
4          4.6         3.1          1.5         0.2       1
5          5.0         3.6          1.4         0.2       1
6          5.4         3.9          1.7         0.4       1

> library(MASS)
> rmod = rlm(Species~., data=myiris)
> rmod
Call:
rlm(formula = Species ~ ., data = myiris)
Converged in 6 iterations

Coefficients:
 (Intercept) Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
  1.14943807  -0.11067690  -0.02603537   0.21581357   0.63793686 

Degrees of freedom: 150 total; 145 residual
Scale estimate: 0.191 
> 
> sumrmod = summary(rmod)
> sumrmod

Call: rlm(formula = Species ~ ., data = myiris)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.59732 -0.15769  0.01089  0.10955  0.56317 

Coefficients:
             Value   Std. Error t value
(Intercept)   1.1494  0.2056     5.5906
Sepal.Length -0.1107  0.0579    -1.9128
Sepal.Width  -0.0260  0.0599    -0.4346
Petal.Length  0.2158  0.0571     3.7821
Petal.Width   0.6379  0.0948     6.7287

Residual standard error: 0.1913 on 145 degrees of freedom

这并没有给出 p.values 所以我计算它们如下（使用基 R 的 pt 函数）：

> dd = data.frame(sumrmod$coefficients)                             #$
> dd$p.value =  pt(dd$t.value, sumrmod$df[2])                       #$
> dd
                   Value Std..Error    t.value    p.value
(Intercept)   1.14943807 0.20560264  5.5905804 0.99999995
Sepal.Length -0.11067690 0.05786107 -1.9128044 0.02887227
Sepal.Width  -0.02603537 0.05991073 -0.4345693 0.33226054
Petal.Length  0.21581357 0.05706173  3.7821068 0.99988663
Petal.Width   0.63793686 0.09480869  6.7286751 1.00000000

然而，这些都是不正确的，因为普通的 lm 函数和其他回归函数表明 Petal.Length 和 Petal.Width 在这个回归中非常重要：

> summary(lm(Species~., data=myiris))

Call:
lm(formula = Species ~ ., data = myiris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59215 -0.15368  0.01268  0.11089  0.55077 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.18650    0.20484   5.792 4.15e-08 ***
Sepal.Length -0.11191    0.05765  -1.941   0.0542 .  
Sepal.Width  -0.04008    0.05969  -0.671   0.5030    
Petal.Length  0.22865    0.05685   4.022 9.26e-05 ***
Petal.Width   0.60925    0.09446   6.450 1.56e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2191 on 145 degrees of freedom
Multiple R-squared:  0.9304,    Adjusted R-squared:  0.9285 
F-statistic: 484.5 on 4 and 145 DF,  p-value: < 2.2e-16

错误在哪里？我在这里没有使用正确的方法来计算 p.value 吗？

编辑：正如@Glen_b 在评论中所建议（进一步）：

> dd$p.value =  2*pt(abs(dd$t.value), sumrmod$df[2], lower.tail=FALSE)      #$
> dd
               Value Std..Error    t.value      p.value
(Intercept)   1.14943807 0.20560264  5.5905804 1.089792e-07
Sepal.Length -0.11067690 0.05786107 -1.9128044 5.774455e-02
Sepal.Width  -0.02603537 0.05991073 -0.4345693 6.645211e-01
Petal.Length  0.21581357 0.05706173  3.7821068 2.267410e-04
Petal.Width   0.63793686 0.09480869  6.7286751 3.691993e-10

这些似乎是正确的（最终）。