What exactly is the "proportion of variability explained"?

Cross Validated · r · anova · variability
2022-04-13 12:08:59

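I often hear people say that "more than 70% of the variability is explained by ...". What exactly does this mean? Is it a proportion of the sum of squares (SSE), or of the mean squares (MSE)? For example, in the following ANOVA table: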

                                    Df Sum Sq Mean Sq F value Pr(>F)    
as.factor(site)                    444   8357   18.82   163.1 <2e-16 ***
as.factor(year)                     12    569   47.43   410.9 <2e-16 ***
as.factor(month)                     5    863  172.53  1494.8 <2e-16 ***
as.factor(year):as.factor(month)    60    769   12.82   111.1 <2e-16 ***
Residuals                        34188   3946    0.12                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
7176 observations deleted due to missingness

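Can we say that most of the variability is explained by site? Site accounts for most of the SSE, but because there are so many sites, the MSE for site is almost the lowest in the table.

How should I interpret this in practice? I want to know where the variability lies: does the response vary mainly over time or over space? Is site really the largest source of variability, or is it month or year? Should I be reading the SSE column or the MSE column for this?

PS: Please note that I am not a professional statistician, so if your answer involves a lot of math, please add a simple summary for dummies :-)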

1 Answer

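When you hear "more than 70% of the variability is explained by ...", the speaker is referring to sums of squares (SS), not mean squares (MS). I should point out that exactly what they mean is ambiguous. They may be referring to eta-squared or to partial eta-squared: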

$$
\eta^2 = \frac{SS_{IV_j}}{SS_{Total}}
\qquad
\eta^2_{partial} = \frac{SS_{IV_j}}{SS_{IV_j} + SS_{Residuals}}
$$

Part of the reason the SS is used is that it can be partitioned (at least if you are using type I SS, see here), but the MS cannot.
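
As a rough illustration, the two ratios can be computed by hand from the Sum Sq column of the table in the question. This is only a sketch: the short term names are labels chosen here, and the total SS is reconstructed by adding up the rounded values in the table, which assumes sequential (type I) SS.

```python
# Sum Sq column copied from the ANOVA table in the question.
sum_sq = {
    "site": 8357,
    "year": 569,
    "month": 863,
    "year:month": 769,
}
ss_residuals = 3946

# Assumption: the table reports type I (sequential) SS, so the terms
# plus the residuals add up to the total sum of squares.
ss_total = sum(sum_sq.values()) + ss_residuals

# eta-squared: each term's share of the total SS.
eta_sq = {term: ss / ss_total for term, ss in sum_sq.items()}

# partial eta-squared: each term's SS relative to itself plus the residual SS.
partial_eta_sq = {term: ss / (ss + ss_residuals) for term, ss in sum_sq.items()}

for term in sum_sq:
    print(f"{term:<12} eta^2 = {eta_sq[term]:.3f}   "
          f"partial eta^2 = {partial_eta_sq[term]:.3f}")
```

With these figures, site alone accounts for about 0.58 of the total SS, and all four terms together account for about 0.73. The eta-squared values plus the residual share add up to 1, whereas the partial eta-squared values do not.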

You raise a good point that there is more opportunity for a given factor to contribute to the variability in the response when there are more groups in that factor (this assumes, of course, that there is real variability in the levels of the factor). Many people forget, or are ignorant of, this fact. Unfortunately, it is not possible to get around this issue. The implication of this is that the question 'which factor is most important' may not be answerable in an absolute sense, but only relative to something else.
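
To make this concrete for the table in the question, the sketch below puts each term's share of the total SS next to its SS per degree of freedom (the Mean Sq column). The short term names are labels chosen here, and the total again assumes that the rounded Sum Sq values, residuals included, add up to the total SS.

```python
# (Df, Sum Sq) pairs copied from the ANOVA table in the question.
terms = {
    "site": (444, 8357),
    "year": (12, 569),
    "month": (5, 863),
    "year:month": (60, 769),
}
ss_residuals = 3946

# Assumption: the terms and the residuals sum to the total SS (type I SS).
ss_total = sum(ss for _, ss in terms.values()) + ss_residuals

# Share of the total SS versus SS per degree of freedom (mean square).
share = {name: ss / ss_total for name, (_, ss) in terms.items()}
mean_sq = {name: ss / df for name, (df, ss) in terms.items()}

for name in terms:
    print(f"{name:<12} share of SS = {share[name]:.3f}   "
          f"SS per df = {mean_sq[name]:.2f}")

print("largest share of SS:", max(share, key=share.get))
print("largest SS per df:  ", max(mean_sq, key=mean_sq.get))
```

The two rankings disagree: site has the largest share of the SS, while month has the largest SS per degree of freedom. That is the sense in which "most important" depends on what the factor is being compared against.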