How can I perform a two-sample t-test in R by inputting sample statistics rather than the raw data?

Cross Validated: r, t-test

2022-02-11 12:54:12

Suppose we have the summary statistics given below:

gender     mean        sd  n
f      1.666667 0.5773503  3
m      4.500000 0.5773503  4

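How do you run a two-sample t-test (to see whether the means of some variable differ significantly between men and women) using statistics like these rather than the actual data?

I couldn't find how to do this anywhere on the internet. Most tutorials, and even the manual, deal only with tests on an actual data set.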

4 Answers

You can write your own function based on what we know about the mechanics of the two-sample $t$-test. For example, this will do the job:

# m1, m2: the sample means
# s1, s2: the sample standard deviations
# n1, n2: the sample sizes
# m0: the null value for the difference in means to be tested for. Default is 0. 
# equal.variance: whether or not to assume equal variance. Default is FALSE. 
t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
    if( equal.variance==FALSE ) 
    {
        se <- sqrt( (s1^2/n1) + (s2^2/n2) )
        # welch-satterthwaite df
        df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
    } else
    {
        # pooled standard deviation, scaled by the sample sizes
        se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) ) 
        df <- n1+n2-2
    }      
    t <- (m1-m2-m0)/se 
    dat <- c(m1-m2, se, t, 2*pt(-abs(t),df))    
    names(dat) <- c("Difference of means", "Std Error", "t", "p-value")
    return(dat) 
}

Example usage:

set.seed(0)
x1 <- rnorm(100)
x2 <- rnorm(200) 
# you'll find this output agrees with that of t.test when you input x1,x2
(tt2 <- t.test2(mean(x1), mean(x2), sd(x1), sd(x2), length(x1), length(x2)))

Difference of means       Std Error               t         p-value 
         0.01183358      0.11348530      0.10427416      0.91704542

This matches the result of t.test:

(tt <- t.test(x1, x2))

#         Welch Two Sample t-test
#   
#   data:  x1 and x2
#   t = 0.10427, df = 223.18, p-value = 0.917
#   alternative hypothesis: true difference in means is not equal to 0
#   95 percent confidence interval:
#    -0.2118062  0.2354734
#   sample estimates:
#    mean of x  mean of y 
#   0.02266845 0.01083487 

tt$statistic == tt2[["t"]]
#        t 
#     TRUE 

tt$p.value == tt2[["p-value"]]
# [1] TRUE
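
As an illustration, the same calculation can be applied to the summary statistics in the question. The sketch below is self-contained and uses a hypothetical helper, `summary_t_test`, which is not part of base R; it simply re-implements the Welch and pooled formulas from `t.test2` above, assuming a null difference of 0.

```r
# Hypothetical helper (not part of base R): two-sample t-test from summary statistics.
summary_t_test <- function(m1, m2, s1, s2, n1, n2, pooled = FALSE) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  if (pooled) {
    # pooled variance, equal-variance t-test
    sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
    se <- sqrt(sp2 * (1 / n1 + 1 / n2))
    df <- n1 + n2 - 2
  } else {
    # Welch test with Welch-Satterthwaite degrees of freedom
    se <- sqrt(v1 + v2)
    df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  }
  tstat <- (m1 - m2) / se
  c(t = tstat, df = df, p.value = 2 * pt(-abs(tstat), df))
}

# Summary statistics from the question: females (f) vs. males (m)
welch  <- summary_t_test(1.666667, 4.5, 0.5773503, 0.5773503, 3, 4)
pooled <- summary_t_test(1.666667, 4.5, 0.5773503, 0.5773503, 3, 4, pooled = TRUE)
print(rbind(welch, pooled))
```

Because the two standard deviations are equal, both variants give the same t statistic (about -6.43); only the degrees of freedom, and hence the p-value, differ.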

You just calculate it by hand:

$t = \frac{(\text{mean}_f - \text{mean}_m) - \text{expected difference}}{SE}$

$SE = \sqrt{\frac{sd_f^2}{n_f} + \frac{sd_m^2}{n_m}}$

where $df = n_m + n_f - 2$

The expected difference is probably zero.
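
As a worked example, substituting the summary statistics from the question into these formulas, with an expected difference of 0 and intermediate values rounded for display, gives roughly:

$SE = \sqrt{\frac{0.5773503^2}{3} + \frac{0.5773503^2}{4}} \approx \sqrt{0.1111 + 0.0833} \approx 0.441$

$t = \frac{(1.666667 - 4.500000) - 0}{SE} \approx -6.43$

$df = 4 + 3 - 2 = 5$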

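If you want the p-value, use the pt() function. For a two-sided test, double the lower-tail probability at -|t|:

2 * pt(-abs(t), df)

So, putting the code together (note that the standard deviations must be squared inside the square root):

> tstat <- ((1.666667 - 4.500000) - 0) / sqrt(0.5773503^2/3 + 0.5773503^2/4)
> p <- 2 * pt(-abs(tstat), df = 3 + 4 - 2)
> p   # approximately 0.00136

This assumes equal variances, which is reasonable here because both groups have the same standard deviation. In that case the standard error above equals the pooled standard error, so df = n_f + n_m - 2 applies.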

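You can do the calculations based on the formula in a book (or on a web page), or you can generate random data that have the stated properties (see the mvrnorm function in the MASS package) and use the regular t.test function on the simulated data.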
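
A minimal sketch of the simulation approach, assuming the MASS package is installed: calling `mvrnorm` with `empirical = TRUE` makes the generated sample reproduce the requested mean and variance exactly rather than only in expectation. The variable names `f`, `m`, and `res`, the seed, and the choice of `var.equal = TRUE` are illustrative assumptions, not part of the original answer.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)  # arbitrary seed; the sample moments are fixed by empirical = TRUE

s2 <- matrix(0.5773503^2)  # 1x1 covariance matrix (variance = sd^2)

# empirical = TRUE forces the sample mean and variance to match mu and Sigma
f <- as.vector(mvrnorm(n = 3, mu = 1.666667, Sigma = s2, empirical = TRUE))
m <- as.vector(mvrnorm(n = 4, mu = 4.500000, Sigma = s2, empirical = TRUE))

# Check that the simulated data reproduce the summary statistics
print(c(mean_f = mean(f), sd_f = sd(f), mean_m = mean(m), sd_m = sd(m)))

# Ordinary t-test on the simulated data (pooled variance assumed here)
res <- t.test(f, m, var.equal = TRUE)
print(res)
```

Since the t statistic depends only on the means, standard deviations, and sample sizes, the test result is the same for any seed, up to floating-point error. Dropping `var.equal = TRUE` gives the Welch version instead.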

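The question is about R, but the issue can arise with any other statistical software. Stata, for example, has various so-called immediate commands, which allow calculations from summary statistics alone. For the particular case of the ttesti command, which applies here, see http://www.stata.com/manuals13/rttest.pdf.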
