How can I test a hypothesis?

data-mining, machine-learning, statistics

2022-02-16 08:35:44

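I have a table named app_satisfaction that contains the user ID, the satisfaction rating, and the number of people each user invited.

I grouped by satisfaction and took the average. On average, users with satisfaction = "BAD" invited 2.25 people, the "GOOD" group invited 2.09, and the "EXCELLENT" group invited 1.89.

So my hypothesis is that people who dislike the app are more likely to invite others, because inviting people earns them free coupons and they do not want to spend their own money on an app they dislike.

My concern is that drawing a conclusion just from the average number of invitations per group seems unsound. The "GOOD" and "EXCELLENT" groups also contain more people than the "BAD" group.

How can I test my hypothesis? What approaches are used for real-world problems like this?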

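1 Answer

As I understand it, you have a factor (satisfaction: "BAD", "GOOD", and so on) and a numeric response, "invitations", treated here as continuous. To compare two groups, you can use a two-sample test: a t-test, or the Wilcoxon rank-sum test as a nonparametric alternative. To compare all the groups at once, you can use a simple linear regression of the form: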

$\mathit{invitations} = \beta_0 \, \mathit{satisfaction}_1 + \beta_1 \, \mathit{satisfaction}_2 + \dots + u.$
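To make the satisfaction terms in this formula concrete, here is a minimal sketch of how R expands a factor into indicator columns before fitting. The data frame `df`, the column name `satisfaction`, and the level labels are assumptions made for illustration. Note that with R's default treatment coding, `lm()` fits an intercept for the first (baseline) level plus one 0/1 indicator for each remaining level.

```r
# Hypothetical mini data frame: one row per satisfaction level.
# The column name `satisfaction` and the level labels are assumptions.
df <- data.frame(
  satisfaction = factor(
    c("BAD", "GOOD", "EXCELLENT"),
    levels = c("BAD", "GOOD", "EXCELLENT")  # first level becomes the baseline
  )
)

# Design matrix that lm() would build for the formula `~ satisfaction`
X <- model.matrix(~ satisfaction, data = df)
print(X)
```

Each row has a 1 in the intercept column, and a 1 in at most one indicator column, so the coefficient on an indicator is that level's difference from the baseline.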

Example:

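library(dplyr)

# iris ships with base R (datasets package), so no extra package is needed to load it
data(iris)

# Number of observations per group
table(iris$Species)
# Optional: drop one species to keep only two groups
#iris = iris[!(iris$Species=="versicolor"),]

# Mean sepal length per species (funs() is deprecated in current dplyr)
iris %>%
  group_by(Species) %>%
  summarise(Sepal.Length = mean(Sepal.Length, na.rm = TRUE))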

Result (means):

# A tibble: 3 x 2
  Species    Sepal.Length
  <fct>             <dbl>
1 setosa             5.01
2 versicolor         5.94
3 virginica          6.59

Comparing two groups:

# Two-samples Wilcoxon test
wilcox.test(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="virginica"])
# The p-value is below the significance level alpha = 0.05, so we can conclude that sepal length differs significantly between the two species

Result:

    Wilcoxon rank sum test with continuity correction

data:  iris$Sepal.Length[iris$Species == "setosa"] and iris$Sepal.Length[iris$Species == "virginica"]
W = 38.5, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
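For comparison, the parametric counterpart mentioned above can be sketched as follows. This assumes the same built-in iris data; the variable names `setosa`, `virginica`, `res`, and `diff_means` are introduced here for illustration only. R's `t.test()` defaults to Welch's version, which does not assume equal variances in the two groups.

```r
# Sepal lengths for the two species being compared
setosa    <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

# Two-sample t-test; the default var.equal = FALSE gives Welch's test
res <- t.test(setosa, virginica)
print(res)

# Estimated difference in means (setosa minus virginica)
diff_means <- unname(res$estimate[1] - res$estimate[2])
print(diff_means)
```

If both tests point the same way, as they should here, the conclusion does not hinge on the normality assumption behind the t-test.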

Regression:

# Simple linear regression with Species as a categorical predictor
summary(lm(Sepal.Length ~ Species, data = iris))
# p-values below 0.05 mean that level's mean differs significantly from the baseline level (the intercept)

Result:

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

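The interesting part here is the Pr(>|t|) column. If a value in this column is below 0.05, you can say that the corresponding level differs significantly from the intercept, which represents the baseline category (here "setosa"). For the intercept row itself, the p-value only tests whether the baseline mean differs from zero.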
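The regression p-values only compare each level with the baseline, so "versicolor" versus "virginica" is not tested directly. One possible way to get all pairwise comparisons, sketched here as a suggestion that goes beyond the original answer, is Tukey's honest significant difference on the equivalent one-way ANOVA. The names `fit_aov`, `tukey`, and `pairs` are illustrative.

```r
# One-way ANOVA with the same formula as the regression
fit_aov <- aov(Sepal.Length ~ Species, data = iris)

# Tukey's HSD: every pair of levels, with p-values adjusted for multiple comparisons
tukey <- TukeyHSD(fit_aov)
print(tukey)

# Matrix with one row per pair and columns diff, lwr, upr, p adj
pairs <- tukey$Species
```

The "diff" entries for the pairs involving "setosa" match the regression estimates, since both describe differences between the same group means.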

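In this regression, the Estimate column gives you the "setosa" mean directly as the intercept. The effect of "versicolor" is 0.9300, and 5.0060 + 0.9300 = 5.936, which is the "versicolor" mean, and so on.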
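The arithmetic above can be checked in code. This is a small sketch on the same iris data; the names `fit`, `b`, `from_coefs`, and `direct` are illustrative. It rebuilds each group mean from the fitted coefficients and compares it with the mean computed directly from the data.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)
b <- coef(fit)

# Reconstruct each group mean: baseline = intercept, others = intercept + effect
from_coefs <- c(
  setosa     = unname(b["(Intercept)"]),
  versicolor = unname(b["(Intercept)"] + b["Speciesversicolor"]),
  virginica  = unname(b["(Intercept)"] + b["Speciesvirginica"])
)

# Group means computed directly from the data
direct <- tapply(iris$Sepal.Length, iris$Species, mean)

print(from_coefs)
print(direct)
```

The two printed vectors should agree up to floating-point error, which is why a regression on a single factor is just another way of comparing group means.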
