How do I find the relationship between categorical variables?

data-mining r statistics

2021-09-16 06:10:23

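I have a dataset containing the Outcome of each case, the Country under review, and a variable Country_Judge, coded TRUE or FALSE, that records whether one of the judges reviewing the case (the cases are heard by a panel of judges) is from the country under review.

How can I measure the relationship between a case's Country_Judge and its Outcome? I want to know whether a judge's nationality has an effect on the outcome of the case.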

1 Answer

Is Outcome also a boolean variable? If so, a simple prop.test will do.
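If Outcome turns out to have more than two levels, prop.test no longer applies directly. One common alternative is a chi-squared test of independence on the contingency table. The following is only a minimal sketch under stated assumptions: the three outcome labels and all probabilities are made up for illustration and do not come from the question, and the outcome is simulated independently of country_judge, so the test will usually not reject.

```r
# Hypothetical three-level outcome; labels and probabilities are assumptions.
set.seed(42)
n <- 1000
country_judge <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.2, 0.8))
outcome <- sample(c("Violation", "No violation", "Struck out"), n,
                  replace = TRUE, prob = c(0.5, 0.3, 0.2))

# 2 x 3 contingency table of judge nationality against outcome
tab <- table(country_judge, outcome)
print(tab)

# Chi-squared test of independence; df = (2 - 1) * (3 - 1) = 2
res <- chisq.test(tab)
print(res)
```

With small expected cell counts, chisq.test issues a warning about the approximation; fisher.test(tab) is an option in that situation.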

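Here is a toy dataset with a two-level outcome, in which panels that include a judge from the same country are less likely to return a guilty verdict.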

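library(tidyverse)

n <- 1000

# Simulate whether a same-country judge sat on the panel (about 20% of cases),
# then draw the outcome with P(Guilty) = 0.4 when one did and 0.5 otherwise.
dataset <- tibble(country_judge = sample(c(TRUE, FALSE), n,
                                         replace = TRUE, prob = c(0.2, 0.8))) %>%
  mutate(outcome = ifelse(country_judge,
                          sample(c("Guilty", "Innocent"), n,
                                 replace = TRUE, prob = c(0.4, 0.6)),
                          sample(c("Guilty", "Innocent"), n,
                                 replace = TRUE, prob = c(0.5, 0.5))))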

dataset %>%
  group_by(country_judge) %>%
  summarise(p_guilty=mean(outcome=="Guilty"))

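This gives something like the following (the exact numbers vary from run to run, since no random seed is set):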

# A tibble: 2 x 2
  country_judge  p_guilty
          <lgl>     <dbl>
1         FALSE 0.5108835
2          TRUE 0.3698630

Now extract the vectors of trials (cases per group) and "successes" (guilty verdicts per group), and feed them into prop.test.

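# Number of cases in each group, ordered FALSE then TRUE
trials <- dataset %>%
  count(country_judge) %>%
  pull(n)

# Number of guilty verdicts in each group, in the same order.
# Summing a logical keeps a group with zero guilty verdicts in the result,
# so the two vectors always stay aligned.
successes <- dataset %>%
  group_by(country_judge) %>%
  summarise(n = sum(outcome == "Guilty")) %>%
  pull(n)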

prop.test(successes, trials)

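This gives something like the output below. Here prop 1 is the FALSE group and prop 2 is the TRUE group, so the confidence interval is for the difference in guilty rates (no same-country judge minus same-country judge):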

    2-sample test for equality of proportions with continuity correction

data:  successes out of trials
X-squared = 13.068, df = 1, p-value = 0.0003003
alternative hypothesis: two.sided
95 percent confidence interval:
 0.06517776 0.21686317
sample estimates:
   prop 1    prop 2 
0.5108835 0.3698630
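As a cross-check on the approach above, the same comparison can be run in base R from a 2 x 2 contingency table. prop.test accepts a two-column table of successes and failures, and for two groups it reports the same statistic and p-value as chisq.test with the default continuity correction. This is a sketch under stated assumptions: the data are re-simulated here with an arbitrary seed, so the numbers will not match the output above.

```r
# Re-simulate a toy dataset (seed and sample size are arbitrary choices).
set.seed(1)
n <- 1000
country_judge <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.2, 0.8))
# P(Guilty) is 0.4 with a same-country judge on the panel, 0.5 otherwise
p_guilty <- ifelse(country_judge, 0.4, 0.5)
outcome <- ifelse(runif(n) < p_guilty, "Guilty", "Innocent")

# Rows: country_judge (FALSE, TRUE); columns: Guilty, Innocent.
# The first column is treated as the "success" count by prop.test.
tab <- table(country_judge, outcome)
print(tab)

pt <- prop.test(tab)   # two-sample test for equality of proportions
ct <- chisq.test(tab)  # chi-squared test with continuity correction

print(pt)
print(ct)
```

The table route avoids building the successes and trials vectors by hand, at the cost of relying on the column order: "Guilty" sorts before "Innocent", which is why it counts as the success here.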
