How do I find the relationship between categorical variables?

data-mining r statistics
2021-09-16 06:10:23

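I have a dataset containing the Outcome of each case, the Country under review, and Country_Judge, which records whether a judge reviewing the case (as part of a panel of judges) is also from the country under review, coded as TRUE/FALSE.

How can I measure the relationship between a case's Country_Judge and its Outcome? I want to know whether the judge's nationality has an effect on the outcome of the case.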

1 Answer

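Is Outcome also a boolean variable? If so, a simple prop.test will do: it tests whether the proportion of a given outcome is the same in both groups.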
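As a minimal sketch of the call itself, assuming made-up counts (the numbers and the names guilty and cases below are hypothetical, not taken from the question's data): prop.test takes a vector of "success" counts and a vector of group sizes, in the same group order.

```r
# Hypothetical counts, for illustration only.
# Group 1: judge from the reviewed country; group 2: judge from elsewhere.
guilty <- c(80, 400)   # number of "Guilty" outcomes per group
cases  <- c(200, 800)  # total number of cases per group

# Two-sample test for equality of proportions (base R, stats package)
res <- prop.test(x = guilty, n = cases)

res$estimate  # observed proportion of guilty outcomes in each group
res$p.value   # p-value for the null hypothesis of equal proportions
res$conf.int  # confidence interval for the difference in proportions
```

The returned object is a list, so the estimates, p-value and confidence interval can be pulled out individually rather than read off the printed summary.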

Here is a toy dataset in which judges from the same country are less likely to return a guilty verdict.

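library(tidyverse)

n <- 1000
# About 20% of cases have a judge from the country under review
dataset <- tibble(country_judge = sample(c(TRUE, FALSE), n,
                                         replace = TRUE, prob = c(0.2, 0.8))) %>%
  # P(Guilty) is 0.4 when the judge shares the country, 0.5 otherwise
  mutate(outcome = ifelse(country_judge,
                          sample(c("Guilty", "Innocent"), n,
                                 replace = TRUE, prob = c(0.4, 0.6)),
                          sample(c("Guilty", "Innocent"), n,
                                 replace = TRUE, prob = c(0.5, 0.5))))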

dataset %>%
  group_by(country_judge) %>%
  summarise(p_guilty=mean(outcome=="Guilty"))

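This gives something like the following (the exact numbers vary from run to run, because no random seed is set):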

# A tibble: 2 x 2
  country_judge  p_guilty
          <lgl>     <dbl>
1         FALSE 0.5108835
2          TRUE 0.3698630

Now extract the vectors of trials and "successes" (here, guilty verdicts) for each group and pass them to prop.test:

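# Number of cases in each group (ordered FALSE, TRUE)
trials <- dataset %>%
  group_by(country_judge) %>%
  count() %>%
  pull(n)

# Number of guilty verdicts in each group (same order)
successes <- dataset %>%
  filter(outcome == "Guilty") %>%
  group_by(country_judge) %>%
  count() %>%
  pull(n)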

prop.test(successes, trials)

This gives something like:

    2-sample test for equality of proportions with continuity correction

data:  successes out of trials
X-squared = 13.068, df = 1, p-value = 0.0003003
alternative hypothesis: two.sided
95 percent confidence interval:
 0.06517776 0.21686317
sample estimates:
   prop 1    prop 2 
0.5108835 0.3698630
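
For two groups, prop.test is equivalent to Pearson's chi-squared test (with continuity correction) on the 2 x 2 contingency table, so the same check can be run straight from a cross-tabulation. The following is a minimal base-R sketch under the same toy assumptions as above; the seed value and the variable names tab, chi and pt are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
n <- 1000

# Toy data with the same structure as the example above
country_judge <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.2, 0.8))
p_guilty <- ifelse(country_judge, 0.4, 0.5)
outcome <- ifelse(runif(n) < p_guilty, "Guilty", "Innocent")

# 2 x 2 contingency table: rows FALSE/TRUE, columns Guilty/Innocent
tab <- table(country_judge, outcome)

# Chi-squared test of independence (Yates' correction is the default for 2 x 2)
chi <- chisq.test(tab)

# prop.test also accepts a two-column table of successes and failures;
# the first column ("Guilty") is treated as the success count
pt <- prop.test(tab)

chi$statistic
pt$statistic
```

Both calls should report the same X-squared statistic and p-value; prop.test additionally gives a confidence interval for the difference in proportions, which is often the more useful number when describing the size of the effect.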