group_by:如何转换我的数据

数据挖掘 r
2022-02-21 14:54:44

我来自 Stata,正在努力适应 dplyr 中的 group_by。也许使用 group_by 是错误的方法,但是如果您了解 Stata,您就会明白我为什么要尝试使用它(我认为)。

我想转换数据,如下所示:

在此处输入图像描述

本质上,我试图将每个 NAICS 代码折叠成一行,并为每个 NAICS 代码的 CONC 和 HHI 下的 5 个统计信息创建列(CONC 的 4 个值和 HHI 的一个值,忽略 X)

数据在这里我正在尝试的代码如下:

ManufMktConc <- read.csv("Manufacturing.csv",
                            stringsAsFactors = FALSE)

ManufMktConc %<>% 
  tbl_df %>% 
  slice(-1) %>% 
  filter(CONCENFI.display.label!="All companies") %>%
  select(-GEO.id,-GEO.id2,-GEO.display.label,-COMPANY,-RCPTOT,-YEAR.id)

ManufMktConc_byInd <- group_by(ManufMktConc,NAICS.id) 

ManufMktConc_byInd %<>% 
      arrange(CONCENFI.id) %>% 
      mutate(FourFirm = ManufMktConc_byInd$CCORCPPCT[ManufMktConc_byInd$CONCENFI.id=="856"])

这给了我以下错误:

> ManufMktConc_byInd %<>% arrange(CONCENFI.id) %>% mutate(FourFirm=ManufMktConc_byInd$CCORCPPCT[ManufMktConc_byInd$CONCENFI.id=="856"])
Error: incompatible size (651), expecting 4 (the group size) or 1

所以我可以说我要求 R 将 651 个实例ManufMktConc_byInd$CONCENFI.id == "856"放入 4 个组中。我想这就是我对 group_by 的理解分崩离析的地方。为什么我的逻辑陈述不只适用于每个组?

谢谢你。

1个回答
ManufMktConc <- read.csv("Downloads/Manufacturing.csv",
                         stringsAsFactors = FALSE)

ManufMktConc %<>% 
  tbl_df %>% 
  slice(-1) %>% 
  filter(CONCENFI.display.label!="All companies") %>%
  select(-GEO.id,-GEO.id2,-GEO.display.label,-COMPANY,-RCPTOT,-YEAR.id)

##  use melt and cast from reshape2 
require(reshape2)

# melt makes the data set tall and thin using id variables and measure variables
# 
ManufMktConc_molten <- as.data.frame(ManufMktConc) %>%
  melt(id.vars=1:4, measure.vars=5:6) %>%
  filter(value!="X")
ManufMktConc_molten[1:5,]

    NAICS.id       NAICS.display.label CONCENFI.id CONCENFI.display.label  variable value
1      311        Food manufacturing         856    4 largest companies CCORCPPCT  16.3
2      311        Food manufacturing         857    8 largest companies CCORCPPCT  24.2
3      311        Food manufacturing         858   20 largest companies CCORCPPCT  38.4
4      311        Food manufacturing         859   50 largest companies CCORCPPCT  50.9
5     3111 Animal food manufacturing         856    4 largest companies CCORCPPCT  30.2



# make a new column with the eventual column header. (Note this is different that your example)
ManufMktConc_molten$label <- paste0(ManufMktConc_molten$variable,
                                    trimws(substr(ManufMktConc_molten$CONCENFI.display.label,1,2)))

# cast it into multiple columns (something like a pivot in Excel).
ManufMktConc_result  <- ManufMktConc_molten %>%
  cast(NAICS.id + NAICS.display.label ~ label) %>%
  select(1,2,4,6,3,5,7)     ## reorder columns

ManufMktConc_result[1:5,]

      NAICS.id             NAICS.display.label CCORCPPCT4 CCORCPPCT8 CCORCPPCT20 CCORCPPCT50 VSHERFI50
    1      311              Food manufacturing       16.3       24.2        38.4        50.9     110.7
    2     3111       Animal food manufacturing       30.2       40.7        57.8        71.5     368.6
    3    31111       Animal food manufacturing       30.2       40.7        57.8        71.5     368.6
    4   311111  Dog and cat food manufacturing       67.8       80.6        89.6        96.5    2019.4
    5   311119 Other animal food manufacturing       24.3       36.2        51.5        68.2     228.3