数据挖掘 - group_by：如何转换我的数据 - 吾爱随笔录

group_by：如何转换我的数据

数据挖掘 r

2022-02-21 14:54:44

我来自 Stata，正在努力适应 dplyr 中的 group_by。也许使用 group_by 是错误的方法，但是如果您了解 Stata，您就会明白我为什么要尝试使用它（我认为）。

我想转换数据，如下所示：

本质上，我试图将每个 NAICS 代码折叠成一行，并为每个 NAICS 代码的 CONC 和 HHI 下的 5 个统计信息创建列（CONC 的 4 个值和 HHI 的一个值，忽略 X）

数据在这里。我正在尝试的代码如下：

ManufMktConc <- read.csv("Manufacturing.csv",
                            stringsAsFactors = FALSE)

ManufMktConc %<>% 
  tbl_df %>% 
  slice(-1) %>% 
  filter(CONCENFI.display.label!="All companies") %>%
  select(-GEO.id,-GEO.id2,-GEO.display.label,-COMPANY,-RCPTOT,-YEAR.id)

ManufMktConc_byInd <- group_by(ManufMktConc,NAICS.id) 

ManufMktConc_byInd %<>% 
      arrange(CONCENFI.id) %>% 
      mutate(FourFirm = ManufMktConc_byInd$CCORCPPCT[ManufMktConc_byInd$CONCENFI.id=="856"])

这给了我以下错误：

> ManufMktConc_byInd %<>% arrange(CONCENFI.id) %>% mutate(FourFirm=ManufMktConc_byInd$CCORCPPCT[ManufMktConc_byInd$CONCENFI.id=="856"])
Error: incompatible size (651), expecting 4 (the group size) or 1

所以我可以说我要求 R 将 651 个实例ManufMktConc_byInd$CONCENFI.id == "856"放入 4 个组中。我想这就是我对 group_by 的理解分崩离析的地方。为什么我的逻辑陈述不只适用于每个组？

谢谢你。

1个回答

ManufMktConc <- read.csv("Downloads/Manufacturing.csv",
                         stringsAsFactors = FALSE)

ManufMktConc %<>% 
  tbl_df %>% 
  slice(-1) %>% 
  filter(CONCENFI.display.label!="All companies") %>%
  select(-GEO.id,-GEO.id2,-GEO.display.label,-COMPANY,-RCPTOT,-YEAR.id)

##  use melt and cast from reshape2 
require(reshape2)

# melt makes the data set tall and thin using id variables and measure variables
# 
ManufMktConc_molten <- as.data.frame(ManufMktConc) %>%
  melt(id.vars=1:4, measure.vars=5:6) %>%
  filter(value!="X")
ManufMktConc_molten[1:5,]

    NAICS.id       NAICS.display.label CONCENFI.id CONCENFI.display.label  variable value
1      311        Food manufacturing         856    4 largest companies CCORCPPCT  16.3
2      311        Food manufacturing         857    8 largest companies CCORCPPCT  24.2
3      311        Food manufacturing         858   20 largest companies CCORCPPCT  38.4
4      311        Food manufacturing         859   50 largest companies CCORCPPCT  50.9
5     3111 Animal food manufacturing         856    4 largest companies CCORCPPCT  30.2



# make a new column with the eventual column header. (Note this is different that your example)
ManufMktConc_molten$label <- paste0(ManufMktConc_molten$variable,
                                    trimws(substr(ManufMktConc_molten$CONCENFI.display.label,1,2)))

# cast it into multiple columns (something like a pivot in Excel).
ManufMktConc_result  <- ManufMktConc_molten %>%
  cast(NAICS.id + NAICS.display.label ~ label) %>%
  select(1,2,4,6,3,5,7)     ## reorder columns

ManufMktConc_result[1:5,]

      NAICS.id             NAICS.display.label CCORCPPCT4 CCORCPPCT8 CCORCPPCT20 CCORCPPCT50 VSHERFI50
    1      311              Food manufacturing       16.3       24.2        38.4        50.9     110.7
    2     3111       Animal food manufacturing       30.2       40.7        57.8        71.5     368.6
    3    31111       Animal food manufacturing       30.2       40.7        57.8        71.5     368.6
    4   311111  Dog and cat food manufacturing       67.8       80.6        89.6        96.5    2019.4
    5   311119 Other animal food manufacturing       24.3       36.2        51.5        68.2     228.3

其它你可能感兴趣的问题

上一篇我应该如何处理音乐播放数据下一篇为什么全基因组测序服务的价格范围如此之大？