
Rows to columns in data.table R (or Python)

Data mining: python, r, dataset, reshape, data.table

2021-10-16 04:04:18

This is something I have not been able to do with R's reshape2 library. I have the following data:

 zone       code        literal
 1: A         14           bicl
 2: B         14           bicl
 3: B         24          calso
 4: A         51           mara
 5: B         51           mara
 6: A         125           gan
 7: A         143          carc
 8: B         143          carc

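That is, each zone has 4 codes with their corresponding literals. I would like to convert this into a dataset with one column for each of the four codes and one column for each of the four literals: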

 zone  code1 literal1   code2 literal2   code3 literal3   code4 literal4    
 1: A    14     bicl      51     mara     125      gan      143   carc
 2: B    14     bicl      24    calso      51     mara      143   carc

Is there a simple way to do this in R? If not, I would also be happy with a solution in Python.

4 Answers

In base R this takes only two lines:

X <- read.table(header = TRUE, text = "
zone       code        literal
 A         14           bicl
 B         14           bicl
 B         24          calso
 A         51           mara
 B         51           mara
 A         125           gan
 A         143          carc
 B         143          carc")

X$time <- ave(X$code, X$zone, FUN = seq_along)
reshape(X, direction = "wide", timevar = "time", idvar = "zone", sep = "")

# output
  zone code1 literal1 code2 literal2 code3 literal3 code4 literal4
1    A    14     bicl    51     mara   125      gan   143     carc
2    B    14     bicl    24    calso    51     mara   143     carc
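
For readers who prefer Python, the same idea (number the rows within each zone, then reshape to wide) can be sketched with pandas. This is only an illustration, not part of the answer above: the `time` column mirrors the one created by `ave()`, and the names `wide`, `times` and `ordered` are made up for this example.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "zone": ["A", "B", "B", "A", "B", "A", "A", "B"],
    "code": [14, 14, 24, 51, 51, 125, 143, 143],
    "literal": ["bicl", "bicl", "calso", "mara", "mara", "gan", "carc", "carc"],
})

# Number the rows within each zone (1, 2, 3, ...), like ave(..., FUN = seq_along).
df["time"] = df.groupby("zone").cumcount() + 1

# Reshape to wide: one row per zone, columns keyed by (variable, time).
wide = df.pivot(index="zone", columns="time", values=["code", "literal"])

# Interleave the columns as code1, literal1, code2, literal2, ...
times = sorted(int(t) for t in df["time"].unique())
ordered = [(name, t) for t in times for name in ("code", "literal")]
wide = wide[ordered]

# Flatten the two-level column names and turn zone back into a column.
wide.columns = [f"{name}{t}" for name, t in wide.columns]
wide = wide.reset_index()
print(wide)
```

As in the R version, the numbering follows the row order within each zone, so the input should be sorted beforehand if a particular order of the pairs matters.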

Here is a Python solution, given a data frame (df) containing the data above:

>>> import pandas
>>> from itertools import chain
>>> data = []
>>> for zone in df.zone.unique():
...    codetuples = [(row[2], row[3]) for row in df[df['zone']==zone].itertuples()]
...    data.append([zone] + list(chain.from_iterable(codetuples)))
...
>>> df = pandas.DataFrame(data, columns=['zone', 'code1', 'literal1', 'code2', 'literal2', 'code3', 'literal3', 'code4', 'literal4'])
>>> df
  zone   code1 literal1   code2 literal2 code3 literal3  code4 literal4
0    A      14     bicl      51     mara   125      gan    143     carc
1    B      14     bicl      24    calso    51     mara    143     carc

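Explanation

df.itertuples() returns an iterator over the rows of the data frame as tuples. The first entry (index 0 of the tuple) is the row index, so positions 2 and 3 of each tuple hold the two columns of interest, code and literal.

The order of code1 vs code2 is not guaranteed; I store the data from df in the variable codetuples so that you can sort it or process it further. There is also no guarantee that every zone has 4 code/literal pairs, so you can add error checking at that point if needed.

Once you have an acceptable list of four tuples, flatten it with chain.from_iterable(). Then prepend the zone and store the result as another data frame.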
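
The sorting and error checking mentioned above could look roughly like the sketch below. It is one possible variant, not the answer's own code: the helper name `widen`, its `n_pairs` parameter and the choice to sort by code are assumptions made for illustration.

```python
from itertools import chain

import pandas as pd


def widen(df, n_pairs=4):
    """Return one row per zone with code1, literal1, ..., code<n>, literal<n>."""
    rows = []
    for zone, group in df.groupby("zone", sort=True):
        # Sort the (code, literal) pairs by code so the column order is deterministic.
        pairs = sorted(zip(group["code"], group["literal"]))
        # Error check: every zone must contribute exactly n_pairs pairs.
        if len(pairs) != n_pairs:
            raise ValueError(f"zone {zone!r} has {len(pairs)} pairs, expected {n_pairs}")
        # Flatten [(c1, l1), (c2, l2), ...] into [c1, l1, c2, l2, ...] and prepend the zone.
        rows.append([zone] + list(chain.from_iterable(pairs)))
    columns = ["zone"] + [
        f"{name}{i}" for i in range(1, n_pairs + 1) for name in ("code", "literal")
    ]
    return pd.DataFrame(rows, columns=columns)


df = pd.DataFrame({
    "zone": ["A", "B", "B", "A", "B", "A", "A", "B"],
    "code": [14, 14, 24, 51, 51, 125, 143, 143],
    "literal": ["bicl", "bicl", "calso", "mara", "mara", "gan", "carc", "carc"],
})
result = widen(df)
print(result)
```

Raising an error on a wrong pair count is just one choice; padding short zones with missing values would be another, depending on what the downstream analysis expects.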

In Python with pandas, you can use .T to transpose rows and columns.
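
As a small illustration of what `.T` does, the sketch below transposes the rows for zone A only. The names `zone_a` and `transposed` are made up for this example. Note that transposing on its own gives one row per variable rather than the interleaved code1, literal1, ... layout asked for, so it would presumably have to be combined with per-zone grouping and flattening.

```python
import pandas as pd

# The four (code, literal) pairs of zone A from the question.
zone_a = pd.DataFrame(
    {"code": [14, 51, 125, 143], "literal": ["bicl", "mara", "gan", "carc"]}
)

# .T swaps the axes: the 4 rows become 4 columns, and the 2 columns become 2 rows.
transposed = zone_a.T
print(transposed)
```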

Here is one way to do this in R using the Tidyverse:

library(tidyverse)

data %>%
  # Data-dependent step: labels rows 1-2 as group 1, rows 3-4 as group 2, and so on,
  # which happens to number the pairs 1 to 4 within each zone for this row order.
  mutate(group = rep(1:4, each = 2)) %>%
  gather("key", "value", c("code", "literal")) %>%
  mutate(key = paste0(key, group)) %>%
  dplyr::select(-group) %>%
  spread(key, value) %>%
  # Data-dependent step: interleave the columns; otherwise all 'code' columns come first
  # and the 'literal' columns follow.
  dplyr::select(zone, ends_with("1"), ends_with("2"), ends_with("3"), ends_with("4"))
