Data mining - Transposing every n rows of a large dataset into columns - 吾爱随笔录

Transposing every n rows of a large dataset into columns

data-mining  big-data  dataset  data-formats

2022-02-19 16:37:05

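This is my first time working with a very large dataset (about 1.5 million rows) in SAS, and I am running into some difficulties. The dataset I have is formatted as a "long" .txt file that looks like this: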

'cat1/: Topic1_Variable1'
'cat2/: Topic1_Variable2'
'cat3/: Topic1_Variable3'
'cat4/: Topic1_Variable4'

'cat1/: Topic2_Variable1'
'cat2/: Topic2_Variable2'
'cat3/: Topic2_Variable3'
'cat4/: Topic2_Variable4'

'cat1/: Topic3_Variable1'
'cat2/: Topic3_Variable2'
'cat3/: Topic3_Variable3'
'cat4/: Topic3_Variable4'
...

For analysis, and for sharing with others, I would really like to have it in the following format:

cat1              cat2              cat3              cat4
Topic1_Variable1  Topic1_Variable2  Topic1_Variable3  Topic1_Variable4
Topic2_Variable1  Topic2_Variable2  Topic2_Variable3  Topic2_Variable4
Topic3_Variable1  Topic3_Variable2  Topic3_Variable3  Topic3_Variable4

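I suspect this might be easier in R, but honestly I am drawing a complete blank in SAS. I even played around with MS Access to try to get the data into the shape I want, but the program crashed every time (because of the size?). In any case, I have looked into PROC TRANSPOSE and some statements in PROC SQL, but most of the functionality in those procedures seems to be aimed at combining repeated "topics". In the data I have provided, each "group" is one individual's set of responses to questions that are repeated for thousands of people, and I want to keep each occurrence separate rather than perform a UNION as defined in PROC SQL. At this point I feel like I am overthinking it, but I just cannot get past the mental block and actually do what I am trying to do. Any help or guidance is greatly appreciated.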

3 Answers

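Hey, is Python or another tool an option here? Since you mention that it is a large dataset, you may want to iterate over it rather than load the whole thing at once.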

Here is a solution in Python:

import pandas as pd
from collections import defaultdict

# Sample of the "long" input: one 'category/: value' record per line.
inputs = [
    'cat1/: Topic1_Variable1',
    'cat2/: Topic1_Variable2',
    'cat3/: Topic1_Variable3',
    'cat4/: Topic1_Variable4',
    'cat1/: Topic2_Variable1',
    'cat2/: Topic2_Variable2',
    'cat3/: Topic2_Variable3',
    'cat4/: Topic2_Variable4',
    'cat1/: Topic3_Variable1',
    'cat2/: Topic3_Variable2',
    'cat3/: Topic3_Variable3',
    'cat4/: Topic3_Variable4',
]

# Collect the values of each category in order of appearance.
outputs = defaultdict(list)

for item in inputs:
    cat, topic = item.split('/: ')
    outputs[cat].append(topic)

# One column per category, one row per group.
print(pd.DataFrame(outputs))

Output:

               cat1              cat2              cat3              cat4
0  Topic1_Variable1  Topic1_Variable2  Topic1_Variable3  Topic1_Variable4
1  Topic2_Variable1  Topic2_Variable2  Topic2_Variable3  Topic2_Variable4
2  Topic3_Variable1  Topic3_Variable2  Topic3_Variable3  Topic3_Variable4
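
For a file of about 1.5 million rows, the same idea can be applied while streaming the input instead of building a list in memory. The following is a minimal sketch and not part of the original answer. It assumes every record looks like 'catN/: value' (optionally wrapped in single quotes), that every group lists the same n categories in the same order, and that any line without the separator (such as a blank line between groups) can be skipped. The function name transpose_stream is hypothetical.

```python
import csv
import io


def transpose_stream(lines, out, n=4, sep='/: '):
    """Write every n 'cat/: value' records from `lines` as one CSV row to `out`.

    Assumes each group holds the same n categories in the same order.
    A trailing incomplete group is silently dropped.
    """
    writer = csv.writer(out)
    header, row = [], []
    header_written = False
    for raw in lines:
        line = raw.strip().strip("'")   # drop newline and surrounding quotes
        if sep not in line:
            continue                    # skip blank or non-record lines
        cat, value = line.split(sep, 1)
        if not header_written:
            header.append(cat)          # category names come from the first group
        row.append(value)
        if len(row) == n:
            if not header_written:
                writer.writerow(header)
                header_written = True
            writer.writerow(row)
            row = []


# Small in-memory demonstration with two groups.
sample = io.StringIO(
    "'cat1/: Topic1_Variable1'\n"
    "'cat2/: Topic1_Variable2'\n"
    "'cat3/: Topic1_Variable3'\n"
    "'cat4/: Topic1_Variable4'\n"
    "\n"
    "'cat1/: Topic2_Variable1'\n"
    "'cat2/: Topic2_Variable2'\n"
    "'cat3/: Topic2_Variable3'\n"
    "'cat4/: Topic2_Variable4'\n"
)
result = io.StringIO()
transpose_stream(sample, result)
print(result.getvalue())
```

For a real file, the same function could be given open file handles, for example open('long.txt') for the input and open('wide.csv', 'w', newline='') for the output; both file names are placeholders. Only one group is held in memory at a time.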

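In R, I strongly recommend the reshape2 package, in particular the combination of its melt and cast functions (in reshape2 the cast functions are named dcast and acast).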
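
A cast-style reshape needs an identifier that says which group each record belongs to. As a rough illustration of that long-to-wide idea, the sketch below uses Python with pandas to match the code earlier in this thread; it is not reshape2 code. The column names cat, value, and row are hypothetical, and the group size n = 4 is an assumption taken from the sample data.

```python
import pandas as pd

records = [
    'cat1/: Topic1_Variable1',
    'cat2/: Topic1_Variable2',
    'cat3/: Topic1_Variable3',
    'cat4/: Topic1_Variable4',
    'cat1/: Topic2_Variable1',
    'cat2/: Topic2_Variable2',
    'cat3/: Topic2_Variable3',
    'cat4/: Topic2_Variable4',
]
n = 4  # assumed number of categories per group

# Long table: one row per record, with the category and its value.
long_df = pd.DataFrame(
    [r.split('/: ', 1) for r in records], columns=['cat', 'value']
)

# Group identifier: records 0-3 belong to row 0, records 4-7 to row 1, ...
long_df['row'] = [i // n for i in range(len(long_df))]

# Cast to wide: one line per group, one column per category.
wide_df = long_df.pivot(index='row', columns='cat', values='value')
print(wide_df)
```

Because each (row, cat) pair occurs exactly once, no aggregation happens and every response stays separate.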

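It has been a while since I used SAS, but I think you can use a data step in which you create an ID variable to aggregate on, and then create one variable for each of "cat1-4". You can then use PROC TRANSPOSE, or a PROC SQL sum() with a "group by" clause on the ID variable.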

So the first step would be:

'Topic_ID' | 'Cat1' | 'Cat2' | 'Cat3' | 'Cat4'
     1     |   1.5  |    0   |    0   |    0
     1     |    0   |    3   |    0   |    0
     1     |    0   |    0   |    1   |    0
     1     |    0   |    0   |    0   |    4
     2     |    3   |    0   |    0   |    0
    ...

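If your dataset has no explicit Topic_number, you can always compute it for each observation from the observation number. With observations numbered from 1 and four rows per topic, ceil(obs_#/4) gives 1 for the first four rows, 2 for the next four, and so on (a plain floor(obs_#/4) would push the fourth row of each topic into the next group).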
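
To make the numbering concrete, the short sketch below prints the group number for the first few observations. It is written in Python purely for illustration and assumes observations are numbered from 1 (as with _N_ in a SAS data step) and four rows per topic; the helper name topic_id is hypothetical. The last column shows the plain floor division for comparison.

```python
def topic_id(obs, n=4):
    """Return the 1-based group number for a 1-based observation number."""
    # Equivalent to ceil(obs / n), written with integer arithmetic only.
    return (obs - 1) // n + 1


# Columns: observation number, topic_id, plain floor(obs / 4)
for obs in range(1, 10):
    print(obs, topic_id(obs), obs // 4)
```

With 1-based numbering, the plain floor puts observations 1 to 3 in group 0 and observation 4 in group 1, so the groups no longer line up with the four-row topics.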

Then, by running a PROC SQL sum() with group by, you can reduce the data to something like this:

'Topic_ID' | 'Cat1' | 'Cat2' | 'Cat3' | 'Cat4'
     1     |   1.5  |    3   |    1   |    4
     2     |    3   |   ...  |  ...   |  ...

This is not necessarily the most efficient approach, but it is easy to implement in SAS.
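
The two steps can be sketched outside SAS as well. The example below is an analogue in Python with pandas, not SAS code: it builds the sparse intermediate table and then collapses it with a grouped sum, which plays the role of PROC SQL sum() with group by. The values for topic 1 and the first value of topic 2 follow the tables above; the remaining topic 2 values are invented for illustration, and the group size n = 4 is an assumption.

```python
import pandas as pd

# Long input: (1-based observation number, category, numeric value).
# Topic 2 values after Cat1 are made up for this illustration.
long_rows = [
    (1, 'Cat1', 1.5), (2, 'Cat2', 3.0), (3, 'Cat3', 1.0), (4, 'Cat4', 4.0),
    (5, 'Cat1', 3.0), (6, 'Cat2', 2.0), (7, 'Cat3', 5.0), (8, 'Cat4', 0.5),
]
n = 4  # assumed rows per topic
cats = ['Cat1', 'Cat2', 'Cat3', 'Cat4']

# Step 1: one row per observation; the value sits in its own category
# column and every other category column is 0.
sparse = pd.DataFrame(
    [
        {
            'Topic_ID': (obs - 1) // n + 1,
            **{c: (val if c == cat else 0.0) for c in cats},
        }
        for obs, cat, val in long_rows
    ]
)

# Step 2: collapse each Topic_ID to a single row by summing the columns.
wide = sparse.groupby('Topic_ID')[cats].sum()
print(wide)
```

Summing only makes sense for numeric values. Text values such as those in the question would need a different way to pick the single non-empty entry per group.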
