
How to convert multiple categorical values in a single feature to binary using Python

Data Mining, Machine Learning, Python, scikit-learn

2021-10-11 05:23:33

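I have a movie dataset with 28 columns. One of them is genres. For each row in the dataset, the value of the genres column has the form "Action|Animation|Comedy|Family|Fantasy". I want to encode these values with pandas.get_dummies(), but each cell holds several values at once. How should I handle this? Additional information is at the following link (the question was moved from Stack Overflow): https://stackoverflow.com/q/40331558/4028904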

1 Answer

I started with the following dataset:

import pandas as pd
data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                 'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                            'Action|Adventure|Fantasy',
                            'Action|Adventure|Thriller',
                            'Action|Thriller']},
                columns=['title', 'genres'])


     title                           genres
0   Avatar  Action|Adventure|Fantasy|Sci-Fi
1  Pirates         Action|Adventure|Fantasy
2  Spectre        Action|Adventure|Thriller
3   Batman                  Action|Thriller

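First, you want the data in a structure that pairs each title with one genre at a time, so that a title spans multiple rows. You can get that as a Series like this: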

cleaned = data.set_index('title').genres.str.split('|', expand=True).stack()


title
Avatar   0       Action
         1    Adventure
         2      Fantasy
         3       Sci-Fi
Pirates  0       Action
         1    Adventure
         2      Fantasy
Spectre  0       Action
         1    Adventure
         2     Thriller
Batman   0       Action
         1     Thriller
dtype: object
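As a side note, here is a minimal sketch of another way to build the same long format, assuming pandas 0.25 or later, where Series.explode is available. The name long_format is only illustrative. Because the split lists are exploded directly, the title index is simply repeated and no second index level appears.

```python
import pandas as pd

data = pd.DataFrame({
    'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
    'genres': ['Action|Adventure|Fantasy|Sci-Fi',
               'Action|Adventure|Fantasy',
               'Action|Adventure|Thriller',
               'Action|Thriller'],
})

# Split each genre string into a list, then emit one row per list element.
# The title index is repeated for every genre of that title.
long_format = data.set_index('title')['genres'].str.split('|').explode()
print(long_format)
```

The result can be passed to pd.get_dummies and grouped by title in the same way as the stacked Series.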

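(The stacked Series carries an extra index level that we don't want, but we'll get rid of it shortly.) get_dummies will work now, but it encodes each row separately, so each title ends up with one row per genre. We therefore need to aggregate the rows back by title: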

pd.get_dummies(cleaned, prefix='g').groupby(level=0).sum()


         g_Action  g_Adventure  g_Fantasy  g_Sci-Fi  g_Thriller
title
Avatar        1.0          1.0        1.0       1.0         0.0
Batman        1.0          0.0        0.0       0.0         1.0
Pirates       1.0          1.0        1.0       0.0         0.0
Spectre       1.0          1.0        0.0       0.0         1.0
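For delimiter-separated strings like these, pandas also provides Series.str.get_dummies, which splits and encodes in one call. The following is a minimal sketch on the same sample data, not a replacement for the steps above. The 'g_' prefix and the names dummies and encoded are illustrative choices.

```python
import pandas as pd

data = pd.DataFrame({
    'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
    'genres': ['Action|Adventure|Fantasy|Sci-Fi',
               'Action|Adventure|Fantasy',
               'Action|Adventure|Thriller',
               'Action|Thriller'],
})

# One 0/1 indicator column per distinct genre, splitting on '|'
dummies = data['genres'].str.get_dummies(sep='|').add_prefix('g_')

# Attach the indicator columns back to the titles (row order is preserved)
encoded = pd.concat([data[['title']], dummies], axis=1)
print(encoded)
```

Unlike the groupby version, this keeps the rows in their original order and leaves title as a regular column rather than the index.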
