Mass convert categorical columns in Pandas (not one-hot encoding)

Tags: data-mining, scikit-learn, pandas, categorical-data, labels

2021-10-05 03:56:42

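I have a pandas dataframe with a large number of categorical columns, which I plan to use in a decision tree with scikit-learn. I need to convert them to numerical values (not one-hot vectors). I can do this with scikit-learn's LabelEncoder. The problem is that there are too many of them, and I don't want to convert them manually.

What would be an easy way to automate this process?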

3 Answers

If your categorical columns are currently character/object columns, you can use something like this to encode each one:

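# Columns whose dtype is object (typically strings)
char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index

for c in char_cols:
    # factorize returns (codes, uniques); keep only the integer codes
    df[c] = pd.factorize(df[c])[0]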
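To make the loop above concrete, here is a minimal sketch of what pd.factorize returns for a single column. The `colors` series is a made-up toy example, not data from the question. One thing worth knowing before feeding the codes to a model: with the default settings, missing values are coded as -1.

```python
import pandas as pd

# Hypothetical toy column; None stands in for a missing value.
colors = pd.Series(["red", "blue", "red", None, "green"])

# factorize returns a pair: integer codes and the distinct labels.
codes, uniques = pd.factorize(colors)

print(codes.tolist())    # one integer per row; the missing value becomes -1
print(uniques.tolist())  # distinct labels in order of first appearance
```

The codes follow the order in which labels first appear, so they carry no meaningful ordering of their own.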

If you need to be able to go back to the categories, I would create a dictionary to hold the encodings; something like:

char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
label_mapping = {}

for c in char_cols:
    # Keep each column's uniques so the codes can be mapped back to labels
    df[c], label_mapping[c] = pd.factorize(df[c])

Using Julien's mcve, this will output:

In [3]: print(df)
Out[3]: 
    a   b   c   d
0   0   0   0   0.155463
1   1   1   1   0.496427
2   0   0   2   0.168625
3   2   0   1   0.209681
4   0   2   1   0.661857

In [4]: print(label_mapping)
Out[4]:
{'a': Index(['Var2', 'Var3', 'Var1'], dtype='object'),
 'b': Index(['Var2', 'Var1', 'Var3'], dtype='object'),
 'c': Index(['Var3', 'Var2', 'Var1'], dtype='object')}
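As a rough sketch of the "go back to the categories" step: each code is a position in the stored uniques, so indexing the uniques with the codes recovers the labels. The small frame below is invented for illustration, and the decoding assumes the columns had no missing values (a -1 code would otherwise silently pick the last label).

```python
import pandas as pd

# Hypothetical toy frame; dtype=object keeps the strings as object columns.
df = pd.DataFrame(
    {"a": ["Var2", "Var3", "Var2"], "b": ["Var1", "Var1", "Var3"]},
    dtype=object,
)

# Encode, remembering the uniques of each column.
label_mapping = {}
for c in ["a", "b"]:
    df[c], label_mapping[c] = pd.factorize(df[c])

# Decode: use the integer codes as positions into the stored uniques.
# Assumes no missing values, i.e. no -1 codes.
decoded = df.copy()
for c, uniques in label_mapping.items():
    decoded[c] = uniques.to_numpy()[df[c].to_numpy()]

print(df)
print(decoded)
```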

First, let's create an mcve to play with:

import pandas as pd
import numpy as np

In [1]: categorical_array = np.random.choice(['Var1','Var2','Var3'],
                                             size=(5,3), p=[0.25,0.5,0.25])
        # Name the columns 'a', 'b', 'c', ...
        df = pd.DataFrame(categorical_array,
               columns=[chr(97 + i) for i in range(categorical_array.shape[1])])
        # Add another column that isn't categorical but float
        df['d'] = np.random.rand(len(df))
        print(df)

Out[1]:
      a     b     c         d
0  Var3  Var3  Var3  0.953153
1  Var1  Var2  Var1  0.924896
2  Var2  Var2  Var2  0.273205
3  Var2  Var1  Var3  0.459676
4  Var2  Var1  Var1  0.114358

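Now we can use pd.get_dummies to encode the first three columns.

Note that I use the drop_first parameter because N-1 dummies are enough to fully describe N possibilities (e.g., if a_Var2 and a_Var3 are both 0, then the value is a_Var1). Also, I specified the columns explicitly, but I didn't have to, since by default they are the columns with dtype object or category (more on that below).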

In [2]: df_encoded = pd.get_dummies(df, columns=['a','b', 'c'], drop_first=True)
        print(df_encoded)
Out[2]:
          d  a_Var2  a_Var3  b_Var2  b_Var3  c_Var2  c_Var3
0  0.953153       0       1       0       1       0       1
1  0.924896       0       0       1       0       0       0
2  0.273205       1       0       1       0       1       0
3  0.459676       1       0       0       0       0       1
4  0.114358       1       0       0       0       0       0

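In your specific application, you will either have to provide the list of categorical columns or infer which columns are categorical.

In the best case, your dataframe already has these columns with dtype=category, and you can pass columns=df.columns[df.dtypes == 'category'] to get_dummies.

Otherwise, I suggest setting the dtype of all the other columns appropriately (hint: pd.to_numeric, pd.to_datetime, etc.); the columns left with dtype object should then be your categorical columns.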
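The inference step described above might look roughly like this. The frame and its column names (`city`, `price`, `size`) are invented for illustration; the idea is only that once the non-categorical columns have proper dtypes, whatever is still object (or category) is a candidate categorical column.

```python
import pandas as pd

# Hypothetical frame: "price" holds numbers that were read in as strings.
df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Paris"],
        "price": ["1.5", "2.0", "3.25"],
        "size": [10, 20, 30],
    }
).astype({"city": object, "price": object})

# Give the non-categorical column its proper dtype first.
df["price"] = pd.to_numeric(df["price"])

# Whatever is still object (or already category) is treated as categorical.
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
print(cat_cols)
```

The resulting list can be passed as the columns argument of get_dummies, or looped over with pd.factorize if integer codes are wanted instead.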

The pd.get_dummies documentation describes the columns parameter and its default as follows:

columns : list-like, default None
    Column names in the DataFrame to be encoded.
    If `columns` is None then all the columns with
    `object` or `category` dtype will be converted.
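A small sketch of that default, using a made-up frame with one object column, one category column, and one float column: calling get_dummies without the columns argument should encode the first two and leave the float column alone.

```python
import pandas as pd

# Hypothetical frame with one object, one category, and one numeric column.
df = pd.DataFrame(
    {
        "a": pd.Series(["x", "y", "x"], dtype=object),
        "b": pd.Categorical(["u", "u", "v"]),
        "d": [0.1, 0.2, 0.3],
    }
)

# columns=None (the default): object and category columns are encoded.
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```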

To convert the types of multiple columns at once, I would use something like this:

df2 = df.select_dtypes(include=['type_of_interest'])

df2 = df2.apply(lambda x: x.astype('category'))

Then I would join them back to the original df.
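One possible way to finish this idea and get the integer values the question asks for is the .cat.codes accessor on category columns. This is a sketch on an invented frame, not necessarily what the answer's author had in mind. Unlike pd.factorize, categories inferred by astype('category') are sorted, so the codes follow alphabetical order here.

```python
import pandas as pd

# Hypothetical frame resembling the mcve: two string columns, one float.
df = pd.DataFrame(
    {
        "a": ["Var3", "Var1", "Var2"],
        "b": ["Var2", "Var2", "Var1"],
        "d": [0.95, 0.92, 0.27],
    }
).astype({"a": object, "b": object})

# Convert every object column to category in one call.
cat_cols = df.select_dtypes(include="object").columns
df = df.astype({c: "category" for c in cat_cols})

# Replace each category column by its integer codes; "d" is left untouched.
encoded = df.assign(**{c: df[c].cat.codes for c in cat_cols})
print(encoded)
```

The original labels remain available through df[c].cat.categories, which plays the same role as the label_mapping dictionary in the factorize approach.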
