Data Mining - How to binary-encode a multi-valued categorical variable in a Pandas DataFrame? - 吾爱随笔录

data-mining python pandas

2021-10-06 07:27:08

Suppose we have the following DataFrame, in which one column holds multiple values per row:

    categories
0 - ["A", "B"]
1 - ["B", "C", "D"]
2 - ["B", "D"]

How can we get a table like this?

   "A"  "B"  "C"  "D"
0 - 1    1    0    0
1 - 0    1    1    1
2 - 0    1    0    1

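Note: I don't necessarily need a new DataFrame; I want to know how to convert a DataFrame like this into a format better suited to machine learning.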

1 Answer

If [0, 1, 2] are numeric labels rather than the index, then pandas.DataFrame.pivot_table works:

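In []:
import pandas as pd

# Long format: one row per (label, category) pair
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])
# Count occurrences of each pair; absent pairs are filled with 0
data.pivot_table(index=['number_label'], columns=['category'], aggfunc=[len], fill_value=0)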

Out[]:
              len
category        A  B  C  D
number_label
0               1  1  0  0
1               0  1  1  1
2               0  1  0  1

This blog post was helpful.
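For the same long-format case (labels in a column rather than the index), `pandas.crosstab` is another way to build the indicator table. The following is only a minimal sketch, not the answer author's method: the variable names `counts` and `indicators` are illustrative, and clipping to 0/1 assumes you want presence/absence rather than counts.

```python
import pandas as pd

# Same long-format records as above: one row per (label, category) pair.
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])

# crosstab counts how often each label/category pair occurs; absent pairs are 0.
counts = pd.crosstab(data['number_label'], data['category'])

# Assumption: we want 0/1 indicators, so cap any repeated pair at 1.
indicators = counts.clip(upper=1)
print(indicators)
```

Unlike the `pivot_table` call, the result has a single-level column index (`A`, `B`, `C`, `D`), which may be more convenient to pass on to a model.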

If [0, 1, 2] is the index, then collections.Counter is useful:

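In []:
import collections
import pandas as pd

data2 = pd.DataFrame.from_dict(
    {'categories': {0: ['A', 'B'], 1: ['B', 'C', 'D'], 2: ['B', 'D']}})
# Count the values in each row's list
data3 = data2['categories'].apply(collections.Counter)
# One column per value; values absent from a row become 0
pd.DataFrame.from_records(data3).fillna(value=0)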
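When the lists sit in a column and [0, 1, 2] is the index, a pandas-only alternative is to explode the lists and one-hot encode the elements. This is a sketch under assumptions, not part of the original answer: it assumes a pandas version that provides `Series.explode` (0.25 or later), and the names `exploded` and `encoded` are illustrative.

```python
import pandas as pd

# The DataFrame from the question: each row holds a list of categories.
data2 = pd.DataFrame({'categories': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]})

# explode() yields one row per list element and repeats the original index.
exploded = data2['categories'].explode()

# One-hot encode every element, then take the max per original index so that
# each original row collapses back into a single row of 0/1 indicators.
encoded = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(encoded)
```

Because `encoded` keeps the original index, it can be attached to the source frame with `data2.join(encoded)` if the other columns are still needed.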
