How do I one-hot encode string categorical features in Keras?

Tags: data-mining, classification, keras, feature-engineering, encoding

2021-09-21 05:59:50

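I am working on a binary classification problem. The output column of my dataset is already encoded as 0/1. The problem is that I have many categorical features (columns) that are strings, and I want to one-hot encode them.

I have 18 features (a few are integers, the rest are categorical strings) and 1 output column.

I tried this: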

import pandas as pd
from keras.utils import to_categorical

dataframe = pd.read_csv('basic_df_export.csv', sep=';', encoding='ISO-8859-1', header=None)

dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:17]
Y = dataset[:,17]

# define example
encoded = to_categorical(X)
print(encoded)

But it does not work and gives me this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-318da09d7033> in <module>()
      9 
     10 # define example
---> 11 encoded = to_categorical(X)
     12 print(encoded)

~/anaconda/lib/python3.6/site-packages/keras/utils/np_utils.py in to_categorical(y, num_classes)
     21         is placed last.
     22     """
---> 23     y = np.array(y, dtype='int')
     24     input_shape = y.shape
     25     if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:

ValueError: invalid literal for int() with base 10: 'photo'

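2 Answers

For string data, use get_dummies() (from Pandas). to_categorical() takes integers as input.

There are two important differences between Keras's to_categorical() and Pandas's get_dummies().

Keras: to_categorical()

to_categorical() takes integers as input (strings are not allowed).
to_categorical() generates dummies starting from 0 by default!

Look at the built-in help:

help(to_categorical)

which says: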

to_categorical(y, num_classes=None, dtype='float32')
    Converts a class vector (integers) to binary class matrix.

    E.g. for use with categorical_crossentropy.

    # Arguments
        y: class vector to be converted into a matrix
            (integers from 0 to num_classes).
        num_classes: total number of classes.
        dtype: The data type expected by the input, as a string
            (`float32`, `float64`, `int32`...)
...

So if your data is numeric (int), you can use to_categorical(). You can check whether your data is an np.array of integers by looking at .dtype and/or type().
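The check described above could be wrapped up roughly as follows. This is only a minimal sketch: the helper name `is_integer_array` is hypothetical and is not part of Keras or NumPy.

```python
import numpy as np

def is_integer_array(values):
    """Return True if `values` converts to a NumPy array with an integer dtype."""
    arr = np.asarray(values)
    return np.issubdtype(arr.dtype, np.integer)

ints = np.array([2, 2, 3, 3, 4, 4])
strings = np.array(["photo", "video", "photo"])

print(is_integer_array(ints))     # integer labels: suitable for to_categorical()
print(is_integer_array(strings))  # strings: use get_dummies() instead
```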

import numpy as np
npa = np.array([2,2,3,3,4,4])
print(npa.dtype, type(npa))
print(npa)

Result (the integer dtype may show as int64, depending on platform and NumPy version):

int32 <class 'numpy.ndarray'>
[2 2 3 3 4 4]

Now you can use to_categorical():

from keras.utils import to_categorical
cat1 = to_categorical(npa)
print(cat1.dtype, type(cat1))
print(cat1)

This yields a matrix:

float32 <class 'numpy.ndarray'>
[[0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]]

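Note that the matrix has five columns (from zero to four, the maximum value in my np.array). The first two columns (representing 0 and 1) are 0 throughout the matrix because those values do not occur in the original data.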
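If those empty leading columns are unwanted, one option is to remap the labels to 0..k-1 before encoding. The sketch below is an assumption of this example rather than part of the answer: it uses `np.unique(..., return_inverse=True)` for the remapping and `np.eye` as a plain-NumPy stand-in for to_categorical(), so it runs without Keras.

```python
import numpy as np

npa = np.array([2, 2, 3, 3, 4, 4])

# classes holds the sorted distinct labels; codes holds each label's index into classes
classes, codes = np.unique(npa, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class index i
one_hot = np.eye(len(classes), dtype="float32")[codes]

print(classes)
print(codes)
print(one_hot)
```

Passing `codes` to to_categorical() instead of using `np.eye` should give the same three-column matrix.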

to_categorical() also accepts input that is not explicitly defined as an np.array. For example, the following statements are also valid.

alt1 = to_categorical([0,0,1,1,2,2])
print(alt1.dtype, type(alt1))
print(alt1)

alt2 = to_categorical((0,0,1,1,2,2))
print(alt2.dtype, type(alt2))
print(alt2)

Because the values now range from 0 to 2, the result looks like this:

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

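Pandas: get_dummies()

When you have a Pandas df, you can convert selected columns to dummies with get_dummies(), regardless of the data type in the column. So a column of strings can also be converted to dummies.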

import pandas as pd
df = pd.DataFrame(data={'col1':["A", "A", "B", "B", "C", "C"]})
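# dtype='uint8' keeps the 0/1 output shown here; pandas 2.0+ defaults to bool
alt3 = pd.get_dummies(df['col1'], dtype='uint8')
print(type(alt3))
print(alt3)

This gives: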

<class 'pandas.core.frame.DataFrame'>
   A  B  C
0  1  0  0
1  1  0  0
2  0  1  0
3  0  1  0
4  0  0  1
5  0  0  1

Note that the result is (again) a Pandas df, so we need to convert it to an np.array.

alt3 = alt3.to_numpy()
print(alt3.dtype, type(alt3))
print(alt3)

This yields:

uint8 <class 'numpy.ndarray'>
[[1 0 0]
 [1 0 0]
 [0 1 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]]

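Now it can be used with Keras.

Note that the resulting matrix here is not (!) zero-based. Instead, each distinct value in the selected Pandas column gets its own column in the dummy matrix.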
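Applied to a mixed table like the one in the question, the same idea might look like the sketch below. The DataFrame and its column names (`likes`, `post_type`, `weekday`, `label`) are made up for illustration and stand in for the question's CSV; only the non-numeric columns are expanded into dummies.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: mixed integer and string columns
df = pd.DataFrame({
    "likes": [10, 3, 7, 1],
    "post_type": ["photo", "video", "photo", "link"],
    "weekday": ["mon", "tue", "mon", "mon"],
    "label": [0, 1, 0, 1],
})

y = df["label"].to_numpy()
features = df.drop(columns=["label"])

# Only non-numeric (string) columns are expanded; integer columns pass through unchanged
string_cols = features.select_dtypes(exclude="number").columns.tolist()
encoded = pd.get_dummies(features, columns=string_cols, dtype="float32")

# Convert to a float array that can be fed to a Keras model
X = encoded.to_numpy(dtype="float32")

print(string_cols)
print(list(encoded.columns))
print(X.shape)
```

Each string column contributes one output column per distinct value, so the width of `X` depends on the data.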

Try:

X = dataset[:,0:17].astype(float).astype(int)

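I think that if you have a string like '45.2', you have to convert it to float first and then convert the float to an integer.

I would be glad if an editor could confirm or correct this answer.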
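A small sketch, using made-up values, of where this conversion applies: it works for strings that hold numbers, but a genuinely categorical string such as 'photo' (the value in the question's error) still raises a ValueError, so such columns would still need get_dummies().

```python
import numpy as np

numeric_strings = np.array(["45.2", "3", "17.9"])
# Strings that look like numbers convert to float, then truncate toward zero as int
as_int = numeric_strings.astype(float).astype(int)
print(as_int)

category_strings = np.array(["photo", "video"])
try:
    category_strings.astype(float)
    converted = True
except ValueError as err:
    # Non-numeric strings cannot be parsed as floats
    converted = False
    print("cannot convert:", err)
```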
