How to perform one-hot encoding on multiple categorical columns

data-mining scikit-learn pandas
2021-09-29 04:08:37

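I am trying to one-hot encode some categorical columns. According to the tutorial I am following, I should apply LabelEncoder before one-hot encoding. I have successfully performed the label encoding, as shown below: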

#categorical data
categorical_cols = ['a', 'b', 'c', 'd'] 
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))

Now I am stuck on how to perform the one-hot encoding and then join the encoded columns back to the dataframe (data).

How can I do this?

3 Answers

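LabelEncoder is not meant for transforming the feature data but for transforming the target (also known as the labels), as the scikit-learn documentation states. If you want to encode feature columns, you should use OrdinalEncoder.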
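To make the OrdinalEncoder suggestion concrete, here is a minimal sketch. The toy frame and its values are assumptions made up for illustration; only the `data` and `categorical_cols` names come from the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for `data`; the values are made up for illustration.
data = pd.DataFrame({
    "a": ["x", "y", "x"],
    "b": ["u", "u", "v"],
    "n": [1.0, 2.0, 3.0],
})
categorical_cols = ["a", "b"]

# OrdinalEncoder accepts several feature columns at once and stores
# one category list per column in `categories_`.
encoder = OrdinalEncoder()
data[categorical_cols] = encoder.fit_transform(data[categorical_cols])

print(data)
print(encoder.categories_)
```

Unlike reusing a single LabelEncoder inside `apply`, the fitted encoder keeps the mapping for every column, so `encoder.inverse_transform` can recover the original values later.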

If you really need to keep the LabelEncoder step:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

categorical_cols = ['a', 'b', 'c', 'd']

# instantiate labelencoder object
le = LabelEncoder()

# apply le on categorical feature columns
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))

ohe = OneHotEncoder()

# One-hot-encode the categorical columns.
# fit_transform returns a SciPy sparse matrix by default, so densify it with .toarray().
array_hot_encoded = ohe.fit_transform(data[categorical_cols]).toarray()

# Convert it to a DataFrame with readable column names
# (get_feature_names_out requires scikit-learn 1.0 or later)
data_hot_encoded = pd.DataFrame(
    array_hot_encoded,
    columns=ohe.get_feature_names_out(categorical_cols),
    index=data.index,
)

# Extract only the columns that didn't need to be encoded
data_other_cols = data.drop(columns=categorical_cols)

# Concatenate the two dataframes
data_out = pd.concat([data_hot_encoded, data_other_cols], axis=1)

Otherwise:

If you want to one-hot encode straight from the raw data (without applying OrdinalEncoder first), I suggest using pandas.get_dummies:

#categorical data
categorical_cols = ['a', 'b', 'c', 'd'] 

import pandas as pd
df = pd.get_dummies(data, columns = categorical_cols)

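You can also pass drop_first=True to drop one of the one-hot columns for each feature, which some models require (for example, linear models with an intercept, where the full set of dummy columns is collinear).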
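As a small illustration of what drop_first changes, the sketch below compares the two outputs on a made-up three-row frame (the data is an assumption, not taken from the question):

```python
import pandas as pd

# Made-up data: one categorical column with three levels and one numeric column.
data = pd.DataFrame({"a": ["x", "y", "z"], "n": [1, 2, 3]})

# Default: one indicator column per category.
full = pd.get_dummies(data, columns=["a"])

# drop_first=True: the first category ("x") becomes the implicit baseline.
reduced = pd.get_dummies(data, columns=["a"], drop_first=True)

print(list(full.columns))     # ['n', 'a_x', 'a_y', 'a_z']
print(list(reduced.columns))  # ['n', 'a_y', 'a_z']
```

A row where both a_y and a_z are 0 then stands for the dropped category x.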

You can use pandas dummy encoding to get a one-hot encoding, as follows:

import pandas as pd

# Multiple categorical columns
categorical_cols = ['a', 'b', 'c', 'd']

pd.get_dummies(data, columns=categorical_cols)

If you want to do the one-hot encoding with the sklearn library, you can do it as follows:

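from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()

# fit_transform returns a SciPy sparse matrix by default, so convert it to a dense array
transformed_data = onehotencoder.fit_transform(data[categorical_cols]).toarray()

# wrap the array in a dataframe with readable column names
# (get_feature_names_out requires scikit-learn 1.0 or later)
encoded_data = pd.DataFrame(
    transformed_data,
    columns=onehotencoder.get_feature_names_out(categorical_cols),
    index=data.index,
)

# now concatenate the original data and the encoded data using pandas
# (the original categorical columns are kept; drop them with
#  data.drop(columns=categorical_cols) if they are no longer needed)
concatenated_data = pd.concat([data, encoded_data], axis=1)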

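If a single column has more than 500 categories, the one-hot encoding approaches above are not a good choice. In that case, we can one-hot encode only the 10 or 20 most frequent categories of that column. Sample code is shown below: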

import numpy as np

categorical_cols = ['a', 'b', 'c', 'd']

# Let's say we have a column 'b' which has more than 500 categories.
# Find the top 10 most frequent categories for column 'b'
# (value_counts already sorts in descending order of frequency)
data['b'].value_counts().head(10)

# make a list of the most frequent categories of the column
top_10_occurring_cat = list(data['b'].value_counts().head(10).index)

# now make the 10 binary variables
for cat in top_10_occurring_cat:
    data[cat] = np.where(data['b'] == cat, 1, 0)  # 1 where data['b'] == cat, else 0

# This is done for one categorical column; you can repeat it for the other categorical columns
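To repeat the top-N idea for several columns without copying the loop, it could be wrapped in a small helper. This is a sketch only: `one_hot_top_n` is a hypothetical name, the toy data is made up, and prefixing the new columns with the source column name is my addition to avoid name clashes between columns.

```python
import pandas as pd

def one_hot_top_n(df, col, n=10):
    """Return a copy of df with 0/1 columns for the n most frequent categories of df[col]."""
    # value_counts sorts by frequency in descending order, so head(n) is the top n.
    top = df[col].value_counts().head(n).index
    out = df.copy()
    for cat in top:
        # Prefix with the column name so categories from different columns cannot collide.
        out[f"{col}_{cat}"] = (out[col] == cat).astype(int)
    return out

# Made-up data: "p" occurs 3 times, "q" twice, "r" once.
data = pd.DataFrame({"b": ["p", "q", "p", "r", "p", "q"]})
result = one_hot_top_n(data, "b", n=2)
print(result)
```

Rows whose category falls outside the top n (here "r") get 0 in every indicator column. Calling the helper once per entry of categorical_cols covers the remaining columns.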

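Creating a pandas DataFrame with multiple one-hot-encoded columns

Suppose you have a pandas DataFrame flags with many columns that need to be one-hot encoded.

You want a pandas DataFrame flags_ohe that has the same columns as flags, except that the columns 'Mainhue', 'Landmass', 'Zone', 'Language', 'Religion', 'Topleft', 'Botright' are replaced by one-hot-encoded versions with explicit column names such as Mainhue_red and Mainhue_blue.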

flags_ohe = flags
categorical_columns = ['Landmass','Zone','Language','Religion', 
                       'Mainhue', 'Topleft', 'Botright']
for col in categorical_columns:
    col_ohe = pd.get_dummies(flags[col], prefix=col)
    flags_ohe = pd.concat((flags_ohe, col_ohe), axis=1).drop(col, axis=1)

Here are the columns before.

print(flags.columns)

# Output:
# Index(['Name', 'Landmass', 'Zone', 'Area', 'Population', 'Language',
#  'Religion', 'Bars', 'Stripes', 'Colors', 'Red', 'Green', 'Blue', 'Gold',
#  'White', 'Black', 'Orange', 'Mainhue', 'Circles', 'Crosses', 'Saltires',
#  'Quarters', 'Sunstars', 'Crescent', 'Triangle', 'Icon', 'Animate',
#  'Text', 'Topleft', 'Botright'],
# dtype='object')

Here are the columns after.

print(flags_ohe.columns)

# Output:
# Index(['Name', 'Area', 'Population', 'Bars', 'Stripes', 'Colors', 'Red',
#  'Green', 'Blue', 'Gold', 'White', 'Black', 'Orange', 'Circles',
#  'Crosses', 'Saltires', 'Quarters', 'Sunstars', 'Crescent', 'Triangle',
#  'Icon', 'Animate', 'Text', 'Landmass_1', 'Landmass_2', 'Landmass_3',
#  'Landmass_4', 'Landmass_5', 'Landmass_6', 'Zone_1', 'Zone_2', 'Zone_3',
#  'Zone_4', 'Language_1', 'Language_2', 'Language_3', 'Language_4',
#  'Language_5', 'Language_6', 'Language_7', 'Language_8', 'Language_9',
#  'Language_10', 'Religion_0', 'Religion_1', 'Religion_2', 'Religion_3',
#  'Religion_4', 'Religion_5', 'Religion_6', 'Religion_7', 'Mainhue_black',
#  'Mainhue_blue', 'Mainhue_brown', 'Mainhue_gold', 'Mainhue_green',
#  'Mainhue_orange', 'Mainhue_red', 'Mainhue_white', 'Topleft_black',
#  'Topleft_blue', 'Topleft_gold', 'Topleft_green', 'Topleft_orange',
#  'Topleft_red', 'Topleft_white', 'Botright_black', 'Botright_blue',
#  'Botright_brown', 'Botright_gold', 'Botright_green', 'Botright_orange',
#  'Botright_red', 'Botright_white'],
# dtype='object')
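The per-column loop above can probably be collapsed into a single pd.get_dummies call with the columns argument, which prefixes each new column with its source column name by default. The sketch below uses a made-up three-row stand-in for flags, not the real dataset:

```python
import pandas as pd

# Made-up miniature stand-in for the flags data.
flags = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Zone": [1, 2, 1],
    "Mainhue": ["red", "blue", "red"],
})
categorical_columns = ["Zone", "Mainhue"]

# Encodes the listed columns, drops the originals, and leaves the others untouched.
flags_ohe = pd.get_dummies(flags, columns=categorical_columns)

print(list(flags_ohe.columns))
# ['Name', 'Zone_1', 'Zone_2', 'Mainhue_blue', 'Mainhue_red']
```

As with the loop, untouched columns come first and the encoded columns follow in the order given in categorical_columns.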