Since this is an old post, this reply may still be useful to others.
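Indeed, some algorithms accept categorical data directly and convert it to a one-hot encoding internally. In that case the model takes the data in its raw format and no explicit transformation step is needed.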
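If the algorithm does not support this, we have to save two models:
- the model used to encode the data
- the model used to predict on the data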
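To illustrate why the fitted encoder has to be kept next to the predictor, here is a minimal sketch on made-up data. The colour values, the targets, and the file names `encoder.pkl` and `model.pkl` are assumptions for illustration only and do not come from the original post. It saves the two fitted objects to separate files and reuses both at prediction time, so new data is encoded with exactly the categories seen during training.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Toy training data, made up for illustration.
colors = np.array(["red", "green", "blue", "green"]).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 2.0])

# Fit the encoder, then fit the regressor on the encoded features.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(colors).toarray()  # dense one-hot matrix
model = LinearRegression().fit(X, y)

# Save the two fitted objects to two separate files (hypothetical names).
out_dir = tempfile.mkdtemp()
encoder_path = os.path.join(out_dir, "encoder.pkl")
model_path = os.path.join(out_dir, "model.pkl")
with open(encoder_path, "wb") as f:
    pickle.dump(encoder, f)
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# At prediction time, load both and encode new data with the saved encoder.
with open(encoder_path, "rb") as f:
    loaded_encoder = pickle.load(f)
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

new_X = loaded_encoder.transform(np.array([["green"]])).toarray()
prediction = loaded_model.predict(new_X)
```

Without the saved encoder, the column order of the one-hot features at prediction time could differ from the order the regressor was trained on.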
More simply, both models can also be saved in a single file. Refer to the code snippet below:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
import pickle as pk
import pandas as pd
import numpy as np

df = pd.read_csv("Some_file.csv")  # replace with the actual CSV file
X = df[["Categorical_feature", "Other_feature"]]  # placeholder feature column names
y = df["Labels"]

# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
encoder = OneHotEncoder(sparse_output=False)
oneHotEncodedFeature = encoder.fit_transform(X[["Categorical_feature"]])

# Concatenate the one-hot columns with the remaining (assumed numeric) features to form the final X
X = np.hstack([oneHotEncodedFeature, X.drop(columns=["Categorical_feature"]).to_numpy()])

linReg = LinearRegression()
linReg.fit(X, y)

# Create a single pickle file that holds both trained objects
with open("models.pkl", "wb") as file:
    pk.dump(encoder, file)  # dump the encoder first
    pk.dump(linReg, file)  # then dump the linear regression model

# For prediction
with open("models.pkl", "rb") as file:
    trained_encoder = pk.load(file)  # the first load returns the OneHotEncoder
    trained_model_for_prediction = pk.load(file)  # the second load returns the trained linear regression model
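
The single-file approach relies on `pickle` writing objects back to back, so they must be loaded in the same order they were dumped. The sketch below shows that behaviour using only the standard library, with plain dictionaries as hypothetical stand-ins for the fitted encoder and regression model. It also shows one possible variation, which is an assumption on my part rather than something from the original answer: storing both objects in one dictionary so they are retrieved by key instead of by position.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted OneHotEncoder and LinearRegression.
encoder_stub = {"categories": ["blue", "green", "red"]}
model_stub = {"coef": [0.5, -1.2, 2.0], "intercept": 0.1}

work_dir = tempfile.mkdtemp()

# Option 1: sequential dumps into one file; loads must follow the same order.
path = os.path.join(work_dir, "models.pkl")
with open(path, "wb") as f:
    pickle.dump(encoder_stub, f)
    pickle.dump(model_stub, f)

with open(path, "rb") as f:
    first = pickle.load(f)   # the object dumped first
    second = pickle.load(f)  # the object dumped second

# Option 2: one dictionary, so each object is looked up by name.
bundle_path = os.path.join(work_dir, "bundle.pkl")
with open(bundle_path, "wb") as f:
    pickle.dump({"encoder": encoder_stub, "model": model_stub}, f)

with open(bundle_path, "rb") as f:
    bundle = pickle.load(f)
```

With the dictionary variant, adding a third object later does not change how the existing ones are read back.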