
How to unscale predictions after preprocessing

Tags: data mining, machine learning, Python, preprocessing

2022-02-05 09:08:15

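I'm new to machine learning and am currently working with the iris dataset. I went through a quick online tutorial on predicting stock prices and figured I would try the same thing with iris myself.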

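The problem I'm having is that I use preprocessing to scale the data before training my classifier. However, when I then make a prediction, the answer also comes out scaled. When I comment out all of the preprocessing, I get accurate results. Is there a way to unscale the predictions?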

The output is rounded to 0, 1, or 2, each of which represents one of the three species.

You can see my code below:

import pandas as pd
import numpy as np
from sklearn import preprocessing, model_selection
from sklearn.linear_model import LinearRegression

df = pd.read_csv("iris.csv")

# setosa - 0
# versicolor - 1
# virginica - 2
df = df.replace("setosa", 0)
df = df.replace("versicolor", 1)
df = df.replace("virginica", 2)

X = np.array(df.drop(columns=['species']))

y = np.array(df['species'])

# Scale features
# X = preprocessing.scale(X)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

clf = LinearRegression(n_jobs=1)  # Linear regression clf

clf.fit(X_train, y_train)

confidence = clf.score(X_test, y_test)

print("Confidence: " + str(confidence))

# Inputs
sepal_length = float(input("Enter sepal length: "))
sepal_width = float(input("Enter sepal width: "))
petal_length = float(input("Enter petal length: "))
petal_width = float(input("Enter petal width: "))

# Create pandas DataFrame from the entered data
index = [0]
d = {'sepal_length': sepal_length, 'sepal_width': sepal_width, 'petal_length': petal_length, 'petal_width': petal_width}
predict_df = pd.DataFrame(data=d, index=index)

# Create np array of features
predict_X = np.array(predict_df)

# Need to scale new X feature values
# predict_X = preprocessing.scale(predict_X, axis=1)

# Make a prediction against prediction features
prediction = clf.predict(predict_X)

print(predict_X, prediction)

rounded_prediction = int(round(prediction[0]))

if rounded_prediction == 0:
    print("== Predicted as Setosa ==")
elif rounded_prediction == 1:
    print("== Predicted as Versicolor ==")
elif rounded_prediction == 2:
    print("== Predicted as Virginica ==")
else:
    print("== Unable to make a prediction ==")

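Here is an example of my output with preprocessing enabled. I'll use a row from the CSV as an example (sepal length 6.4, sepal width 3.2, petal length 4.5, petal width 1.5), which should come out as the versicolor species (1):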

Confidence: 0.9449475378336242
Enter sepal length: 6.4
Enter sepal width: 3.2
Enter petal length: 4.5
Enter petal width: 1.5
[[ 1.39427847 -0.39039797  0.33462683 -1.33850733]] [0.41069281]
== Predicted as Setosa ==

Now with the preprocessing commented out:

Confidence: 0.9132522144785978
Enter sepal length: 6.4
Enter sepal width: 3.2
Enter petal length: 4.5
Enter petal width: 1.5
[[6.4 3.2 4.5 1.5]] [1.29119283]
== Predicted as Versicolor ==

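It seems I am either doing the preprocessing wrong or missing an extra step. I apologize if I got some of the terminology wrong, and thank you in advance for your answers.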

2 Answers

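When you decide that you have to scale your data, you typically need to perform the following steps:

For training:

Scale/normalize the training set
Store the scaling/normalization factors of the training set
Train the model

For prediction:

Scale/normalize the input data, but, very importantly, with the scaling/normalization factors stored during training. You must not compute the min, max, or mean of the new data.
Predict

The reason is that you have to map the new data into the same feature space that was used for training, so you must scale/normalize it with the same factors; otherwise you are changing the feature space.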
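The steps above can be sketched in plain NumPy. This is only a minimal illustration under stated assumptions: the three training rows and the names `train_mean`, `train_std`, and `X_new` are made up for the example and are not part of the question's code.

```python
import numpy as np

# Illustrative toy data (assumption): rows are samples, columns are features.
X_train = np.array([[5.1, 3.5, 1.4, 0.2],
                    [7.0, 3.2, 4.7, 1.4],
                    [6.3, 3.3, 6.0, 2.5]])
X_new = np.array([[6.4, 3.2, 4.5, 1.5]])

# Training: compute the per-feature factors once and keep them.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)
X_train_scaled = (X_train - train_mean) / train_std

# Prediction: reuse the stored factors; never recompute them on the new data.
X_new_scaled = (X_new - train_mean) / train_std

print(X_new_scaled)
```

Because `X_new` is scaled with the training factors, it lands in the same feature space the model was trained on, even when it contains a single row.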

I think your approach is correct, but these lines:

# Scale features
# X = preprocessing.scale(X)

should be changed to:

# Scale features
# X = preprocessing.scale(X, axis = 1)

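because scale defaults to axis=0 (I wonder why!), whereas your prediction code scales with axis=1. If the problem persists, please leave a comment and I will edit the answer.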
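To make the effect of the `axis` argument concrete, here is a small sketch using `sklearn.preprocessing.scale`. The three-row matrix `X` is illustrative data chosen for the example; the single row is the one entered in the question. With `axis=0` each feature (column) is standardized across samples, while with `axis=1` each sample (row) is standardized across its own features.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data (assumption): three samples, four features.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])

# axis=0 (the default): every column ends up with zero mean and unit variance.
per_feature = preprocessing.scale(X, axis=0)

# axis=1: every row is standardized using only its own four values.
row = np.array([[6.4, 3.2, 4.5, 1.5]])
across_features = preprocessing.scale(row, axis=1)

print(per_feature.mean(axis=0))
print(across_features)
```

The `axis=1` result for the single row reproduces the scaled input printed in the question's output, which shows that the row was standardized against itself rather than against the training data.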

Edit

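Although your approach is not wrong, using sklearn's StandardScaler is more appropriate. See the documentation for that class. In general, it is best to fit the scaler on the training data and then transform the test data based on that fit.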
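A hedged sketch of that workflow with `StandardScaler` follows. It assumes scikit-learn's bundled `load_iris` data in place of the question's `iris.csv`, keeps the question's `LinearRegression` model, and fixes `random_state=0` only to make the split repeatable; none of these choices come from the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: use the iris data shipped with scikit-learn (targets are 0, 1, 2).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training data only; it stores the mean and scale.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

clf = LinearRegression()
clf.fit(X_train_scaled, y_train)

# Test data and new inputs are transformed with the stored factors.
print("R^2 on test set:", clf.score(scaler.transform(X_test), y_test))

predict_X = np.array([[6.4, 3.2, 4.5, 1.5]])
prediction = clf.predict(scaler.transform(predict_X))
print(prediction, int(round(float(prediction[0]))))
```

Only the features are scaled here, so the prediction is already on the original 0/1/2 label scale and nothing needs to be unscaled afterwards.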
