Number of features of the model must match the input. Model n_features is "N" and input n_features is "X".

data-mining python scikit-learn feature-selection training
2021-10-02 18:18:20

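I am new to data science and trying to get some results. I am applying a Decision Tree Classifier. When my training and test datasets are not the same size, I get the error "Number of features of the model must match the input. Model n_features is N (the number of entries in the training dataset) and input n_features is X (the number of entries in the test dataset)."

For example, my dataset has 100 entries and the split parameter is test_size=0.30: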

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split



data=pd.read_csv("ndata.csv")

X_train, X_test, y_train, y_test = train_test_split(data.dis, data.gen, test_size=0.30, random_state=42)


c = tree.DecisionTreeClassifier()

y_test_size = y_test.size
y_train_size = y_train.size
X_train = [X_train]
y_train = [y_train]
X_test = [X_test]
y_test = [y_test]

c.fit(X_train, y_train)

accu_train = np.sum(c.predict(X_train) == y_train)/y_train_size
accu_test = np.sum(c.predict(X_test) == y_test)/y_test_size

print("Accuracy on Train: ", accu_train)
print("Accuracy on Test: ", accu_test)

The error occurs as follows:

ValueError                                Traceback (most recent call last)
<ipython-input-33-f6cc77390526> in <module>()
     24 
     25 accu_train = np.sum(c.predict(X_train) == y_train)/y_train_size
---> 26 accu_test = np.sum(c.predict(X_test) == y_test)/y_test_size
     27 
     28 print("Accuracy on Train: ", accu_train)

~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in predict(self, X, check_input)
    410         """
    411         check_is_fitted(self, 'tree_')
--> 412         X = self._validate_X_predict(X, check_input)
    413         proba = self.tree_.predict(X)
    414         n_samples = X.shape[0]

~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in _validate_X_predict(self, X, check_input)
    382                              "match the input. Model n_features is %s and "
    383                              "input n_features is %s "
--> 384                              % (self.n_features_, n_features))
    385 
    386         return X

ValueError: Number of features of the model must match the input. Model n_features is 70 and input n_features is 30 

Link to the data file: https://gist.github.com/mutafaf/7715ad67bc3cf4e08985afefcc0ce08a#file-ndata-csv

Why am I getting this error? Is it necessary for the training and test datasets to be the same size?

1 Answer

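You should pass a numpy array, not a list, as the argument to the DecisionTree. Because your input is a list wrapped around a one-dimensional column, the classifier is trained on it as a single sample with 70 features; your test list has 30 elements, which the classifier treats as 30 features.

Beyond that, you need to reshape the input numpy array and pass it as a matrix with one row per sample.

That is: X_train.values.reshape(-1, 1) instead of [X_train] (it should be a numpy array, not a list).

Here is the whole gist: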
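To make the shape difference concrete, here is a minimal sketch. It does not use the asker's ndata.csv; x_train and x_test are hypothetical stand-ins with 70 and 30 values, chosen only to mirror the 70/30 split in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a single column split 70/30 (not the real data).
x_train = pd.Series(np.arange(70, dtype=float))
x_test = pd.Series(np.arange(30, dtype=float))

# Wrapping a Series in a list yields ONE row whose length is the sample count,
# so an estimator would read it as 1 sample with 70 (or 30) features.
wrapped_train = np.asarray([x_train])
wrapped_test = np.asarray([x_test])
print(wrapped_train.shape)  # (1, 70)
print(wrapped_test.shape)   # (1, 30)

# reshape(-1, 1) yields one row per sample and a single feature column,
# so train and test agree on the feature count even though they differ in rows.
column_train = x_train.values.reshape(-1, 1)
column_test = x_test.values.reshape(-1, 1)
print(column_train.shape)  # (70, 1)
print(column_test.shape)   # (30, 1)
```

The second axis is what the classifier compares between fit and predict, which is why 70 versus 30 raised the error while (70, 1) versus (30, 1) does not.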

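import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

data = pd.read_csv("ndata.csv")

X_train, X_test, y_train, y_test = train_test_split(data.dis, data.gen, test_size=0.30, random_state=42)

c = tree.DecisionTreeClassifier()
# Reshape each 1-D column into an (n_samples, 1) matrix: one row per sample, one feature.
c.fit(X_train.values.reshape(-1, 1), y_train)

# Accuracy is the fraction of predictions that match the true labels.
accu_train = np.sum(c.predict(X_train.values.reshape(-1, 1)) == y_train) / y_train.size
accu_test = np.sum(c.predict(X_test.values.reshape(-1, 1)) == y_test) / y_test.size

print("Accuracy on Train: ", accu_train)
print("Accuracy on Test: ", accu_test)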

I get the following output:

Accuracy on Train:  0.8857142857142857
Accuracy on Test:  0.7333333333333333
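As for whether the two sets must be the same size: they need not be. Only the number of feature columns has to match. The sketch below illustrates this on synthetic data (the arrays x and y are made up for illustration and are unrelated to ndata.csv), and also shows that the estimator's built-in score method gives the same accuracy as the manual calculation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data: 100 samples of one feature with a simple label rule.
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=100)
y = (x > 5).astype(int)

# Reshape once, before splitting, so both halves are (n_samples, 1) matrices.
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)  # (70, 1) (30, 1)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Manual accuracy versus the estimator's score method.
manual = np.mean(clf.predict(X_test) == y_test)
builtin = clf.score(X_test, y_test)
print(np.isclose(manual, builtin))
```

Reshaping before the split avoids repeating .reshape(-1, 1) at every fit and predict call.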

Thanks for sharing the dataset. It was helpful for testing.