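I am new to data science and trying to get some results. I am applying a Decision Tree Classifier. When my training and test datasets are not the same size, I get the error "Number of features of the model must match the input. Model n_features is N and input n_features is X", where N is the number of entries in the training dataset and X is the number of entries in the test dataset.
For example, with 100 entries in my dataset and the split parameter test_size=0.30: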
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
data=pd.read_csv("ndata.csv")
X_train, X_test, y_train, y_test = train_test_split(data.dis, data.gen, test_size=0.30, random_state=42)
c = tree.DecisionTreeClassifier()
y_test_size = y_test.size
y_train_size = y_train.size
X_train = [X_train]
y_train = [y_train]
X_test = [X_test]
y_test = [y_test]
c.fit(X_train, y_train)
accu_train = np.sum(c.predict(X_train) == y_train)/y_train_size
accu_test = np.sum(c.predict(X_test) == y_test)/y_test_size
print("Accuracy on Train: ", accu_train)
print("Accuracy on Test: ", accu_test)
The error occurs as follows:
ValueError Traceback (most recent call last)
<ipython-input-33-f6cc77390526> in <module>()
24
25 accu_train = np.sum(c.predict(X_train) == y_train)/y_train_size
---> 26 accu_test = np.sum(c.predict(X_test) == y_test)/y_test_size
27
28 print("Accuracy on Train: ", accu_train)
~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in predict(self, X, check_input)
410 """
411 check_is_fitted(self, 'tree_')
--> 412 X = self._validate_X_predict(X, check_input)
413 proba = self.tree_.predict(X)
414 n_samples = X.shape[0]
~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in _validate_X_predict(self, X, check_input)
382 "match the input. Model n_features is %s and "
383 "input n_features is %s "
--> 384 % (self.n_features_, n_features))
385
386 return X
ValueError: Number of features of the model must match the input. Model n_features is 70 and input n_features is 30
Link to the data file: https://gist.github.com/mutafaf/7715ad67bc3cf4e08985afefcc0ce08a#file-ndata-csv
Why am I getting this error? Is it necessary for the training and test datasets to be the same size?
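
A likely cause, sketched here under stated assumptions: wrapping a 1-D Series in a list (as in `X_train = [X_train]`) produces a single row with 70 columns, so scikit-learn sees one sample with 70 features, and the 30-entry test split then looks like one sample with 30 features. The two splits do not need to be the same size; only the number of feature columns has to match. The sketch below uses synthetic stand-in data (the `dis` and `gen` values are invented for illustration, not read from `ndata.csv`) and reshapes the single feature into a column with `reshape(-1, 1)`.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the "dis" and "gen" columns (assumption: one numeric
# feature and a binary label; the real ndata.csv may look different).
rng = np.random.RandomState(42)
dis = rng.randint(0, 10, size=100)
gen = (dis > 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    dis, gen, test_size=0.30, random_state=42
)

# Wrapping a 1-D array in a list yields ONE sample with many features,
# which is where "70" and "30" in the error message come from.
wrapped_train_shape = np.asarray([X_train]).shape
wrapped_test_shape = np.asarray([X_test]).shape
print("Wrapped shapes:", wrapped_train_shape, wrapped_test_shape)

# scikit-learn expects X with shape (n_samples, n_features), so a single
# feature becomes a column vector.
X_train_2d = np.asarray(X_train).reshape(-1, 1)
X_test_2d = np.asarray(X_test).reshape(-1, 1)

c = tree.DecisionTreeClassifier(random_state=42)
c.fit(X_train_2d, y_train)  # y stays 1-D, one label per sample

# score() returns mean accuracy, replacing the manual np.sum(...) / size.
accu_train = c.score(X_train_2d, y_train)
accu_test = c.score(X_test_2d, y_test)
print("Accuracy on Train: ", accu_train)
print("Accuracy on Test: ", accu_test)
```

With the real data, the equivalent change would presumably be to drop the four list-wrapping lines and pass `X_train.values.reshape(-1, 1)` and `X_test.values.reshape(-1, 1)` instead, assuming `dis` is a single numeric column.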