Query data dimension

Data mining, machine learning models
2022-02-13 15:02:38
import numpy as np
# train_test_split lives in sklearn.model_selection (the cross_validation module was removed)
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('Downloads/breast-cancer-wisconsin.data.txt', skiprows=1)
df.replace('?', -99999, inplace=True)
df.drop(columns='id', inplace=True)

X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

#clf = neighbors.KNeighborsClassifier()
# the normalize argument was removed from LinearRegression in recent scikit-learn releases
clf = LinearRegression()
clf.fit(X_train, y_train)

accuracy= clf.score(X_test, y_test)
print(accuracy)

example_measures = np.array([[4,2,1,1,1,2,3,2,1],[4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(1,-1)

prediction = clf.predict(example_measures)     ##(example_measures)

print(prediction)

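When I run the code above on Ubuntu or in Anaconda, I get this error:

ValueError: query data dimension must match training data dimension

How can I fix this? I isolated the problem by running the statements one at a time and found that the error is raised at this line: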

prediction = clf.predict(example_measures)

I tried using:

prediction = clf.predict(X_test)

That works fine. But what I really want is to predict the examples I created. How should I change the code?

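1 Answer

How many columns do X_train and X_test have?

My guess (which I can't confirm, since I don't have access to your data) is that they have fewer (or more) than 18 columns.

This matters because your code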

example_measures = np.array([[4,2,1,1,1,2,3,2,1],[4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(1,-1)

produces an array of shape (1, 18): reshape(1, -1) merges both 9-value samples into a single row.
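The reshape behaviour can be checked without any model. The following is a minimal sketch using only NumPy; it assumes each sample has 9 features, as in the question's example, and the names `flattened` and `kept_rows` are illustrative, not from the original code.

```python
import numpy as np

# Two samples with 9 features each, as in the question.
example_measures = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1],
                             [4, 2, 1, 2, 2, 2, 3, 2, 1]])
print(example_measures.shape)  # (2, 9)

# reshape(1, -1) forces a single row, so both samples are merged into one.
flattened = example_measures.reshape(1, -1)
print(flattened.shape)  # (1, 18)

# Keeping one row per sample preserves the 9 columns.
kept_rows = example_measures.reshape(len(example_measures), -1)
print(kept_rows.shape)  # (2, 9)
```

If the model was trained on 9 columns, `kept_rows` (or the original 2-D array, untouched) is the shape that predict expects; `reshape(1, -1)` is only appropriate for a single 1-D sample.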

Edit: I tried to match your dataset and got the following:

from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data
y = load_breast_cancer().target

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

clf = LinearRegression()
clf.fit(X_train, y_train)

# For a regressor, score() returns R^2 rather than classification accuracy
accuracy = clf.score(X_test, y_test)
print(accuracy)

If I call X.shape, I get (569, 30), so if you want to build your own array to pass to clf, it needs to have 30 columns (one per feature).
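As a rough sketch of that last point, the snippet below trains on the scikit-learn copy of the dataset and predicts on an array with the same number of columns as the training data. It assumes a recent scikit-learn release; reusing two held-out rows as the query and the fixed `random_state` are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LinearRegression()
clf.fit(X_train, y_train)

# Illustrative query: two held-out rows, so the values are realistic.
example_measures = X_test[:2]

# Check the column count before predicting to catch a mismatch early.
n_features = X_train.shape[1]
assert example_measures.shape[1] == n_features

prediction = clf.predict(example_measures)
print(prediction.shape)  # (2,): one prediction per query row
```

For hand-written samples, the same check applies: the array should have shape (number of samples, `n_features`) before it is passed to predict.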