High accuracy on the cross-validation training split, but low accuracy on the test dataset

data-mining scikit-learn accuracy
2022-02-18 09:57:45

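I am using the Adult dataset from UCI and trying to predict a person's native country from workclass, education.num, race, gender, and income. I am using sklearn and KNeighborsClassifier.

With this, predicting on the test split gives me an accuracy of 0.917252 at K = 15. However, when I use the test dataset (given at the UCI link), the accuracy is below 1%.

Where am I going wrong?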

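import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv('datasets/adult_data/adult.csv')
test = pd.read_csv('datasets/adult_data/adulttest.csv')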

# Cleaning null values
train = train[train["workclass"] != " ?"]
train = train[train["occupation"] != " ?"]
train = train[train["native.country"] != " ?"]
test = test[test["workclass"] != " ?"]
test = test[test["occupation"] != " ?"]
test = test[test["native.country"] != " ?"]

category_col =['workclass', 'race', 'education','marital.status', 'occupation',
               'relationship', 'gender', 'native.country', 'income']
for col in category_col:
    b, c = np.unique(train[col], return_inverse=True) 
    train[col] = c

for col in category_col:
    b, c = np.unique(test[col], return_inverse=True) 
    test[col] = c

features = train.drop('native.country',axis=1)
target = train["native.country"]
features_test = test.drop('native.country', axis=1)
features_target = test["native.country"]

features_name = ['workclass', 'education.num', 'race', 'gender', 'income']

features = features[features_name]
features_test = features_test[features_name]

# Splitting data into train and test data
X_train,X_test,y_train,y_test = train_test_split(features,target,random_state = 12)

k_values = np.arange(1,26)
scores = []

for i in k_values:
    clf = KNeighborsClassifier(n_neighbors=i)
    clf.fit(X_train,y_train)

    y_predict = clf.predict(features_test)
    # y_predict = clf.predict(X_train)

    scores.append(metrics.accuracy_score(features_target, y_predict))
    # scores.append(metrics.accuracy_score(y_train, y_predict))

# np.argmax returns a 0-based index, so map it back to the actual K value
print("Accuracy for K = {} is {}".format(k_values[np.argmax(scores)], max(scores)))

plt.plot(np.arange(1,26),scores)
plt.title('Variation of accuracy with K value, with all features')
plt.xlabel('K values')
plt.ylabel('Accuracy')

plt.show()
1 Answer

I think you used the commented-out lines in the loop to compute the cross-validation accuracy.

# y_predict = clf.predict(X_train)

It should be

 y_predict = clf.predict(X_test)

Then you should check the accuracy score against 'y_test' instead of 'y_train'.
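Put together, the loop with that change might look like the sketch below. It is only an illustration: the data comes from `make_classification` as a synthetic stand-in for the Adult features (an assumption, not the real dataset), and `heldout_scores` and `best_k` are names made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the five Adult features (assumption: not the real data).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)

k_values = np.arange(1, 26)
heldout_scores = []

for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    # Predict on the held-out split, not on the data the model was fitted on.
    y_predict = clf.predict(X_test)

    # Score against the held-out labels that match those predictions.
    heldout_scores.append(accuracy_score(y_test, y_predict))

# argmax is a 0-based index, so look up the K value it corresponds to.
best_k = k_values[int(np.argmax(heldout_scores))]
print("Best K on the held-out split: {} (accuracy {:.3f})".format(
    best_k, max(heldout_scores)))
```

The predictions and the labels passed to `accuracy_score` always come from the same split, which is the point of the fix.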

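You are predicting on the same data that KNN learned from. Because KNN stores its training data, each training point finds itself (or near-identical points) among its nearest neighbors, which inflates the accuracy.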
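A small sketch of that effect, under stated assumptions: the data is synthetic (`make_classification` with some label noise via `flip_y`), not the Adult dataset, and K is set to 1 to make the memorization as visible as possible. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 20% of labels randomly reassigned (assumption: not the real dataset).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With K = 1, the nearest neighbor of every training point is the point itself.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)    # scored on data the model has stored
heldout_accuracy = clf.score(X_test, y_test)    # scored on data the model has not seen

print("Accuracy on training data: {:.3f}".format(train_accuracy))
print("Accuracy on held-out data: {:.3f}".format(heldout_accuracy))
```

With larger K the training accuracy is no longer perfect, but it is still measured on points the model already holds, so it tends to look better than the held-out figure.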