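I am working with the Adult dataset from UCI and trying to predict a person's native country from workclass, education.num, race, gender, and income. I am using sklearn's KNeighborsClassifier.
With this setup, predicting on the held-out test split gives an accuracy of 0.917252 at K = 15. However, when I use the separate test dataset (provided at the UCI link), the accuracy is below 1%.
Where am I going wrong?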
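import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv('datasets/adult_data/adult.csv')
test = pd.read_csv('datasets/adult_data/adulttest.csv')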
# Cleaning null values
train = train[train["workclass"] != " ?"]
train = train[train["occupation"] != " ?"]
train = train[train["native.country"] != " ?"]
test = test[test["workclass"] != " ?"]
test = test[test["occupation"] != " ?"]
test = test[test["native.country"] != " ?"]
category_col = ['workclass', 'race', 'education', 'marital.status', 'occupation',
                'relationship', 'gender', 'native.country', 'income']
# Replace each categorical column with integer codes (train and test are encoded separately)
for col in category_col:
    b, c = np.unique(train[col], return_inverse=True)
    train[col] = c
for col in category_col:
    b, c = np.unique(test[col], return_inverse=True)
    test[col] = c
features = train.drop('native.country',axis=1)
target = train["native.country"]
features_test = test.drop('native.country', axis=1)
features_target = test["native.country"]
features_name = ['workclass', 'education.num', 'race', 'gender', 'income']
features = features[features_name]
features_test = features_test[features_name]
# Splitting data into train and test data
X_train,X_test,y_train,y_test = train_test_split(features,target,random_state = 12)
k_values = np.arange(1,26)
scores = []
for i in k_values:
    clf = KNeighborsClassifier(n_neighbors=i)
    clf.fit(X_train, y_train)
    y_predict = clf.predict(features_test)
    # y_predict = clf.predict(X_train)
    scores.append(metrics.accuracy_score(features_target, y_predict))
    # scores.append(metrics.accuracy_score(y_train, y_predict))
# np.argmax returns a 0-based index, so look up the matching K in k_values
print("Accuracy for {} is {}".format(k_values[np.argmax(scores)], max(scores)))
plt.plot(np.arange(1,26),scores)
plt.title('Variation of accuracy with K value, with all features')
plt.xlabel('K values')
plt.ylabel('Accuracy')
plt.show()
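
One thing that may be worth checking in the code above is whether the integer codes mean the same thing in both files. Because `np.unique(..., return_inverse=True)` is called separately on `train` and `test`, a given category can receive a different code in each if the two files do not contain exactly the same set of values. Below is a minimal sketch of that effect. The country values and the names `train_toy`, `test_toy`, `shared_train`, and `shared_test` are made up for illustration and are not taken from the dataset; the shared-mapping part is just one possible way to keep the codes aligned, not necessarily the cause or the fix for my case.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for one categorical column in each file (illustrative values only).
train_toy = np.array(["Cuba", "India", "Mexico", "United-States"])
test_toy = np.array(["India", "United-States", "United-States"])

# Encode each array on its own, the same way the loops above do.
_, train_codes = np.unique(train_toy, return_inverse=True)
_, test_codes = np.unique(test_toy, return_inverse=True)

# The same label ends up with different codes in the two arrays.
us_train_code = int(train_codes[3])  # "United-States" in the train array
us_test_code = int(test_codes[1])    # "United-States" in the test array
print(us_train_code, us_test_code)

# Alternative: derive the categories once from train and reuse them for test.
# Values that never occur in train would be coded as -1.
categories = np.unique(train_toy)
shared_train = pd.Categorical(train_toy, categories=categories).codes
shared_test = pd.Categorical(test_toy, categories=categories).codes
print(list(shared_train), list(shared_test))
```

If the two printed codes differ for the real columns as well, the classifier would be scored against labels that use a different numbering than the one it was trained on.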