Context:
I'm building a model to predict the offense type (7 classes) from NYPD complaint data.
```python
from sklearn import tree
from sklearn.metrics import accuracy_score

features = ['occurrence_hour', 'borough_labels', 'time_to_entry']
X_train, y_train = train[features], train['offense_labels']
X_test, y_test = test[features], test['offense_labels']

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
# 0.437
```
With these simple features we reach 43.7% accuracy. But if we add a day-of-week feature, accuracy drops to 38.1%. (Aside: 40.6% of the entries fall into the "grand larceny" class, so always guessing "grand larceny" would already give 40.6% accuracy.)
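That majority-class floor can be checked directly with scikit-learn's `DummyClassifier`. The snippet below is a self-contained sketch using synthetic labels (the 7-class distribution with a ~40% majority class is assumed, standing in for the real offense labels):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 7 offense classes where class 0 ("grand larceny")
# makes up ~40% of the labels, mirroring the distribution above.
rng = np.random.default_rng(0)
y = rng.choice(7, size=1000, p=[0.40, 0.15, 0.12, 0.10, 0.09, 0.08, 0.06])
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
acc = accuracy_score(y, baseline.predict(X))
print(acc)  # roughly 0.40: the no-skill floor any model should beat
```

Any model scoring below this baseline is doing worse than ignoring the features entirely.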
```python
features = ['occurrence_hour', 'borough_labels', 'time_to_entry', 'day_of_week']
X_train, y_train = train[features], train['offense_labels']
X_test, y_test = test[features], test['offense_labels']

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
# 0.381
```
So my question is: how can adding information lower a decision tree's accuracy? Shouldn't extra features only improve our predictive power?
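For what it's worth, the effect is easy to reproduce with synthetic data (the dataset below is made up, not the NYPD set): an unpruned `DecisionTreeClassifier` will happily split on a pure-noise column to memorize the training labels, so adding a weak or irrelevant feature can lower test accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 4000
hour = rng.integers(0, 24, size=n)               # one genuinely informative feature
true_class = hour // 4                           # 6 classes determined by the hour
flip = rng.random(n) < 0.3                       # 30% label noise
y = np.where(flip, rng.integers(0, 6, size=n), true_class)
noise = rng.normal(size=(n, 1))                  # a completely uninformative feature

X_small = hour.reshape(-1, 1)
X_big = np.hstack([X_small, noise])

Xs_tr, Xs_te, Xb_tr, Xb_te, y_tr, y_te = train_test_split(
    X_small, X_big, y, random_state=0)

acc_small = DecisionTreeClassifier(random_state=0).fit(Xs_tr, y_tr).score(Xs_te, y_te)
acc_big = DecisionTreeClassifier(random_state=0).fit(Xb_tr, y_tr).score(Xb_te, y_te)
print(acc_small, acc_big)  # the extra noise column lowers test accuracy here
```

With only `hour`, the tree predicts each hour's majority class; with the noise column it grows deep enough to fit the noisy training labels exactly and generalizes worse. Constraining depth (e.g. `max_depth`) or pruning largely removes the gap.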