我有一个二进制分类问题,我正在使用 Scikit 的 RandomForestClassifier 解决这个问题。当我绘制(到目前为止)最重要的特征(如箱线图)以查看其中是否有异常值时,我发现了许多异常值。所以我试图从数据集中删除它们。
准确率和交叉验证下降了大约 5%。我有 80% 的准确率和 0.8 的 Cross-Val-Score
从 3 个最重要的特征(RF 的特征重要性)中去除异常值后,准确率和 Cross-Val-Score 分别下降到 76% 和 77%。
这是我的数据集描述的一部分:
这是准确性和 Cross-Val-Score:
Accuracy score: 0.808388941849
Average Cross-Val-Score: 0.80710845698
这是我删除异常值的方法:
clean_model = basic_df.copy()
print('Clean model shape (before clearing out outliers): ', clean_model.shape)
# Drop 'num_likes' outliers
clean_model.drop(clean_model[clean_model.num_likes > (1938 + (1.5* (1938-125)))].index, inplace=True)
print('Clean model shape (after clearing out "num_likes" outliers): ', clean_model.shape)
# Drop 'num_shares' outliers
clean_model.drop(clean_model[clean_model.num_shares > (102 + (1.5* (102-6)))].index, inplace=True)
print('Clean model shape (after clearing out "num_shares" outliers): ', clean_model.shape)
# Drop 'num_comments' outliers
clean_model.drop(clean_model[clean_model.num_comments > (54 + (1.5* (54-6)))].index, inplace=True)
print('Clean model shape (after clearing out "num_comments" outliers): ', clean_model.shape)
以下是去除异常值后的形状:
Clean model shape (before clearing out outliers): (6992, 20)
Clean model shape (after clearing out "num_likes" outliers): (6282, 20)
Clean model shape (after clearing out "num_shares" outliers): (6024, 20)
Clean model shape (after clearing out "num_comments" outliers): (5744, 20)
这是删除异常值后的箱线图(仍然有异常值..如果我也删除这些,我将只有很少的数据点):

这是去除异常值并使用相同模型后的准确率和 Cross-Val-Score:
Accuracy score: 0.767981438515
Average Cross-Val-Score: 0.779092230906
为什么去除异常值会降低准确性和 F1 分数?我应该把它们留在数据集中吗?或者删除要在第二个箱线图中看到的异常值(在删除第一个异常值之后,如上所示)?
这是我的模型:
model= RandomForestClassifier(n_estimators=120, criterion='entropy',
max_depth=7, min_samples_split=2,
#max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features=8, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
bootstrap=True, oob_score=False, n_jobs=1,
verbose=0, warm_start=False,
class_weight=None,
random_state=23)
model.fit(x_train, y_train)
print('Accuracy score: ', model.score(x_test,y_test))
print('Average Cross-Validation-Score: ', np.mean(cross_val_score(model, x_train, y_train, cv=5))) # 5-Fold Cross validation




