Data mining - Decision tree classifier: possible overfitting

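I have a dataset with the following specifications:

Training dataset with 52968 samples, 8562 of them positive
Test dataset with 13242 samples, 2135 of them positive
137 features in total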
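
As a quick sanity check on these counts, the share of positives in each split can be computed directly. The sketch below uses only the numbers listed above; the helper name `positive_rate` is made up for illustration.

```python
def positive_rate(n_samples, n_positive):
    """Fraction of samples that belong to the positive class."""
    return n_positive / n_samples


train_rate = positive_rate(52968, 8562)
test_rate = positive_rate(13242, 2135)

# Accuracy of a trivial model that always predicts the negative class on the test set.
majority_baseline = 1 - test_rate

print(f"train positives: {train_rate:.4f}")
print(f"test positives:  {test_rate:.4f}")
print(f"always-negative test accuracy: {majority_baseline:.4f}")
```

Both splits come out at roughly 16% positives, so a model that always predicts the negative class would already reach about 0.84 accuracy on the test set. That is worth keeping in mind when reading the accuracy figures further down.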

I want to perform binary classification. I create a DecisionTreeClassifier in a pipeline:

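from sklearn.decomposition import RandomizedPCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier

def build_pipeline():
    # Impute missing values, drop low-variance features, reduce to 16 components, then classify.
    imp = Imputer(strategy="most_frequent", axis=0)
    var_thr = VarianceThreshold(threshold=1.7)
    pca = RandomizedPCA(n_components=16)
    clf = DecisionTreeClassifier(max_features=0.86, max_depth=42)

    return Pipeline(steps=[('imp', imp),
                           ('var_thr', var_thr),
                           ('pca', pca),
                           ('clf', clf)
    ])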

I also tried oversampling the positive samples in the training data:

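import pandas as pd

def oversample_positives(x_train, y_train):
    # Select the positive labels and the matching feature rows.
    series = y_train[y_train == 1]
    dupli = x_train.loc[series.index.tolist(), :]
    # Append five extra copies of every positive sample.
    for _ in range(5):
        x_train = pd.concat([x_train, dupli])
        y_train = pd.concat([y_train, series])

    return x_train, y_train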

After fitting my model, the score on my training data is 0.9954, and cross-validation gives:

cross_val_score(clf, x_train, y_train, cv=5)
[ 0.90225866  0.90638078  0.90592215  0.90007453  0.90632345]
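
One possible reason for cross-validation scores near 0.90, assuming `cross_val_score` was run on the `x_train` that already contains the duplicated positives: copies of the same sample can land in both the training folds and the validation fold, so the model is partly scored on rows it has already seen. The plain-Python sketch below illustrates the effect on made-up sizes; `leaked_fraction` is a hypothetical helper, and the folds are simple shuffled splits rather than scikit-learn's exact stratified ones.

```python
import random


def leaked_fraction(n_neg, n_pos, extra_copies, k=5, seed=0):
    """Share of validation rows whose original sample also sits in the training folds."""
    rng = random.Random(seed)
    # Represent every row by the id of the original sample it came from.
    rows = list(range(n_neg + n_pos))
    positive_ids = list(range(n_neg, n_neg + n_pos))
    rows += positive_ids * extra_copies  # duplicate the positives
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k roughly equal folds

    leaked = 0
    for i, validation in enumerate(folds):
        # Ids present in the other k-1 folds, i.e. what the model would be trained on.
        train_ids = {r for j, fold in enumerate(folds) if j != i for r in fold}
        leaked += sum(1 for r in validation if r in train_ids)
    return leaked / len(rows)


print(leaked_fraction(n_neg=1000, n_pos=200, extra_copies=0))  # 0.0, no duplicates
print(leaked_fraction(n_neg=1000, n_pos=200, extra_copies=5))  # positives copied 5 more times
```

If this is what happened, duplicating the positives only inside each training fold (after the split) should give cross-validation scores that track held-out performance more closely.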

The classification report for the training data looks perfect:

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     44406
          1       1.00      1.00      1.00     42810

avg / total       1.00      1.00      1.00     87216

The confusion matrix is:

[[44203   203]
 [  190 42620]]
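
The report rounds to two decimals, so the accuracy implied by this matrix is worth recomputing by hand. The sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the helper names are illustrative.

```python
def accuracy(cm):
    """Overall accuracy: correctly classified rows (the diagonal) over all rows."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def misclassified(cm):
    """Number of rows that fall off the diagonal."""
    return sum(sum(row) for row in cm) - sum(cm[i][i] for i in range(len(cm)))


train_cm = [[44203, 203],
            [190, 42620]]

print(f"training accuracy: {accuracy(train_cm):.4f}")
print(f"misclassified training rows: {misclassified(train_cm)}")  # 393
```

The training fit is therefore very close to 1.00 but not exact: 393 of the 87216 training rows are misclassified.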

But the results on the test data are much worse:

             precision    recall  f1-score   support

          0       0.85      0.85      0.85     11107
          1       0.21      0.21      0.21      2135

avg / total       0.75      0.75      0.75     13242

The confusion matrix is:

[[9428 1679]
 [1687  448]]
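
The class 1 row of the test report follows directly from this matrix. A minimal sketch, assuming rows are true labels and columns are predictions (the function name is made up for illustration):

```python
def positive_class_metrics(cm):
    """Precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)  # of everything predicted positive, how much is right
    recall = tp / (tp + fn)     # of all true positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


test_cm = [[9428, 1679],
           [1687, 448]]

precision, recall, f1 = positive_class_metrics(test_cm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.21 recall=0.21 f1=0.21
```

Only 448 of the 2135 positive test samples are found, and a precision of 0.21 is not far above the roughly 0.16 share of positives in the test set.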

I used GridSearchCV to tune threshold, n_components, max_features and max_depth. How can I improve my model and get better predictions?
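
For reference, a search over those four parameters might be wired up roughly as follows. This is a sketch, not the code from the post: the grid values are placeholders, `scoring="recall"` is an assumption motivated by wanting better recall on the positive class, and `SimpleImputer` and `PCA(svd_solver="randomized")` stand in for `Imputer` and `RandomizedPCA`, which recent scikit-learn releases no longer ship. `build_search` is a hypothetical helper name.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def build_search(cv=5):
    """Grid search over the pipeline steps, addressed as '<step name>__<parameter>'."""
    pipe = Pipeline(steps=[
        ("imp", SimpleImputer(strategy="most_frequent")),
        ("var_thr", VarianceThreshold()),
        ("pca", PCA(svd_solver="randomized", random_state=0)),
        ("clf", DecisionTreeClassifier(random_state=0)),
    ])
    # Placeholder values; the real grid depends on the data.
    param_grid = {
        "var_thr__threshold": [0.0, 0.5],
        "pca__n_components": [2, 4],
        "clf__max_features": [0.5, 0.86],
        "clf__max_depth": [3, 6],
    }
    # Select the combination with the best recall on class 1 instead of accuracy.
    return GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
```

Calling `build_search().fit(x_train, y_train)` and reading `best_params_` shows the chosen combination. The reported score is only meaningful if the rows passed in do not contain duplicates that end up in both training and validation folds.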

Edit -------> I changed clf in the pipeline. I now use RandomForestClassifier.

clf = RandomForestClassifier(
    n_estimators=500, n_jobs=-1, max_features=0.5, max_depth=15, 
    random_state=1
)

Now cross-validation gives:

[ 0.81552396  0.81218827  0.82331021  0.81488276  0.81769191]

Classification report and confusion matrix for the training data:

score train result: 0.8514148780040359
             precision    recall  f1-score   support

          0       0.86      0.85      0.85     44406
          1       0.85      0.85      0.85     42810

avg / total       0.85      0.85      0.85     87216

[[37757  6649]
 [ 6310 36500]]

Classification report and confusion matrix for the test data:

score test result: 0.7341791270200876
             precision    recall  f1-score   support

          0       0.89      0.79      0.83     11107
          1       0.30      0.47      0.36      2135

avg / total       0.79      0.73      0.76     13242

[[8719 2388]
 [1132 1003]]

It looks better, but I am looking for a model with a recall of 0.90 or higher on both the training and test datasets.
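
One common way to trade precision for recall, not something the post reports trying, is to lower the decision threshold on the predicted probability of class 1 instead of using the default 0.5. The sketch below is plain Python on made-up scores; `threshold_for_recall` and `precision_recall_at` are hypothetical helpers, and in practice the scores would come from something like `predict_proba` on held-out data.

```python
import math


def threshold_for_recall(y_true, scores, target_recall):
    """Largest threshold t such that predicting 1 when score >= t reaches the target recall."""
    positive_scores = sorted((s for y, s in zip(y_true, scores) if y == 1), reverse=True)
    if not positive_scores:
        raise ValueError("no positive samples")
    # Number of positives that must be caught (small epsilon guards against float noise).
    needed = max(1, math.ceil(target_recall * len(positive_scores) - 1e-9))
    return positive_scores[needed - 1]


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 4 positives, 6 negatives (made-up scores).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.1, 0.1, 0.05, 0.6]

t = threshold_for_recall(y_true, scores, target_recall=0.75)
print(t, precision_recall_at(y_true, scores, t))
```

On the toy data a threshold of 0.4 reaches 0.75 recall at 0.6 precision. With real scores, pushing recall toward 0.90 will usually cost precision, so the threshold should be picked on validation data and then checked on the test set.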