Data mining - Improving the precision of a binary classifier - Decision tree in Python

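I am currently working on a project. The dataset is roughly balanced at a 50:50 ratio. I built a decision tree classifier. I get decent accuracy on the validation data (~75%), but the precision for the target variable is skewed: for class=0 it is about 98%, while for class=1 it is only 17%.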

I tried scaling the data with MinMaxScaler, but still no luck.

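from sklearn import metrics, preprocessing, tree
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# X and Y are the feature matrix and the binary target, loaded elsewhere
model = tree.DecisionTreeClassifier(class_weight={1: 30}, min_samples_leaf=160, max_depth=10)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=10)

# Fit the scaler on the training split only, then reuse it on the test split
min_max_scaler = preprocessing.MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)

model = model.fit(X_train_scaled, y_train)

prediction = model.predict(X_test_scaled)

print(metrics.accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))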

Size of X_train_scaled = 12600 and X_test_scaled = 5400
Accuracy: 75%
Precision: {0:100%, 1:17%}
Recall: {0:74%, 1:100%}
F1-Score: {0:85%, 1:29%}

How can I improve the precision for class=1 while maintaining the overall precision and accuracy?
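
One setting that may be worth checking is class_weight={1: 30}. On a roughly balanced dataset, weighting class 1 thirty times more heavily than class 0 tends to push the tree toward predicting class 1, which usually raises recall for class 1 at the cost of its precision. The sketch below compares the two settings side by side. It runs on synthetic data from make_classification, which is an assumption standing in for the real dataset, so the numbers it prints will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, roughly balanced stand-in data (an assumption, not the original dataset)
X, y = make_classification(n_samples=18000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# Same tree settings as in the question, with and without the class weight.
# The weight dict is written with both classes explicit.
settings = [("unweighted", None), ("class 1 x30", {0: 1, 1: 30})]

results = {}
for name, weights in settings:
    model = DecisionTreeClassifier(class_weight=weights, min_samples_leaf=160,
                                   max_depth=10, random_state=10)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "precision_1": precision_score(y_test, pred, pos_label=1, zero_division=0),
        "recall_1": recall_score(y_test, pred, pos_label=1, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

If the unweighted tree shows a better balance between precision and recall for class 1 on the real data, lowering or removing the weight is probably the simplest first step.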

model = clf.fit(df[features], df[label])
df["proba"] = model.predict_proba(df[features])[:, 1]
threshold = 0.4  # You can play with this value (default is 0.5)
df["pred"] = df["proba"].apply(lambda el: 1.0 if el >= threshold else 0.0)
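
Rather than picking the threshold by hand, one option is to scan candidate thresholds on held-out data and keep the one with the best trade-off. The sketch below uses precision_recall_curve from scikit-learn and picks the threshold that maximizes F1 for class 1. The synthetic data, the unweighted tree, and the choice of F1 as the selection criterion are all assumptions made for illustration; a different criterion (for example, a minimum recall) may suit the real problem better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the original dataset)
X, y = make_classification(n_samples=18000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=10)

model = DecisionTreeClassifier(min_samples_leaf=160, max_depth=10, random_state=10)
model.fit(X_train, y_train)

# Predicted probability of class 1 on held-out data
proba = model.predict_proba(X_val)[:, 1]

# precision[i] and recall[i] describe the predictions "proba >= thresholds[i]";
# both arrays carry one extra trailing entry with no matching threshold.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best = int(np.argmax(f1))
threshold = thresholds[best]
pred = (proba >= threshold).astype(int)

print("threshold:", threshold)
print("precision:", p[best], "recall:", r[best], "f1:", f1[best])
```

The threshold should be chosen on a validation split and then checked once on a separate test split, since tuning it on the same data used for the final report gives an optimistic estimate.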