Data Mining - Finding effective features for a machine learning classification task with scikit-learn - 吾爱随笔录

Tags: data-mining, machine-learning, python, classification, scikit-learn

2021-10-01 20:35:11

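I am working on a binary classification task using the SVM implemented in Python's scikit-learn. The dataset has about 10,000 samples and 34 features.

After finding a good parameter set (using the RandomizedSearchCV class), I evaluated the model with cross-validation. The results look good.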

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

criteria_list = ["precision", "recall", "f1", "roc_auc"]
score_df = []   # formatted summary plus the raw per-fold scores
score_df2 = []  # formatted summary only
# Rebuild an SVC from the best hyperparameters found by the randomized search
clf = svm.SVC(**random_search_clf.best_estimator_.get_params())
for crit in criteria_list:
    scores = cross_val_score(clf, X, y, cv=3, scoring=crit)
    summary = "{} (±{})".format(np.round(np.mean(scores), 3), np.round(np.std(scores), 4))
    score_df.append([summary, scores])
    score_df2.append([summary])

pd.DataFrame(np.transpose(score_df2), columns=criteria_list, index=["SVM"])

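My question is whether it is possible to find out which features are effective for classifying the test data. I think this is related to sensitivity analysis, but searching Google for "sensitivity analysis + svm" or "sensitivity analysis + scikit learn" did not turn up any good answers.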

3 Answers

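Here is sample code for stmax's good suggestion above, modified to use a RandomForest and to match the sample size and number of features in the question. I hope it helps: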

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000,
                           n_features=34,
                           n_informative=10,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

forest = RandomForestClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(20,10))
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="g", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices,rotation=60)
plt.xlim([-1, X.shape[1]])
plt.show()

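How useful a feature is for classification can be measured by its information gain: the higher the gain, the better the feature separates the classes. I am not sure whether SVMs support this way of evaluating features, but you could look at decision tree classifiers. They compute the entropy of the labels before and after splitting on a feature, and the reduction in entropy is the information gain. From these values you can see which features are effective for classifying the data.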
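To make the entropy and information gain computation concrete, here is a minimal sketch in plain NumPy. The helper names `entropy` and `information_gain`, the toy arrays, and the threshold of 4.5 are all made up for illustration; this is not how scikit-learn implements its trees internally.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting `labels` on `feature <= threshold`."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy data (hypothetical): feature_a separates the classes, feature_b does not
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
feature_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
feature_b = np.array([1.0, 5.0, 2.0, 6.0, 3.0, 7.0, 4.0, 8.0])

gain_a = information_gain(feature_a, y_toy, threshold=4.5)
gain_b = information_gain(feature_b, y_toy, threshold=4.5)
print("gain for feature_a:", gain_a)
print("gain for feature_b:", gain_b)
```

In practice you would probably not hand-roll this. scikit-learn's DecisionTreeClassifier accepts criterion="entropy", and its feature_importances_ attribute then summarizes the entropy reductions each feature contributed across the tree.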

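You are looking for model introspection, in other words, ways to make your model interpretable. There are many techniques for this (for background, see this book by C. Molnar), and many of them are implemented in scikit-learn. I would start with permutation importance, which estimates how much predictive power is lost when a given feature is made meaningless by shuffling its values.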
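As a rough sketch of what that could look like for an SVM, the example below uses sklearn.inspection.permutation_importance (available since scikit-learn 0.22). The synthetic data, the train/test split, the scaling pipeline, and the choice of roc_auc as the score are assumptions made for illustration, not details from the question.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the question's data (fewer samples to keep the run short)
X, y = make_classification(n_samples=1000, n_features=34, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# Default SVC hyperparameters; substitute the ones found by your search
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Shuffle each feature column on held-out data and measure the drop in ROC AUC
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=5, random_state=0)

# Rank features from largest to smallest mean score drop
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking[:10]:
    print("feature %d: %.4f (±%.4f)" % (
        idx, result.importances_mean[idx], result.importances_std[idx]))
```

Because the importance is computed from the fitted model's predictions, this works for any estimator, including kernel SVMs that expose no coefficients. Computing it on held-out data rather than the training set shows which features the model relies on to generalize.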
