Feature selection using feature importances in random forests with scikit-learn

data-mining feature-selection random-forest scikit-learn

2021-10-02 03:23:46

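I have plotted the feature importances of a random forest with scikit-learn. How can I use the information in the plot to remove features and so improve the random forest's predictions? That is, how can I tell from the plot whether a feature is useless or, even worse, degrades the random forest's performance? The plot is based on the attribute feature_importances_, and I use the classifier sklearn.ensemble.RandomForestClassifier.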

I am aware that other feature selection techniques exist, but in this question I want to focus on how to use feature_importances_.

Example of such a feature importance plot (image not included):

1 Answer

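You can simply use the feature_importances_ attribute to select the features with the highest importance scores. For example, the following function selects the K best features of a NumPy array X according to the importances of an already fitted model.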

def selectKImportance(model, X, k=5):
    # Column indices sorted by descending importance, truncated to the first k.
    top_k = model.feature_importances_.argsort()[::-1][:k]
    return X[:, top_k]

Or, if you are using a pipeline, the following class does the same job as a transformer:

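from sklearn.base import BaseEstimator, TransformerMixin

class ImportanceSelect(BaseEstimator, TransformerMixin):
    """Keep the n features ranked highest by the wrapped model's feature_importances_."""
    def __init__(self, model, n=1):
        self.model = model
        self.n = n
    def fit(self, *args, **kwargs):
        # Fit the wrapped model; its importances drive the selection.
        self.model.fit(*args, **kwargs)
        return self
    def transform(self, X):
        # Column indices sorted by descending importance, truncated to the first n.
        return X[:, self.model.feature_importances_.argsort()[::-1][:self.n]]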

For example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> iris = load_iris()
>>> X = iris.data
>>> y = iris.target
>>> 
>>> model = RandomForestClassifier()
>>> model.fit(X,y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
>>> 
>>> newX = selectKImportance(model,X,2)
>>> newX.shape
(150, 2)
>>> X.shape
(150, 4)
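
Whether dropping features actually helps is an empirical question, and one way to check it is to compare cross-validated scores with and without the selection step. The following is only a sketch under some assumptions: the transformer from above is repeated so the snippet runs on its own, and the step names ("select", "clf"), n_estimators=100, random_state=0 and cv=5 are illustrative choices rather than recommendations. Putting the selection inside the pipeline means the importances are recomputed on each training fold, so the held-out data does not influence which features are kept.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class ImportanceSelect(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the snippet runs standalone."""

    def __init__(self, model, n=1):
        self.model = model
        self.n = n

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        top_n = self.model.feature_importances_.argsort()[::-1][:self.n]
        return X[:, top_n]


X, y = load_iris(return_X_y=True)

# Baseline: a random forest trained on all four features.
baseline = RandomForestClassifier(n_estimators=100, random_state=0)

# Candidate: keep the 2 most important features, then classify.
reduced = Pipeline([
    ("select", ImportanceSelect(
        RandomForestClassifier(n_estimators=100, random_state=0), n=2)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

baseline_scores = cross_val_score(baseline, X, y, cv=5)
reduced_scores = cross_val_score(reduced, X, y, cv=5)

print("all features:   %.3f" % baseline_scores.mean())
print("top 2 features: %.3f" % reduced_scores.mean())
```

If the reduced pipeline scores about as well as the baseline, the dropped features were probably not contributing much. A clear drop suggests that too many features were removed.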

Obviously, if you want to select on some criterion other than "top k features", you can adapt the functions accordingly.
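
As one possible alternative criterion, you could keep every feature whose importance is at least the mean importance. The sketch below shows a hand-written version next to scikit-learn's built-in SelectFromModel, which supports this kind of threshold directly. The choice of the mean as cut-off, n_estimators=100 and random_state=0 are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Hand-rolled variant: boolean mask of features at or above the mean importance.
importances = model.feature_importances_
mask = importances >= importances.mean()
X_manual = X[:, mask]

# Built-in variant: SelectFromModel wrapping the already fitted estimator.
selector = SelectFromModel(model, threshold="mean", prefit=True)
X_builtin = selector.transform(X)

print(X_manual.shape, X_builtin.shape)
```

SelectFromModel also accepts a numeric threshold or "median", and it can be used as a pipeline step (with prefit=False) in place of a custom transformer.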
