RandomForest and tree feature importances in scikit-learn

data-mining scikit-learn feature-selection random-forest
2021-10-03 17:40:37

In the code below, what is the difference between model.feature_importances_ and tree.feature_importances_?

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Boston Housing dataset
from sklearn.datasets import load_boston
boston = load_boston()
# Convert the 'sklearn.utils.Bunch' object to a pandas DataFrame (features + target)
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['MEDV'] = boston.target

# Create train and test sets
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

model = RandomForestRegressor()
model.fit(X_train, y_train)

As I understand it, the following gives the feature importances:

importance = model.feature_importances_
importance_df = pd.DataFrame(importance, index=X_train.columns,
                             columns=["Importance"])


        Importance
-----------------
CRIM    0.025993
ZN      0.002781
INDUS   0.004832
CHAS    0.000315
NOX     0.028655
RM      0.406285
AGE     0.017987
DIS     0.040696
RAD     0.003615
TAX     0.009281
PTRATIO 0.009103
B       0.012354
LSTAT   0.438106

And what is tree.feature_importances_?

[tree.feature_importances_ for tree in model.estimators_]

How do they differ, how are they calculated, and which is more important, 0.2 or 0.9? I could not find this in the documentation.

2 Answers

The random forest's feature_importances_ is computed as the mean of the feature importances over all trees in the forest, while tree.feature_importances_ is the feature importance of a single tree.

Since feature importance is calculated as a feature's contribution to maximizing the split criterion (or, equivalently, to minimizing the impurity of the child nodes), higher is better.
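
A quick way to check this relationship is to average the per-tree importances yourself and compare them with the forest attribute. A minimal sketch, reusing the model fitted in the question:

import numpy as np

# Stack the per-tree importances into an (n_trees, n_features) array
per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])

# The forest attribute is the (re-normalized) mean over the trees, so the two
# should agree for any ordinary fit where no tree is a single root node
print(np.allclose(per_tree.mean(axis=0), model.feature_importances_))  # True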


You can see how this works in the source code:

The random forest's feature_importances_ property is defined as follows (see here for the full source code):

    @property
    def feature_importances_(self):
        """
        Return the feature importances (the higher, the more important the
           feature).
        Returns
        -------
        feature_importances_ : array, shape = [n_features]
            The values of this array sum to 1, unless all trees are single node
            trees consisting of only the root node, in which case it will be an
            array of zeros.
        """
        check_is_fitted(self)

        all_importances = Parallel(n_jobs=self.n_jobs,
                                   **_joblib_parallel_args(prefer='threads'))(
            delayed(getattr)(tree, 'feature_importances_')
            for tree in self.estimators_ if tree.tree_.node_count > 1)

        if not all_importances:
            return np.zeros(self.n_features_, dtype=np.float64)

        all_importances = np.mean(all_importances,
                                  axis=0, dtype=np.float64)
        return all_importances / np.sum(all_importances)

As you can see, it returns the mean feature importance over all trees in the forest. Each tree's feature_importances_ is in turn implemented in the tree class as follows:

    @property
    def feature_importances_(self):
        """Return the feature importances.
        The importance of a feature is computed as the (normalized) total
        reduction of the criterion brought by that feature.
        It is also known as the Gini importance.
        Returns
        -------
        feature_importances_ : ndarray of shape (n_features,)
            Normalized total reduction of criteria by feature
            (Gini importance).
        """
        check_is_fitted(self)

        return self.tree_.compute_feature_importances()

compute_feature_importances is defined here:

    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        cdef Node* left
        cdef Node* right
        cdef Node* nodes = self.nodes
        cdef Node* node = nodes
        cdef Node* end_node = node + self.node_count

        cdef double normalizer = 0.

        cdef np.ndarray[np.float64_t, ndim=1] importances
        importances = np.zeros((self.n_features,))
        cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

        with nogil:
            while node != end_node:
                if node.left_child != _TREE_LEAF:
                    # ... and node.right_child != _TREE_LEAF:
                    left = &nodes[node.left_child]
                    right = &nodes[node.right_child]

                    importance_data[node.feature] += (
                        node.weighted_n_node_samples * node.impurity -
                        left.weighted_n_node_samples * left.impurity -
                        right.weighted_n_node_samples * right.impurity)
                node += 1

        importances /= nodes[0].weighted_n_node_samples

        if normalize:
            normalizer = np.sum(importances)

            if normalizer > 0.0:
                # Avoid dividing by zero (e.g., when root is pure)
                importances /= normalizer

        return importances
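
To make the Cython routine above more tangible, here is a rough pure-NumPy re-implementation of the same weighted impurity-decrease (MDI) computation, working directly on the fitted tree_ arrays (children_left, children_right, feature, impurity, weighted_n_node_samples). It is an illustrative sketch, not the library code:

import numpy as np

def mdi_importances(tree_, normalize=True):
    """Pure-NumPy sketch of Tree.compute_feature_importances (MDI)."""
    importances = np.zeros(tree_.n_features)
    for node in range(tree_.node_count):
        left = tree_.children_left[node]
        right = tree_.children_right[node]
        if left == -1:  # _TREE_LEAF == -1: leaf nodes contribute nothing
            continue
        # Credit the split feature with the weighted impurity decrease of this split
        importances[tree_.feature[node]] += (
            tree_.weighted_n_node_samples[node] * tree_.impurity[node]
            - tree_.weighted_n_node_samples[left] * tree_.impurity[left]
            - tree_.weighted_n_node_samples[right] * tree_.impurity[right]
        )
    importances /= tree_.weighted_n_node_samples[0]  # scale by the root's weighted sample count
    if normalize and importances.sum() > 0.0:
        importances /= importances.sum()
    return importances

# Should agree with the built-in attribute for any fitted tree in the forest
first_tree = model.estimators_[0]
print(np.allclose(mdi_importances(first_tree.tree_), first_tree.feature_importances_))  # True

In other words, each split credits its feature with the weighted impurity decrease it achieved, the totals are scaled by the root's weight, and (by default) normalized so the per-tree importances sum to 1.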

A random forest model is an ensemble of decision trees. tree.feature_importances_ gives the feature importances of a single tree, whereas model.feature_importances_ gives the feature importances of the whole forest. The documentation explains how they are computed:

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. In scikit-learn, the fraction of samples a feature contributes to is combined with the decrease in impurity from splitting them to create a normalized estimate of the predictive power of that feature.

By averaging the estimates of predictive ability over several randomized trees, one can reduce the variance of such an estimate and use it for feature selection. This is known as the mean decrease in impurity, or MDI. Refer to [L2014] for more information on MDI and feature importance evaluation with Random Forests.

The larger the number, the more important the feature.
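
To see the averaging effect the documentation describes, you can inspect how much the importances vary from tree to tree. A short sketch, assuming the model and X_train from the question:

import pandas as pd

# One row per tree, one column per feature
per_tree = pd.DataFrame(
    [tree.feature_importances_ for tree in model.estimators_],
    columns=X_train.columns,
)

# Individual trees disagree (non-zero std), but their mean matches the forest attribute
summary = pd.DataFrame({
    "forest": model.feature_importances_,
    "mean over trees": per_tree.mean(),
    "std over trees": per_tree.std(),
}, index=X_train.columns)
print(summary.sort_values("forest", ascending=False))

Individual trees can rank features quite differently, which is exactly why the forest reports the mean: the averaged estimate has lower variance than that of any single tree.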