feature_importances_
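A random forest computes the average feature importance over all trees in the forest, whereas tree.feature_importances_ is the feature importance of a single tree.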
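Because feature importance is computed as the feature's contribution to maximizing the split criterion (or, equivalently, to minimizing the impurity of the child nodes), a higher value means a more important feature.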
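To make the idea of a feature's contribution concrete, the sketch below computes the weighted impurity decrease of a single split using the Gini criterion. The helper names `gini` and `weighted_impurity_decrease` and the class counts are made up for illustration; they are not part of scikit-learn.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right):
    """Parent impurity minus child impurities, each weighted by sample count."""
    return (sum(parent) * gini(parent)
            - sum(left) * gini(left)
            - sum(right) * gini(right))


# Hypothetical node with 5 samples of each of two classes.
parent_counts = [5, 5]

# A split that separates the classes perfectly removes all impurity.
perfect = weighted_impurity_decrease(parent_counts, [5, 0], [0, 5])

# A split that barely changes the class mix removes very little.
poor = weighted_impurity_decrease(parent_counts, [3, 2], [2, 3])

print(perfect, poor)  # the perfect split scores far higher (5.0 vs. about 0.2)
```

The feature used in the first split would receive much more credit than the one used in the second, which is why a larger importance indicates a more useful feature.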
You can see how it works in the source code:
The random forest's feature_importances_ property is defined as follows (the full source code is here):
@property
def feature_importances_(self):
    """
    Return the feature importances (the higher, the more important the
    feature).

    Returns
    -------
    feature_importances_ : array, shape = [n_features]
        The values of this array sum to 1, unless all trees are single node
        trees consisting of only the root node, in which case it will be an
        array of zeros.
    """
    check_is_fitted(self)

    all_importances = Parallel(n_jobs=self.n_jobs,
                               **_joblib_parallel_args(prefer='threads'))(
        delayed(getattr)(tree, 'feature_importances_')
        for tree in self.estimators_ if tree.tree_.node_count > 1)

    if not all_importances:
        return np.zeros(self.n_features_, dtype=np.float64)

    all_importances = np.mean(all_importances,
                              axis=0, dtype=np.float64)
    return all_importances / np.sum(all_importances)
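As you can see, it returns the average feature importance over all trees in the forest, so it relies on the tree class. The tree class implements the property as follows: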
def feature_importances_(self):
    """Return the feature importances.

    The importance of a feature is computed as the (normalized) total
    reduction of the criterion brought by that feature.
    It is also known as the Gini importance.

    Returns
    -------
    feature_importances_ : ndarray of shape (n_features,)
        Normalized total reduction of criteria by feature
        (Gini importance).
    """
    check_is_fitted(self)

    return self.tree_.compute_feature_importances()
compute_feature_importances is defined here:
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances
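For readers who do not want to parse the Cython, here is a rough pure-Python sketch of the same two steps: the per-tree accumulation of weighted impurity decreases, and the averaging across trees. It is an illustration under assumptions, not the library's code. The tree is hand-built from plain lists, the function names are reused only for readability, and `forest_feature_importances` is a hypothetical helper that expects the caller to have already dropped single-node trees.

```python
TREE_LEAF = -1  # marker for "no child"; scikit-learn uses the same value


def compute_feature_importances(children_left, children_right, feature,
                                impurity, weighted_n_node_samples,
                                n_features, normalize=True):
    """Sum the weighted impurity decrease of every split, per feature."""
    importances = [0.0] * n_features
    for node, left in enumerate(children_left):
        if left == TREE_LEAF:
            continue  # leaves do not split, so they add nothing
        right = children_right[node]
        importances[feature[node]] += (
            weighted_n_node_samples[node] * impurity[node]
            - weighted_n_node_samples[left] * impurity[left]
            - weighted_n_node_samples[right] * impurity[right])

    # Scale by the (weighted) number of samples at the root.
    root_weight = weighted_n_node_samples[0]
    importances = [value / root_weight for value in importances]

    if normalize:
        total = sum(importances)
        if total > 0.0:  # avoid dividing by zero when the root is pure
            importances = [value / total for value in importances]
    return importances


def forest_feature_importances(per_tree_importances, n_features):
    """Average per-tree importances, then rescale so they sum to 1."""
    if not per_tree_importances:
        return [0.0] * n_features
    n_trees = len(per_tree_importances)
    mean = [sum(column) / n_trees for column in zip(*per_tree_importances)]
    total = sum(mean)
    return [value / total for value in mean]


# Hand-built tree with 5 nodes and 3 features (all values are illustrative).
# Node 0 splits on feature 0; node 2 splits on feature 2; the rest are leaves.
tree_a = compute_feature_importances(
    children_left=[1, TREE_LEAF, 3, TREE_LEAF, TREE_LEAF],
    children_right=[2, TREE_LEAF, 4, TREE_LEAF, TREE_LEAF],
    feature=[0, -2, 2, -2, -2],  # -2 marks "no feature" on leaves
    impurity=[0.625, 0.0, 0.5, 0.0, 0.0],
    weighted_n_node_samples=[8.0, 4.0, 4.0, 2.0, 2.0],
    n_features=3)

# A second, made-up importance vector standing in for another tree.
tree_b = [0.2, 0.8, 0.0]

forest = forest_feature_importances([tree_a, tree_b], n_features=3)
print(tree_a)
print(forest)
```

On a fitted scikit-learn tree, the corresponding arrays are exposed on the estimator's `tree_` attribute as `children_left`, `children_right`, `feature`, `impurity` and `weighted_n_node_samples`, so the same function can be used to cross-check `feature_importances_` on real data.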