假设我是一名医生,我想知道哪些变量对预测乳腺癌最重要(二元分类)。两位不同的科学家分别向我展示了不同的特征重要性数字......
使用 L2 范数进行逻辑回归(模型系数的绝对值;显示 10 个最高值):

结果非常不同。我应该相信哪个科学家?这些数字中的一个/两个都没有意义吗?
代码如下;在 scikit-learn 中使用 Wisconsin Breast Cancer 数据集。
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
y = data.target
X = data.data
clf = LogisticRegressionCV(max_iter=3000)
clf.fit(X, y)
coefs = np.abs(clf.coef_[0])
indices = np.argsort(coefs)[::-1]
plt.figure()
plt.title("Feature importances (Logistic Regression)")
plt.bar(range(10), coefs[indices[:10]],
color="r", align="center")
plt.xticks(range(10), data.feature_names[indices[:10]], rotation=45, ha='right')
plt.subplots_adjust(bottom=0.3)
clf = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=400, max_depth=6, max_features=6) #has already been tuned
clf.fit(X, y)
coefs = clf.feature_importances_
indices = np.argsort(coefs)[::-1]
plt.figure()
plt.title("Feature importances (Random Forests)")
plt.bar(range(10), coefs[indices[:10]],
color="r", align="center")
plt.xticks(range(10), data.feature_names[indices[:10]], rotation=45, ha='right')
plt.subplots_adjust(bottom=0.3)
plt.ion(); plt.show()
