Feature importance with text features

Tags: data-mining, machine-learning, python, scikit-learn, feature-selection, feature-importance

2022-02-17 22:24:56

I want to determine feature importance in several models:

Support vector machine
Logistic regression
Naive Bayes
Random forest

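I read that I need a model-agnostic method, so I am thinking of using permutation_importance (in Python). My features look like this:

Text (e.g., the pen is on the table, the sky is blue, ...)
Year (e.g., 2019, 2020, ...)
#_of_characters (e.g., 34, 67, ...): this value is derived from Text
Party (e.g., National, Local, Green, ...)
Over18 (e.g., 1, 0, ...): this is a boolean variable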

My target variable is Voted.

In the preprocessing stage I use BoW and TF-IDF for Text, OneHotEncoder for Party, and SimpleImputer for the numeric values. I then run the following:

from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()

plt.boxplot(result.importances[sorted_idx].T,
            vert=False, labels=X.columns[sorted_idx]);

I get output similar to the one below (I forgot to include Over18, but this is only meant to give an idea of the output):

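I have trouble interpreting the results, especially the circles and the negative values, but what I mainly want to understand is whether, for text classification, it makes sense to look at Text as a whole rather than at individual words (e.g., unigrams, bigrams, ...). In my example I have ['The','pen','is','on','table','sky','blue']. Would it make more sense to understand the contribution of each word to the model rather than that of Text, or should only Text be considered (with its many words that contribute to the model), which would make it the most important feature in the model?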

Update: for the different features I use the following preprocessors:

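from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')

numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])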

# CountVectorizer
text_preprocessing_cv =  Pipeline(steps=[
    ('CV',CountVectorizer())
]) 

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF',TfidfVectorizer())       
])

and then:

preprocessing_cv = ColumnTransformer(
    transformers=[
        ('text',text_preprocessing_cv, 'Text'),
        ('category', categorical_preprocessing, categorical_features),
        ('numeric', numeric_preprocessing, numerical_features)
], remainder='passthrough')

clf_nb = Pipeline(steps=[('preprocessor', preprocessing_cv),
                      ('classifier', MultinomialNB())])

1 Answer

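permutation_importance works on the top-level features, i.e. the columns of the data you pass to it. It permutes each column in turn and measures how much the score changes.
The encoding that happens inside the pipeline (i.e. OHE/TF-IDF) is therefore invisible to it.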
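As a minimal sketch of this behaviour, the snippet below fits a pipeline on a tiny made-up DataFrame (the rows, labels and column values are invented for illustration and are not the asker's data) and runs permutation_importance on the raw columns. Under these assumptions the result holds one importance per raw column, so all words in Text share a single score.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented purely for illustration
X = pd.DataFrame({
    "Text": ["pen is on the table", "sky is blue", "sun is bright",
             "pen is blue", "table is brown", "sky is bright"],
    "Party": ["National", "Local", "Green", "Local", "National", "Green"],
    "Year": [2019, 2020, 2019, 2020, 2019, 2020],
})
y = [1, 0, 1, 0, 1, 0]

preprocessor = ColumnTransformer(
    transformers=[
        ("text", CountVectorizer(), "Text"),
        ("category", OneHotEncoder(handle_unknown="ignore"), ["Party"]),
    ],
    remainder="passthrough",  # Year is passed through unchanged
)
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", MultinomialNB())])
clf.fit(X, y)

# The pipeline receives the raw columns, so each raw column is permuted as a whole
result = permutation_importance(clf, X, y, n_repeats=5, random_state=42)
raw_importances = dict(zip(X.columns, result.importances_mean))
print(raw_importances)  # one entry each for Text, Party and Year
```

The scores on such a small sample carry no meaning; the point is only the shape of the output, which has three entries regardless of how many columns the encoders produce internally.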

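To get the importance of the components of a top-level feature (for example, the individual words in Text), encode the data yourself first and then pass the encoded data to permutation_importance:

Get the preprocessed data using preprocessing_cv.fit_transform(X_train)
Call permutation_importance on that data with whichever model you choose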

Edit

Adding a snippet. I left out ColumnTransformer because it was causing some problems.

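import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

data = {'Number':[1,2,3], 'Text':['pen is table', 'sky is blue','Sun is kool'], 'Cat':['A','B', 'C']}
df = pd.DataFrame(data)

categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
numeric_preprocessing = SimpleImputer(strategy='mean')
text_preprocessing_cv =  CountVectorizer()

# Encode each column separately; .toarray() turns the sparse outputs into dense arrays
text_tfid = text_preprocessing_cv.fit_transform(df['Text']).toarray()
num = numeric_preprocessing.fit_transform(df['Number'].values.reshape(-1, 1))
cat = categorical_preprocessing.fit_transform(df['Cat'].values.reshape(-1, 1)).toarray()

data = np.concatenate((cat,num,text_tfid), axis=1)
# Column names must follow the same order as the concatenation above: cat, num, text.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
cols =  np.concatenate((categorical_preprocessing.get_feature_names_out(), ['Num'], text_preprocessing_cv.get_feature_names_out())) # New cols name

df = pd.DataFrame(data, columns=cols) # Encoded DataFrame with col name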

from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(df, [1,0,1])

result = permutation_importance(clf, df, [1,0,1], n_repeats=2, random_state=42)
sorted_idx = result.importances_mean.argsort()

plt.boxplot(result.importances[sorted_idx].T,
            vert=False, labels=df.columns[sorted_idx]);
