数据挖掘 - 如何确定哪些功能最重要？ - 吾爱随笔录

如何确定哪些功能最重要？

数据挖掘机器学习 scikit-学习特征选择机器学习模型

2022-02-18 07:47:03

我有一个包含贷款搜索结果的大型数据集。有人会输入他们的详细信息，例如收入等，结果将包括来自不同公司和不同贷款类型的一堆贷款（因此每家公司可以有超过 1 笔贷款）。

数据集由每个唯一的搜索和所有相应的结果组成。我还有一列显示用户每次搜索最后选择了哪些贷款。我正在寻找贷款的哪些功能对用户最重要，即尝试根据他/她的输入来预测用户将选择什么贷款。

我可以为此使用什么 ML 模型？我不确定如何解决这个问题。

3个回答

我在这里看到了几个很好的答案！对于这样的事情，我倾向于主成分分析（下面的示例代码）和特征选择（下面的示例代码）。我们不要将特征选择与特征工程（数据清洗和预处理、One-Hot-Encoding、缩放、标准化、规范化等）混为一谈

主成分分析：PCA 是一种特征提取技术——因此它以特定方式组合我们的输入变量，然后我们可以删除“最不重要”的变量，同时仍保留所有变量中最有价值的部分！另外一个好处是，PCA 之后的每个“新”变量都是相互独立的。这是一个好处，因为线性模型的假设要求我们的自变量相互独立。

这是主成分分析如何工作的一个很好的例子。

import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']# Separating out the features
x = df.loc[:, features].values# Separating out the target
y = df.loc[:,['target']].values# Standardizing the features
x = StandardScaler().fit_transform(x)


from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])


finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
finalDf

结果：

     principal component 1  principal component 2          target
0                -2.264542               0.505704     Iris-setosa
1                -2.086426              -0.655405     Iris-setosa
2                -2.367950              -0.318477     Iris-setosa
3                -2.304197              -0.575368     Iris-setosa
4                -2.388777               0.674767     Iris-setosa
..                     ...                    ...             ...
145               1.870522               0.382822  Iris-virginica
146               1.558492              -0.905314  Iris-virginica
147               1.520845               0.266795  Iris-virginica
148               1.376391               1.016362  Iris-virginica
149               0.959299              -0.022284  Iris-virginica

继续...

# visualize results
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
               , finalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

参考：

使用 Python 的 PCA (scikit-learn) | 迈向数据科学

特征选择：数据集的特征仅仅意味着一列。当我们获得任何数据集时，不一定每一列（特征）都会对输出变量产生影响。如果我们在模型中添加这些不相关的特征，只会让模型变得最差（Garbage In Garbage Out）。这就产生了进行特征选择的需要。

对于特征选择练习，我非常喜欢这个例子。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline


df = pd.read_csv("https://rodeo-tutorials.s3.amazonaws.com/data/credit-data-trainingset.csv")
df.head()


from sklearn.ensemble import RandomForestClassifier

features = np.array(['revolving_utilization_of_unsecured_lines',
                     'age', 'number_of_time30-59_days_past_due_not_worse',
                     'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans', 
                     'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
                     'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])

# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)


padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

参考：

http://blog.yhat.com/tutorials/5-Feature-Engineering.html

一种常见的方法是使用主成分分析 (PCA) 并删除方差较小的方向。例如见这里：

使用主成分分析 (PCA) 进行特征选择

最新版本的 sklearn 允许使用所谓的置换重要性来估计任何估计器的特征重要性：

排列特征重要性

sklearn 中的随机森林还实现了其他方法来估计特征相关性：

森林中的特征重要性

清理数据并检查每个变量如何随输出变化。删除输出变量中方差较小的变量。

sklearn.feature_selection 包含SelectKBest、chi2、mutual_info_classif等多种方法来选择最佳特征。

https://scikit-learn.org/stable/modules/feature_selection.html

使用 PCA、前向阶段明智选择方法来获得与输出高度相关的变量。或者建立一个随机森林模型，得到每个特征的特征重要性值。保留具有高值的变量并丢弃剩余的变量。

其它你可能感兴趣的问题

上一篇LabelEncoding 使用 for 循环在 Dataframe 中选择的列下一篇如何开始理解音频和音乐分析