假设我有一个标记有健康和不健康两个类别的数据集,并且我在数据集上应用了特征选择(特征重要性)。
我如何知道这些功能对特定类别(健康或不健康)是否重要?
假设我有一个标记有健康和不健康两个类别的数据集,并且我在数据集上应用了特征选择(特征重要性)。
我如何知道这些功能对特定类别(健康或不健康)是否重要?
像这样的东西应该让你去。
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("https://rodeo-tutorials.s3.amazonaws.com/data/credit-data-trainingset.csv")
df.head()
from sklearn.ensemble import RandomForestClassifier
features = np.array(['revolving_utilization_of_unsecured_lines',
'age', 'number_of_time30-59_days_past_due_not_worse',
'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans',
'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()