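Dataset summary: bank loans (a classification problem)
Problem summary:
- I am exploring ways to streamline the EDA (exploratory data analysis) process to find the best-fitting variables
- I came across SelectKBest in the scikit-learn package
- The implementation went smoothly, except that some of the variables it returned are obviously not good predictors (such as the primary keys of the dataset)
- Is there something wrong with my implementation, or is the package supposed to behave this way?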
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import LabelEncoder
# My internal code to read the data file
from src.api.data import LoadDataStore
# Prepping data
raw = LoadDataStore.get_raw()
x_raw = raw.drop(["default_ind", "issue_d"], axis=1)
y_raw = raw[["default_ind"]].values.ravel()
# NA and Encoding
for num_var in x_raw.select_dtypes(include=[numpy.float64]).columns.values:
    x_raw[num_var] = x_raw[num_var].fillna(-1)
encoder = LabelEncoder()
# Use the builtin `object`; the `numpy.object` alias was removed in NumPy 1.24
for cat_var in x_raw.select_dtypes(include=[object]).columns.values:
    x_raw[cat_var] = x_raw[cat_var].fillna("NA")
    x_raw[cat_var] = encoder.fit_transform(x_raw[cat_var])
# Main Part of this problem
test = SelectKBest(score_func=f_classif, k=15)
fit = test.fit(x_raw, y_raw)
ok_var = []
not_var = []
for flag, var in zip(fit.get_support(), x_raw.columns.values):
    if flag:
        ok_var.append(var)
    else:
        not_var.append(var)
ok_var
['id', 'member_id', 'int_rate', 'grade', 'sub_grade', 'desc', 'title', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'next_pymnt_d']
not_var
['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'installment', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m']
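The boolean mask from get_support() only says which columns made the cut, not by how much. Looking at the fitted selector's scores_ and pvalues_ can make it easier to see why a column such as id was picked. Below is a minimal sketch on synthetic data; the column names "signal" and "noise" and the data itself are made up for illustration and are not from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# Hypothetical toy features: one related to the target, one pure noise.
X = pd.DataFrame({
    "signal": y + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# scores_ and pvalues_ are aligned with the order of the input columns.
ranking = pd.DataFrame({
    "feature": X.columns,
    "f_score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("f_score", ascending=False)
print(ranking)
```

Applied to the question's code, the same table could presumably be built from fit.scores_ and x_raw.columns.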
Clearly, id and member_id should not be in the list of best features! Any idea what I am doing wrong?
Edit: I did some more digging, and @Icrmorin's reply is correct. (It is a Kaggle dataset, so I am not sure why.) But here is the box plot of id.
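For readers who want to run the same kind of check, here is a rough sketch of how an identifier column can end up with a high ANOVA F-score. It rests on two assumptions that are purely illustrative and not confirmed for this dataset: that ids are assigned in chronological order, and that the default rate drifts over time. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)
n = 5000

# Assumption: ids are assigned in chronological order.
ids = np.arange(n)

# Assumption: default probability drifts from 30% (oldest) to 5% (newest).
p_default = np.linspace(0.30, 0.05, n)
default_ind = (rng.random(n) < p_default).astype(int)

df = pd.DataFrame({
    "id": ids,
    "random_feature": rng.normal(size=n),
    "default_ind": default_ind,
})

# Per-class summary of id: the same quartiles a box plot would draw.
id_summary = df.groupby("default_ind")["id"].describe()
print(id_summary)

# f_classif returns one F-score and one p-value per column.
f_scores, p_values = f_classif(df[["id", "random_feature"]], df["default_ind"])
print(dict(zip(["id", "random_feature"], f_scores)))
```

If the per-class distributions of id differ like this, SelectKBest is behaving as designed: it measures a univariate statistical relationship, not whether a column is meaningful. In that case, one option is to drop known identifier columns (for example with x_raw.drop(columns=["id", "member_id"])) before fitting the selector.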
