Credit scoring using scorecardpy with XGBoost

data-mining machine-learning python decision-trees xgboost scoring
2021-09-17 16:50:45

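I am using XGBoost for credit scoring. At first I thought I could simply use predict_proba to produce the score, but then I found scorecardpy, a WOE-based module for computing scorecard points. I tried to use it with my XGBoost model, but my ROC AUC dropped to 0.5 and I cannot see what I am doing wrong. Thanks for your help.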
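
For reference on the predict_proba route mentioned above, here is a minimal sketch of how a predicted probability of default can be rescaled to points with the usual "points to double the odds" convention. The function name prob_to_score and the parameter values (600 points at bad:good odds of 1:19, 50 points each time the odds halve) are assumptions for illustration and are not taken from the post.

```python
import math


def prob_to_score(p_bad, points0=600, odds0=1 / 19, pdo=50, eps=1e-12):
    """Map a predicted probability of the bad class to scorecard points.

    Assumed scaling: a score of points0 corresponds to bad:good odds of odds0,
    and the score rises by pdo points each time those odds halve.
    """
    # keep the probability strictly inside (0, 1) so the log is defined
    p = min(max(p_bad, eps), 1 - eps)
    factor = pdo / math.log(2)
    offset = points0 + factor * math.log(odds0)
    odds = p / (1 - p)
    return offset - factor * math.log(odds)


# hypothetical probabilities, standing in for XGB.predict_proba(xtest)[:, 1]
example_probs = [0.5, 0.05, 1 / 39]
example_scores = [prob_to_score(p) for p in example_probs]
```

The mapping is strictly monotonic, so it only changes the scale and direction (higher score means lower risk); the ranking of applicants produced by the model is preserved.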

import matplotlib.pyplot as plt
import pandas as pd
import scorecardpy as sc

data = pd.read_csv('data.csv')


train_index = data['date'] < '2018-04-01'
test_index = data['date'] >= '2018-04-01'

data_final = data.drop('date', axis=1)

df_train = data_final[train_index]
df_test = data_final[test_index]

data_final_vars = data_final.columns.values.tolist()
y=['label']
X=[i for i in data_final_vars if i not in y]


# woe binning ------
bins = sc.woebin(data_final, y="label")
sc.woebin_plot(bins)

# binning adjustment
# adjust breaks interactively
# breaks_adj = sc.woebin_adj(data_final, "label", bins)
# or specify breaks manually
breaks_adj = {
    'age': [26, 35, 40, 50, 60]
}
bins_adj = sc.woebin(data_final, y="label", breaks_list=breaks_adj)

# converting train and test into woe values
train_woe = sc.woebin_ply(df_train, bins_adj)
test_woe = sc.woebin_ply(df_test, bins_adj)


ytrain = train_woe.loc[:,'label']
xtrain = train_woe.loc[:, train_woe.columns != 'label']
ytest = test_woe.loc[:,'label']
xtest = test_woe.loc[:, test_woe.columns != 'label']

print("shape of xtrain: {}".format(xtrain.shape))
print("shape of xtest: {}".format(xtest.shape))

from xgboost import XGBClassifier

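# eval_metric is set on the estimator because passing it to fit() is deprecated
# in newer XGBoost releases; 'auc' matches the ROC AUC reported below, whereas
# 'rmse' is a regression metric
XGB = XGBClassifier(n_estimators=100, n_jobs=6, verbosity=1, eval_metric='auc')
# List the parameters in use.
print(XGB.get_xgb_params())

# Train and evaluate
XGB.fit(xtrain, ytrain, eval_set=[(xtrain, ytrain), (xtest, ytest)])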


# # Classifier

from sklearn.metrics import roc_auc_score

probs = XGB.predict_proba(xtest)
roc = roc_auc_score(y_true=ytest, y_score=probs[:, 1])
print("XGBoost ROC AUC score: {}".format(roc))


from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(ytest, probs[:,1])
plt.figure()
plt.plot(fpr, tpr, label='XGBoost Classifier (area = %0.2f)' % roc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('XGB_ROC')


from sklearn import model_selection
# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is set without shuffling
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
modelCV = XGB
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, xtrain, ytrain, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: {}".format(results.mean()))


# score ------
card = sc.scorecard(bins_adj, XGB, xtrain.columns)
# credit score
train_score = sc.scorecard_ply(df_train, card, print_step=0)
test_score = sc.scorecard_ply(df_test, card, print_step=0)

# psi
sc.perf_psi(
  score = {'train':train_score, 'test':test_score},
  label = {'train':ytrain, 'test':ytest}
)
1 Answer

This happened to me too, although I was using a logistic regression model rather than XGBoost.

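The problem is not the choice of model; it is the woebin_ply function. I have not read the source code, but the WOE values I got back did not match the WOE values of the bins the input values fall into (you can double-check your own results as well).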

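After manually matching the input values to their bins and the corresponding WOE values, my scorecard model performed similarly to my benchmark model.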
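
The manual matching described above can be sketched roughly as follows. This is an illustration in plain Python under stated assumptions, not the code actually used: the helper names fit_woe and apply_woe, the toy ages and labels, the left-closed bin convention, the sign convention WOE = ln(share of bads / share of goods), and the 0.5 smoothing constant are all hypothetical.

```python
import math
from bisect import bisect_right


def fit_woe(values, labels, breaks):
    """Compute one WOE value per bin from training data.

    Bins are (-inf, b0), [b0, b1), ..., [b_last, inf); label 1 = bad, 0 = good.
    """
    n_bins = len(breaks) + 1
    bad = [0] * n_bins
    good = [0] * n_bins
    for v, y in zip(values, labels):
        i = bisect_right(breaks, v)  # index of the bin that contains v
        if y == 1:
            bad[i] += 1
        else:
            good[i] += 1
    total_bad, total_good = sum(bad), sum(good)
    # 0.5 is an assumed smoothing count so that empty cells do not give log(0)
    return [
        math.log(((b or 0.5) / total_bad) / ((g or 0.5) / total_good))
        for b, g in zip(bad, good)
    ]


def apply_woe(values, breaks, woe):
    """Replace each raw value with the WOE of the bin it falls into."""
    return [woe[bisect_right(breaks, v)] for v in values]


# toy data standing in for one feature of the training set
breaks = [26, 35, 40, 50, 60]
ages = [22, 25, 30, 33, 38, 45, 52, 58, 61, 70]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

woe = fit_woe(ages, labels, breaks)          # fitted on training data only
ages_woe = apply_woe(ages, breaks, woe)      # transformed training feature
new_woe = apply_woe([26, 25.9], breaks, woe) # same lookup for unseen values
```

Comparing values produced this way with the corresponding `_woe` column returned by sc.woebin_ply, row by row, is one way to confirm whether the transformed data really carries the WOE of the right bin.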

Hope this helps!