LightGBM vs. sklearn's LightGBM - error in implementation - exact same parameters give different results

data-mining machine-learning python scikit-learn lightgbm
2022-02-28 19:43:58

I am getting different results when passing the exact same parameters to LightGBM's native API and to sklearn's implementation of LightGBM. Initially I got identical results from both, but after making some changes to my code they no longer match, and I don't know why: the performance metrics and the feature importances now come out differently. Please help me find the mistake; I can't work out what I did wrong. It could be an error in the way I use the native LightGBM library or in the way I use sklearn's implementation. Link explaining why we should get the same results.

import csv

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

x_train, x_test, y_train, y_test = train_test_split(df_dummy[df_merge.columns], labels, test_size=0.25, random_state=42)

n_folds = 5

lgb_train = lgb.Dataset(x_train, y_train)

def objective(params, n_folds = n_folds):
    """Objective function for Gradient Boosting Machine Hyperparameter Tuning"""

    print(params)

    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])

    params['min_child_samples'] = int(params['min_child_samples'])
    params['subsample_freq'] = int(params['subsample_freq'])

    # Perform n_fold cross validation with hyperparameters

    # Use early stopping and evaluate based on ROC AUC
    cv_results = lgb.cv(params, lgb_train, nfold=n_folds, num_boost_round=10000, 
                        early_stopping_rounds=100, metrics='auc')

    # Extract the best score
    best_score = max(cv_results['auc-mean'])

    # Loss must be minimized
    loss = 1 - best_score
    num_iteration = int(np.argmax(cv_results['auc-mean']) + 1)

    # Append this trial's result to the CSV log and close the file handle
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, num_iteration])
    of_connection.close()

    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK, 'estimators': num_iteration}

space = {
    'min_child_samples': hp.quniform('min_child_samples', 5, 100, 5), 
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'max_depth' : hp.quniform('max_depth', 3, 10, 1),
    'subsample' : hp.quniform('subsample', 0.6, 1, 0.05),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),  
    'subsample_freq': hp.quniform('subsample_freq',0,10,1),
    'min_gain_to_split': hp.quniform('min_gain_to_split', 0.01, 0.1, 0.01),


    'learning_rate' : 0.05,
    'objective' : 'binary',

}

out_file = 'results/gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'estimators'])
of_connection.close()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=10)
bayes_trials_results = sorted(trials.results, key = lambda x: x['loss'])

results = pd.read_csv('results/gbm_trials.csv')

# Sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending = True, inplace = True)
results.reset_index(inplace = True, drop = True)
results.head()
best_bayes_estimators = int(results.loc[0, 'estimators'])

best['max_depth'] = int(best['max_depth'])
best['num_leaves'] = int(best['num_leaves'])

best['min_child_samples'] = int(best['min_child_samples'])

num_boost_round=int(best_bayes_estimators * 1.1)
best['objective'] = 'binary'
best['boosting_type'] = 'gbdt'

best['subsample_freq'] = int(best['subsample_freq'])

#Actual LightGBM

best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)
    
#Sklearn's Implementation of LightGBM

best_sk = dict(best)
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round, learning_rate=0.05, min_split_gain=best['min_gain_to_split'])
sk_best_gbm.fit(x_train, y_train)

sk_best_gbm.get_params()
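With both models trained, a quick way to see whether they really agree is to compare their predictions and feature importances side by side. The snippet below is only an illustrative check, reusing the variable names from the code above and assuming both models were fit on the same x_train / y_train:

# Rough consistency check between the native booster and the sklearn wrapper
native_pred = best_gbm.predict(x_test)              # probabilities from lgb.train's Booster
sk_pred = sk_best_gbm.predict_proba(x_test)[:, 1]   # probabilities from LGBMClassifier

print('max prediction difference:', np.max(np.abs(native_pred - sk_pred)))
print('native feature importances :', best_gbm.feature_importance())
print('sklearn feature importances:', sk_best_gbm.feature_importances_)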
1 Answer

A kind person from GitHub was able to answer this question.

The answer was this: for the parameter-consistency problem, try creating a new lgb_train right before lgb.train, for example

lgb_train = lgb.Dataset(x_train, y_train)
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

lgb_train is lazily initialized and only initialized once, so it actually gets constructed during the cv step. Some of the parameters used there (such as min_child_samples) can modify lgb_train, so lgb_train may end up being built with different parameters than the ones passed to the final lgb.train call. (So it is better to use a new lgb_train in the cv part as well.)
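Following that advice, one way to keep lgb.cv from mutating a shared Dataset is to build a fresh lgb.Dataset inside the objective function as well (and again before the final lgb.train, as shown in the answer above). A minimal sketch of that change, reusing the names from the question (lgb_train_cv is a new name introduced here for illustration):

def objective(params, n_folds=n_folds):
    """Objective that builds its own Dataset so cv cannot mutate a shared one."""
    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])
    params['min_child_samples'] = int(params['min_child_samples'])
    params['subsample_freq'] = int(params['subsample_freq'])

    # A new Dataset per call: lgb.cv constructs (and may alter) this one,
    # leaving the Dataset used for the final lgb.train untouched.
    lgb_train_cv = lgb.Dataset(x_train, y_train)
    cv_results = lgb.cv(params, lgb_train_cv, nfold=n_folds, num_boost_round=10000,
                        early_stopping_rounds=100, metrics='auc')

    best_score = max(cv_results['auc-mean'])
    num_iteration = int(np.argmax(cv_results['auc-mean']) + 1)
    return {'loss': 1 - best_score, 'params': params, 'status': STATUS_OK, 'estimators': num_iteration}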