I have read that scaling/standardization is not needed when using gradient tree boosting (see e.g. https://stackoverflow.com/q/43359169/1551810 and https://github.com/dmlc/xgboost/issues/357).
And I believe I understand why, in principle, no normalization is needed when boosting regression trees.
Nevertheless, using xgboost for regression trees, I find that rescaling the target has a substantial effect on the (in-sample) error of the predictions. What is the reason for this?
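
For reference, here is a minimal sketch (not part of the original question) of the in-principle invariance for a single regression tree; sklearn's DecisionTreeRegressor and the toy data are just illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tree splits depend only on the ordering of feature values, so
# multiplying a feature by a positive constant leaves predictions unchanged.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(200)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
tree_scaled_X = DecisionTreeRegressor(random_state=0).fit(X * 1000, y)
assert np.allclose(tree.predict(X), tree_scaled_X.predict(X * 1000))

# Likewise, rescaling the target of a single tree simply rescales its
# predictions, so the fit is scale-free once the scaling is undone.
tree_scaled_y = DecisionTreeRegressor(random_state=0).fit(X, y / 1000)
assert np.allclose(tree.predict(X), tree_scaled_y.predict(X) * 1000)

The question is why this invariance does not seem to carry over to xgboost's boosted trees.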
An example on the Boston housing dataset:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

boston = load_boston()
y = boston['target']
X = boston['data']

# Rescale the target over 12 orders of magnitude and record the in-sample
# MSE (after undoing the scaling) for two objectives.
# Note: recent xgboost versions rename 'reg:linear' to 'reg:squarederror'.
scales = pd.Index(np.logspace(-6, 6), name='scale')
data = {'reg:linear': [], 'reg:gamma': []}
for objective in ['reg:linear', 'reg:gamma']:
    for scale in scales:
        xgb_model = xgb.XGBRegressor(objective=objective).fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        data[objective].append(mean_squared_error(y, y_predicted))

pd.DataFrame(data, index=scales).plot(loglog=True, grid=True).set(ylabel='MSE')
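
A reproducibility note, separate from the question itself: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If the import fails, the dataset can be fetched from the original source; the snippet below follows the loading code suggested in scikit-learn's deprecation message:

import numpy as np
import pandas as pd

# Fallback for scikit-learn >= 1.2, where load_boston no longer exists.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]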

