I noticed that the partial dependence plots produced by the R package gbm and by Python's scikit-learn differ.
Here is the partial dependence of median house value on median income for the California housing dataset, from gbm:
It is easy to see that R's partial dependence ranges from 1.5 to 4.5, while scikit-learn's ranges from -0.5 to 1.5, yet the shapes of the two curves are almost identical. I don't understand why that is.
The relevant code:
R
library(oem)
library(gbm)
data(calHousing)
X <- calHousing[ ,!(colnames(calHousing) == "medianValue")]
y <- calHousing$medianValue / 100000
gbm.model <- gbm.fit(X, y, distribution="gaussian", n.trees=100, interaction.depth=12, shrinkage=0.15)
plot(gbm.model, i.var="medianIncome")
The Python code is copied and pasted from scikit-learn's examples page.
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence
from sklearn.datasets.california_housing import fetch_california_housing


def main():
    cal_housing = fetch_california_housing()

    # split 80/20 train-test
    X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                        cal_housing.target,
                                                        test_size=0.2,
                                                        random_state=1)
    names = cal_housing.feature_names

    print('_' * 80)
    print("Training GBRT...")
    clf = GradientBoostingRegressor(n_estimators=100, max_depth=12,
                                    min_samples_split=10,
                                    learning_rate=0.15, loss='ls',
                                    subsample=0.5, random_state=1)
    clf.fit(X_train, y_train)
    print("done.")

    print('_' * 80)
    print('Convenience plot with ``partial_dependence_plots``')
    print()

    features = [0, 5, 1, 2, (5, 1)]
    fig, axs = plot_partial_dependence(clf, X_train, features,
                                       feature_names=names,
                                       n_jobs=3, grid_resolution=100)
    fig.suptitle('Partial dependence of house value on nonlocation features\n'
                 'for the California housing dataset')
    plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle

    plt.show()


# Needed on Windows because plot_partial_dependence uses multiprocessing
if __name__ == '__main__':
    main()
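One way to narrow down where the vertical offset comes from is to compute the partial dependence by brute force: clamp the feature of interest to each grid value, average the model's predictions, and compare that curve's level with what the library reports. The sketch below is illustrative only; it uses a small synthetic dataset as a hypothetical stand-in (so it runs without downloading the housing data), a modern scikit-learn import path, and an arbitrary grid size.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the housing data: y depends linearly on feature 0.
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                  learning_rate=0.15, random_state=1)
model.fit(X, y)

def brute_force_pd(model, X, feature, grid):
    """Average prediction with column `feature` clamped to each grid value."""
    pd_vals = []
    for v in grid:
        Xc = X.copy()
        Xc[:, feature] = v
        pd_vals.append(model.predict(Xc).mean())
    return np.array(pd_vals)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = brute_force_pd(model, X, 0, grid)

# This curve lives on the scale of y itself. If a library's curve has the
# same shape but sits at a different level, the discrepancy is a constant
# vertical offset (centering), not a difference in the fitted model.
print(pd_curve.min(), pd_curve.max())
```

If the brute-force curve matches one library's level, the other library is likely reporting a centered or intercept-free version of the same function.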



