Difference in partial dependence computed by R and Python

machine-learning r regression scikit-learn boosting
2022-04-12 16:38:22

I noticed a difference between the partial dependence computed by the R package gbm and by Python's scikit-learn.

Here is gbm's partial dependence of median house value on median income for the California housing dataset:

[plot: gbm partial dependence of median value on median income]

And here is scikit-learn's:

[plot: scikit-learn partial dependence of median value on median income]

It is easy to see that R's partial dependence ranges from 1.5 to 4.5, while scikit-learn's ranges from -0.5 to 1.5, yet the shapes of the two curves are nearly identical. I don't understand why this is.

The relevant code:

R

library(oem)
library(gbm)
data(calHousing)
X <- calHousing[ ,!(colnames(calHousing) == "medianValue")]
y <- calHousing$medianValue / 100000
gbm.model <- gbm.fit(X, y, distribution="gaussian", n.trees=100, interaction.depth=12, shrinkage=0.15)
plot(gbm.model, i.var="medianIncome")

The Python code is copy-pasted from the scikit-learn examples page:

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence
from sklearn.datasets.california_housing import fetch_california_housing


def main():
    cal_housing = fetch_california_housing()
    # import ipdb; ipdb.set_trace()
    # split 80/20 train-test
    X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                        cal_housing.target,
                                                        test_size=0.2,
                                                        random_state=1)
    names = cal_housing.feature_names

    print('_' * 80)
    print("Training GBRT...")
    clf = GradientBoostingRegressor(n_estimators=100, max_depth=12, min_samples_split=10,
                                    learning_rate=0.15, loss='ls', subsample=0.5,
                                    random_state=1)
    clf.fit(X_train, y_train)
    print("done.")

    print('_' * 80)
    print('Convenience plot with ``partial_dependence_plots``')
    print

    features = [0, 5, 1, 2, (5, 1)]
    fig, axs = plot_partial_dependence(clf, X_train, features,
                                       feature_names=names,
                                       n_jobs=3, grid_resolution=100)
    fig.suptitle('Partial dependence of house value on nonlocation features\n'
                 'for the California housing dataset')
    plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle
    plt.show()


# Needed on Windows because plot_partial_dependence uses multiprocessing
if __name__ == '__main__':
    main()
1 Answer

Scikit-learn centers the partial dependence at the mean of the target values; R does not.

Here is an example using the diabetes dataset.

R

data(diabetes, package="lars")

y        <- diabetes$y
x        <- diabetes$x
class(x) <- "matrix"
data     <- data.frame(y, as.data.frame(x))

model <- gbm::gbm(formula = y ~ . , data = data, distribution = "gaussian", 
                  shrinkage = 1, bag.fraction = 1, n.trees = 100,
                  interaction.depth = 2, verbose = T, keep.data = F)

partial <- plot.gbm(model, i.var = 1, return.grid = T)
plot(partial[, 2] - mean(y), type = "l")

[plot: R partial dependence, centered by subtracting mean(y)]

Python

import numpy as np
import sklearn
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.ensemble
from sklearn.ensemble.partial_dependence import partial_dependence

diabetes = sklearn.datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

gbm = sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=1,
                                                 max_leaf_nodes=3, min_samples_leaf=10,
                                                 n_estimators=100, verbose=True)
model_gbm = gbm.fit(X, y)

partial, axes = partial_dependence(gbrt=model_gbm, X=X, target_variables=[0])

plt.plot(partial.T)

[plot: scikit-learn partial dependence]
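To see that the two conventions differ only by a vertical shift, here is a small sketch on synthetic data. It computes a brute-force partial dependence curve (for each grid value, overwrite the feature column and average the model's predictions), then centers it by subtracting its mean: the level changes but the shape does not.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(400, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Brute-force partial dependence of feature 0: for each grid value v,
# set column 0 to v for every row and average the predictions.
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
raw = np.empty_like(grid)
for i, v in enumerate(grid):
    Xv = X.copy()
    Xv[:, 0] = v
    raw[i] = model.predict(Xv).mean()

centered = raw - raw.mean()  # the centered (scikit-learn style) curve

# The two curves differ by a constant, so their slopes are identical.
print(np.allclose(raw - centered, raw.mean()))
```

This is why the R and scikit-learn plots in the question have the same shape but different y-axis ranges: subtracting a constant offset shifts the curve without deforming it.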