数据挖掘 - 如何估计 scikit-learn 中回归变量的方差？ - 吾爱随笔录

如何估计 scikit-learn 中回归变量的方差？

数据挖掘 Python scikit-学习回归方差

2021-10-14 12:13:28

scikit-learn 中的每个分类器都有一个方法predict_proba(x)来预测x. 如何为回归者做同样的事情？

我知道如何估计预测方差的唯一回归器是高斯过程回归，对此我可以执行以下操作：

y_pred, sigma = gp.predict(x, return_std=True)

在一个维度上，我什至可以绘制高斯过程回归器对其预测不同数据点的信心程度

如何估计其他回归变量的预测方差？例如，对于核岭回归器、多层感知器、集成回归器？

3个回答

我相信这是模型的概率性质，可以让您获得预测的方差，或者更一般地定义为预测的不确定性，例如您提到的高斯过程。这不仅仅在标准回归器中可用。

如果您想估计模型的不确定性，我认为您应该查看像BayesianRidge这样的概率回归量。在scikit-learn中也可以使用一个实现，这也是这个基于 PyMC3 或直接通过PyMC3本身的不错的 python 包。在后者中有一些例子，比如 Jupyter Notebook 中的贝叶斯回归，有很好的解释。

原则上，贝叶斯模型不会返回对模型参数的单一估计，而是一个分布，它可以对新的观察结果进行推断，并检查我们在模型中的不确定性。您可能会发现这篇文章很有用。

注意：像在贝叶斯回归中所做的那样，在权重上添加一个正态先验，将最小二乘问题也转为正则化 L2 回归（请参阅此处的完整数学推导）。

更新答案：我完全忘记了计算机器学习算法置信区间的经典但简单而强大的Bootstrap Sampling方法。教科书的定义说：

Bootstrapping 是一种统计推断的非参数方法，它用计算代替了更传统的分布假设和渐近结果。多项优势：

引导程序非常通用，尽管在某些情况下它会失败。

因为它不需要分布假设（例如正态分布的误差），所以当数据表现不佳或样本量较小时，bootstrap 可以提供更准确的推断。

可以将 bootstrap 应用于具有难以推导的抽样分布的统计数据，即使是渐近的。

将引导程序应用于复杂的数据收集计划（例如分层和聚类样本）相对简单。

参考：福克斯，约翰。应用回归分析和广义线性模型。圣人出版物，2015 年。

请注意，您不需要具有概率性质的模型。看到这个帖子，或者这个答案或者这个。

我有一个想法，但我不确定它是否正确。请随时表达您对以下解决方案的任何意见或情绪。

分类和回归任务非常相似。例如，如果通过神经网络完成，那么用于回归的网络将与用于分类的相应网络仅在输出神经元的激活函数和损失函数方面有所不同。

思路是将回归任务的目标变量分箱，对分箱的标签进行分类，然后predict_proba用来得到预测值在一定区间内的概率。

初始回归任务的预测概率可以根据predict_proba相应分类的结果来估计。

这就是问题图片中显示的相同玩具问题的方法。任务是学习一维高斯函数

def gaussian(x, mu, sig):
    return np.exp(-np.square((x-mu)/sig)/2)

给定一些训练数据。

我在 Keras 中构建了以下神经网络：

网络同时进行分类和回归训练。它只在最后一层分裂。输入是一维的。隐藏层有 10 个神经元。回归的输出层是一个具有线性激活的神经元。分类的输出层有几个 softmax 神经元。它们的数量取决于用目标变量填充的箱数。

在这个玩具示例中，我有 6 个训练数据点：

# training data
x_train = np.atleast_2d([3.2487, -1.2235, -10.0, 10.0, -5.7789, 6.6834]).T
y_train = gaussian(x_train, mu, sig)

我将目标变量变化（0 到 1）的整个范围划分为 10 个箱。每个 bin 的宽度为 0.1。bin 的数量可以被认为是一个超参数。bin 越多，分类问题越接近相应的回归问题。但是太多的垃圾箱可能并不好。

# Binning the target variable
hist, bin_edges = np.histogram(y_train, bins=np.linspace(0, 1, 11))
y_c = np.digitize(y_train, bin_edges)
n_classes = len(np.unique(y_c))

# Binarize targets for classification
lb = LabelBinarizer()
y_b = lb.fit_transform(y_c)

训练数据分为三个 bin。您可以在图片中看到原因。每边较远的四个点（左侧两个，右侧两个）都在一个 bin 中，中间的其余每个点都在一个单独的 bin 中。剩下的七个垃圾箱是空的。因此，分类任务的输出层有 3 个 softmax 神经元。我对标签使用 1-hot 编码。

这是网络：

# NNet with one input and two outputs: one for reg, another for clf
main_input = Input(shape=(1,), dtype='float32', name='main_input')
hidden = Dense(10, input_dim=1, activation='tanh')(main_input)
reg_output = Dense(1, activation='linear', name='reg_output')(hidden)
clf_output = Dense(n_classes, activation='softmax', name='clf_output')(hidden)
model = Model(inputs=[main_input], outputs=[reg_output, clf_output])

不同的损失函数用于分类和回归。我还分配了不同的损失权重，可以将其视为另一个超参数。

model.compile(optimizer='adam',
              loss={'reg_output': 'mse', 'clf_output': 'categorical_crossentropy'},
              loss_weights={'reg_output': 1., 'clf_output': 0.2})

训练：

model.fit({'main_input': x_train},
          {'reg_output': y_train, 'clf_output': y_b},
          epochs=1000, verbose=0)

运行model.predict给出回归的预测和分类的预测概率。

# Prediction for both classification and regression 
y_pred, pred_proba_c = model.predict({'main_input': x})

数组的每一行都pred_proba_c包含将测试点放入三个类别之一的概率。predict_proba我通过取这三个概率中的最大值来估计回归的类似物。

# This is a regression's analogue of predict_proba 
r_pred_proba = np.max(pred_proba_c, axis=1)

这就是结果。预测概率显示在图片的下半部分。

直观地说，在有训练数据的地方概率很高，在训练数据之间的区域概率会降低。该模型对远离训练数据的预测变得不太确定。

预测概率的最大值并不完全在训练点。这可能是因为基础分类和回归问题之间没有精确的对应关系。它们相关但不相同，它们之间的关系取决于超参数的值和学习算法。例如，如果我改变损失权重，

model.compile(optimizer='adam',
              loss={'reg_output': 'mse', 'clf_output': 'categorical_crossentropy'},
              loss_weights={'reg_output': 1., 'clf_output': 1})

我得到以下图片：

现在预测概率值不同，但定性行为是相同的。

完整代码如下：

import numpy as np 
import matplotlib.pyplot as plt
from keras.models import Model
from keras.layers import Input, Dense
from sklearn.preprocessing import LabelBinarizer

np.random.seed(1)

x = np.atleast_2d(np.linspace(-10, 10, 200)).T

mu = 0
sig = 2

def gaussian(x, mu, sig):
    return np.exp(-np.square((x-mu)/sig)/2)

# training data
x_train = np.atleast_2d([3.2487, -1.2235, -10.0, 10.0, -5.7789, 6.6834]).T
y_train = gaussian(x_train, mu, sig)

# Binning the target variable
hist, bin_edges = np.histogram(y_train, bins=np.linspace(0, 1, 11))
y_c = np.digitize(y_train, bin_edges)
n_classes = len(np.unique(y_c))

# Binarize targets for classification
lb = LabelBinarizer()
y_b = lb.fit_transform(y_c)

# NNet with one input and two outputs: one for reg, another for clf
main_input = Input(shape=(1,), dtype='float32', name='main_input')
hidden = Dense(10, input_dim=1, activation='tanh')(main_input)
reg_output = Dense(1, activation='linear', name='reg_output')(hidden)
clf_output = Dense(n_classes, activation='softmax', name='clf_output')(hidden)
model = Model(inputs=[main_input], outputs=[reg_output, clf_output])

model.compile(optimizer='adam',
              loss={'reg_output': 'mse', 'clf_output': 'categorical_crossentropy'},
              loss_weights={'reg_output': 1., 'clf_output': 0.2})

model.fit({'main_input': x_train},
          {'reg_output': y_train, 'clf_output': y_b},
          epochs=1000, verbose=0)


# Prediction for both classification and regression 
y_pred, pred_proba_c = model.predict({'main_input': x})

# This is a regression's analogue of predict_proba 
r_pred_proba = np.max(pred_proba_c, axis=1)

f, ax = plt.subplots(2, sharex=True)
ax[0].plot(x, gaussian(x, mu, sig), color="red", label="ground truth")
ax[0].scatter(x_train, y_train, color='navy', s=30, marker='o', label="training data")
ax[0].plot(x, y_pred, 'b-', color="blue", label="prediction")
ax[0].legend(loc='best')
ax[0].grid()
ax[1].plot(x, r_pred_proba, color="navy", label="prediction probability")
ax[1].legend(loc='best')
ax[1].grid()
plt.show()

有一些论文使用 dropout 研究深度学习模型中的不确定性。例如看看

作为贝叶斯近似的辍学：在分类中使用贝叶斯神经网络表示深度学习中的模型不确定性和不确定性量化：在生物医学图像分割中的应用

据我了解，在预测时启用辍学允许运行一种蒙特卡罗模拟，因此您可以获得这些模拟的平均值。对于分类任务，这些模拟之后是通过利用输出2的离散性来估计模型的认知和任意性。但是，我不太清楚在回归的情况下这是如何工作的。但我想到的一个想法是，我们可以使用 Monte Carlo 运行中预测值的均值和标准差来估计置信区间

confidence interval = μ \pm t \times \frac{σ}{\sqrt{T}}

$\begin{equation} \text{confidence interval} = \mu \pm t \times \frac{\sigma}{\sqrt{T}} \end{equation}$ 在哪里

μ

$\mu$ 和

σ

$\sigma$ 是从蒙特卡洛运行中获得的平均值和标准偏差，

t

$t$ 是从 t 分布表和使用自由度和

T

$T$ 是蒙特卡罗模拟的次数。我不确定这是否是衡量不确定性的好方法，我想得到一些反馈，因为我正在处理类似的问题。

其它你可能感兴趣的问题

上一篇处理不平衡类的方法的分类下一篇如何在 backprop 中应用 softmax 的梯度