机器算法验证 - 使用 scikit-learn 进行多项式回归 - 吾爱随笔录

使用 scikit-learn 进行多项式回归

机器算法验证回归机器学习大数据多项式 scikit-学习

2022-02-09 14:29:43

我正在尝试使用 scikit-learn 进行多项式回归。从我读到的多项式回归是线性回归的一个特例。我在想，也许 scikit 的广义线性模型之一可以参数化以适应高阶多项式，但我看不出这样做的选择。

我确实设法使用带有多边形内核的支持向量回归器。这对我的数据子集很有效，但是要适应更大的数据集需要很长时间，所以我仍然需要更快地找到一些东西（即使交易一些精度）。

我在这里遗漏了一些明显的东西吗？

4个回答

理论

多项式回归是线性回归的一种特殊情况。主要思想是如何选择功能。查看具有 2 个变量的多元回归：x1和x2。线性回归将如下所示：y = a1 * x1 + a2 * x2.

现在你想要一个多项式回归（让我们做 2 次多项式）。我们将创建一些附加功能x1*x2：x1^2和x2^2。所以我们会得到你的“线性回归”：

y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2

这很好地展示了一个重要的概念维度灾难，因为随着多项式次数的增长，新特征的数量比线性增长快得多。

用 scikit-learn 练习

您不需要在 scikit 中完成所有这些操作。多项式回归已经可用（在0.15版本中。在此处查看如何更新它）。

from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

X = [[0.44, 0.68], [0.99, 0.23]]
vector = [109.85, 155.72]
predict= [[0.49, 0.18]]
#Edit: added second square bracket above to fix the ValueError problem

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
predict_ = poly.fit_transform(predict)

clf = linear_model.LinearRegression()
clf.fit(X_, vector)
print clf.predict(predict_)

给定数据，一个列向量，和的多项式来执行多项式回归。例如，考虑如果 $\mathbf{x}$ $\mathbf{y}$ $\mathbf{x}$

x = [\begin{matrix} 2 \\ - 1 \\ \frac{1}{3} \end{matrix}]

$\mathbf{x} = \begin{bmatrix} 2 \\[0.3em] -1 \\[0.3em] \frac{1}{3} \end{bmatrix}$

在线性回归中仅使用此向量意味着模型：

y = α_{1} x

$y = \alpha_1 x$

我们可以添加作为上述向量的幂的列，它们表示将多项式添加到回归中。下面我们展示了高达 3 次幂的多项式：

X = [\begin{matrix} 2 & 4 & 8 \\ - 1 & 1 & - 1 \\ \frac{1}{3} & \frac{1}{3^{2}} & \frac{1}{3^{3}} \end{matrix}]

$X = \begin{bmatrix} 2 & 4 & 8 \\[0.3em] -1 & 1 & -1 \\[0.3em] \frac{1}{3} & \frac{1}{3^2} & \frac{1}{3^3} \end{bmatrix}$

这是我们在 sklearn 的线性回归中使用的新数据矩阵，它代表了模型：

y = α_{1} x + α_{2} x^{2} + α_{3} x^{3}

$y = \alpha_1 x + \alpha_2x^2 + \alpha_3x^3$

请注意，我没有添加 1 的常量向量因为 sklearn 会自动包含它。 $1$

如果您使用的是多元回归而不仅仅是单变量回归，请不要忘记交叉项。例如，如果您有两个变量和，并且您希望多项式达到 2 次幂，则应使用，其中最后一项（）是我谈论。 $x_1$ $x_2$ $y = a_1x_1 + a_2x_2 + a_3x_1^2 + a_4x_2^2 + a_5x_1x_2$ $a_5x_1x_2$

使用与@Cam.Davidson.Pilon 类似的方法，我编写了几个函数来帮助在 Python 中演示这种方法。它可以通过在np.concatenate向量中添加更多项来扩展。的输出y_pred不会改变，但regr.coef_[0][2]需要包含获取系数。

from sklearn import linear_model
import numpy as np
from matplotlib import pyplot as plt
def regress_2nd_order(x, y):
    # The x values are transformed into the second order matrix
    X = np.concatenate([x, x**2], axis=1)
    # Use model to fit to the data 
    regr.fit(X, y)
    # Extract the values of interest for forming the equation...
    # y_pred = c0 + c1 * x + c2 * x^2 
    c0 = round(regr.intercept_[0],2)
    c1 = round(regr.coef_[0][0],2)
    c2 = round(regr.coef_[0][1],4)
    # c3 = round(regr.coef_[0][2],4)
    # Another way to get this is using the `regr.predict` function
    # Predict will calculate the values for you
    y_pred = regr.predict(X)
    # Print the intercept (c0) and the corresponding coefficients
    print('Scikit learn - \nEquation: %.2f + %.2f*T + %.5f*T^2' % (c0,c1,c2) )
    #  return y_pred, (c0, c1, c2, c3)
    return y_pred, (c0, c1, c2)
def socket_temp(x, y):
    # Include only the predicted y-vector
    y_pred = regress_2nd_order(x, y)[0]
    fig, ax = plt.subplots(1,1, figsize=(10,10))
    ax.plot(x, y_pred, ls='', marker='*', label='predicted')
    ax.plot(x, y, ls='', marker='.', label='iddt')
    ax.legend()

对于我的使用，我将温度作为y变量，将动态电源电流 (IDDT) 输出作为x变量。这是我用于x和y输出向量的值y_pred：

    'iddt'  'temp'  'y_pred'
0   627.23  20.0    20.42
1   627.38  19.9    20.52
2   612.39  9.7     10.32
3   612.48  9.7     10.38
4   597.73  -0.1    0.19
5   597.78  -0.2    0.22
6   583.24  -10.1   -9.99
7   583.24  -10.0   -9.99
8   569.22  -19.9   -19.99
9   568.97  -19.9   -20.17
10  555.62  -29.8   -29.83
11  555.99  -29.8   -29.56
12  541.15  -40.1   -40.45
13  541.50  -40.1   -40.19
14  639.88  30.3    28.89
15  640.11  30.2    29.05
16  656.61  40.2    39.92
17  656.62  40.3    39.92
18  672.73  50.4    50.33
19  672.68  50.4    50.30
20  689.56  60.6    61.01
21  689.26  60.6    60.82
22  705.30  70.8    70.79
23  705.22  70.8    70.75
24  722.27  81.0    81.14
25  722.24  81.0    81.12

其它你可能感兴趣的问题

上一篇McFadden 的伪解释R2R2 下一篇寻找一本好的和完整的概率统计书籍