How do I do stepwise regression with sklearn?

data-mining machine-learning scikit-learn regression feature-selection linear-regression

2021-09-27 03:12:52

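I could not find a way to do stepwise regression in scikit-learn. I have checked all the other posts on Stack Exchange about this topic, and the answers to all of them suggest using f_regression.

But f_regression does not do stepwise regression. It only gives the F-score and p-value corresponding to each regressor, which is just the first step of stepwise regression.

What should I do after selecting the first regressor with the best F-score?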
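
For reference, here is a minimal sketch of what f_regression actually returns. The synthetic data, the random seed, and the variable names are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (assumption): column 0 has the strongest effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

# f_regression runs a separate univariate F-test for each column.
# It returns one F-score and one p-value per feature and selects nothing.
F, pvalues = f_regression(X, y)

# Picking the top-scoring column is only the first step of a stepwise procedure.
best = int(np.argmax(F))
print(F, pvalues, best)
```

Each score ignores the other columns, so ranking by these scores once is not the same as refitting the model after every inclusion.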

1 Answer

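Scikit-learn indeed does not support stepwise regression. That is because what is commonly called "stepwise regression" is an algorithm based on the p-values of linear regression coefficients, and scikit-learn deliberately avoids an inferential approach to model learning (significance testing and so on). Moreover, pure OLS is only one of many regression algorithms, and from the scikit-learn point of view it is neither very important nor one of the best.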

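However, for those who still need a good way to do feature selection with linear models, here are some suggestions:

1. Use inherently sparse models such as ElasticNet or Lasso.
2. Normalize your features with StandardScaler, and then rank them by the magnitude of model.coef_. For perfectly independent covariates this is equivalent to sorting by p-values. The class sklearn.feature_selection.RFE will do this for you, and RFECV will even evaluate the optimal number of features.
3. Use an implementation of forward selection by adjusted $R^2$ that works with statsmodels.
4. Do brute-force forward or backward selection to maximize your favorite metric under cross-validation (this can take time roughly quadratic in the number of covariates). The scikit-learn-compatible mlxtend package supports this approach for any estimator and any metric.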
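
The first, second, and fourth suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the variable names are made up, and the greedy cross-validated search uses scikit-learn's own SequentialFeatureSelector (available since scikit-learn 0.24) as a stand-in for mlxtend.

```python
import numpy as np
from sklearn.feature_selection import RFECV, SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only columns 0, 1 and 2 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2]
     + rng.normal(scale=0.5, size=300))

# Standardize so that coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(X)

# Sparse model: keep the features whose Lasso coefficient is non-zero.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
lasso_selected = np.flatnonzero(lasso.coef_)

# Recursive elimination by |coef_|; the number of features is chosen by CV.
rfecv = RFECV(LinearRegression(), cv=5, scoring="r2").fit(X_std, y)
rfe_selected = np.flatnonzero(rfecv.support_)

# Greedy forward selection that maximizes cross-validated R^2.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,  # assumption: the target size is known here
    direction="forward",
    scoring="r2",
    cv=5,
).fit(X_std, y)
sfs_selected = np.flatnonzero(sfs.get_support())

print(lasso_selected, rfe_selected, sfs_selected)
```

None of these relies on p-values, so the selected sets can differ from what a p-value-based stepwise procedure would return on the same data.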
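
If you still want vanilla stepwise regression, it is easier to base it on statsmodels, since that package calculates p-values for you. A basic forward-backward selection could look like this: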

```

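# NOTE: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
# With a newer scikit-learn, build X (a pandas.DataFrame) and y from another dataset.
from sklearn.datasets import load_boston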
import pandas as pd
import numpy as np
import statsmodels.api as sm

data = load_boston()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out = 0.05, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
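        # keep the column order of X so that runs are deterministic
        excluded = [col for col in X.columns if col not in included]
        # explicit dtype avoids the object-dtype default for an empty Series
        new_pval = pd.Series(index=excluded, dtype=float)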
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            # idxmin returns the column label; argmin returns an integer position
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # NaN if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            # idxmax returns the column label; argmax returns an integer position
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(X, y)

print('resulting features:')
print(result)

```

This example prints the following output:

```
Add  LSTAT                          with p-value 5.0811e-88
Add  RM                             with p-value 3.47226e-27
Add  PTRATIO                        with p-value 1.64466e-14
Add  DIS                            with p-value 1.66847e-05
Add  NOX                            with p-value 5.48815e-08
Add  CHAS                           with p-value 0.000265473
Add  B                              with p-value 0.000771946
Add  ZN                             with p-value 0.00465162
resulting features:
['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN']
```
