Data mining - LogisticRegressionCV picks bad hyperparameters on a simple 1D dataset, and the resulting scores are nonsensical - 吾爱随笔录


Data mining: scikit-learn, logistic-regression, regularization

2021-09-22 23:22:06

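I am trying to fit a logistic regression model to a simple one-dimensional dataset using LogisticRegressionCV. Strangely, when given a choice, it seems to pick a tiny value of C, which forces my model to choose a tiny theta, resulting in a useless model.

I tried looking at the scores the model provides, but they make no sense. For example, when I tell it to run 3-fold cross-validation over 5 candidate values of C, it gives me: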

{1: array([[0.47058824, 0.47058824, 0.47058824, 0.47058824, 0.47058824],
        [1.        , 1.        , 1.        , 1.        , 1.        ],
        [0.63636364, 0.63636364, 0.63636364, 0.63636364, 0.63636364]])}

The dataset is not linearly separable, yet the model claims it can reach 100% accuracy no matter which value of C I give it to try.

Example code below:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression

def gen_y(x):
    p1 = np.clip(x + 0.5, 0, 1)
    v = np.random.uniform(0, 1)
    if v < p1:
        return 1
    return 0

np.random.seed(6)
x_data = np.sort(np.random.normal(0, 0.3, 100))
y_data = np.array([gen_y(x) for x in x_data])

regularized_logistic_regression_model = LogisticRegressionCV(Cs = np.array([10**-8, 10**-4, 1, 10**4, 10**8]), fit_intercept = False, cv = 3)

regularized_logistic_regression_model.fit(x_data.reshape(-1, 1), y_data)

print(regularized_logistic_regression_model.C_) # yields 10^-8
print(regularized_logistic_regression_model.coef_) # yields incredibly tiny value
print(regularized_logistic_regression_model.scores_) # yields nonsensical scores

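2 Answers

First, I made a visualization of what your code is doing (see the code at the bottom).

The model seems to behave perfectly normally. The coefficient of the logistic regression is close to 0, which is consistent with how you created the data.

You are misinterpreting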

regularized_logistic_regression_model.scores_

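{1: array([[0.47058824, 0.47058824, 0.47058824, 0.47058824, 0.47058824],
        [1.        , 1.        , 1.        , 1.        , 1.        ],
        [0.63636364, 0.63636364, 0.63636364, 0.63636364, 0.63636364]])}

I quote the sklearn documentation:

scores_ : dict with classes as the keys, and the values as the grid of scores obtained during cross-validating each fold

Each row holds the scores for one of the folds, and each column corresponds to one value of C (note that the number of rows changes if you increase the cv parameter).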
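To make the layout of scores_ concrete, here is a minimal sketch that refits the question's model and prints the grid fold by fold. It assumes the legacy dictionary form of scores_ shown above (keyed by class label); the variable names Cs, model, and scores are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def gen_y(x):
    # Same label generator as in the question: P(y = 1) = clip(x + 0.5, 0, 1)
    return int(np.random.uniform(0, 1) < np.clip(x + 0.5, 0, 1))

np.random.seed(6)
x_data = np.sort(np.random.normal(0, 0.3, 100))
y_data = np.array([gen_y(x) for x in x_data])

Cs = [1e-8, 1e-4, 1, 1e4, 1e8]
model = LogisticRegressionCV(Cs=Cs, fit_intercept=False, cv=3)
model.fit(x_data.reshape(-1, 1), y_data)

scores = model.scores_[1]   # key 1 is the positive class label
print(scores.shape)         # (number of folds, number of C values)
for fold, row in enumerate(scores):
    print(f"fold {fold}: accuracy for each C = {row}")

# Averaging over the folds gives one cross-validated accuracy per C.
print("mean accuracy per C:", scores.mean(axis=0))
```

Reading the grid this way, a row of 1.0 means only that one validation fold was classified perfectly, not that the model reaches 100% accuracy overall.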

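The data you created is sorted! You can see this if you plot the values of X against the index.

You get these results because the data is sorted, not just by luck. If you shuffle the data, you will no longer get perfect predictions.

Your results are then around 0.7, which makes sense when you look at the image I attached.

I shuffled the data, not in the most elegant way, but you now get different results.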

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression
plt.style.use('seaborn-whitegrid')

def gen_y(x):
    p1 = np.clip(x + 0.5, 0, 1)
    v = np.random.uniform(0, 1)
    if v < p1:
        return 1
    return 0

np.random.seed(6)

x_data = np.sort(np.random.normal(0, 0.3, 100))
y_data = np.array([gen_y(x) for x in x_data])





df = pd.DataFrame(x_data.reshape(-1, 1),columns=['X'])
df['y']= y_data

plt.figure()
plt.title('Values of X vs Index')
df.X.plot()
plt.savefig('x')
plt.show()

df = df.sample(frac=1)


regularized_logistic_regression_model = LogisticRegressionCV( fit_intercept = False, cv = 3)

regularized_logistic_regression_model.fit(df['X'].values.reshape(-1, 1) , df['y'])




plt.figure()

df['results'] =regularized_logistic_regression_model.predict(df['X'].values.reshape(-1, 1))
plt.scatter(df[df['results']==0].y,df[df['results']==0].X,label='pred=0')
plt.scatter(df[df['results']==1].y,df[df['results']==1].X,label='pred=1')
plt.legend()
plt.hlines(y=regularized_logistic_regression_model.coef_.squeeze(),xmin=-0.1,xmax=1.1)
plt.savefig('ex')
plt.show()

regularized_logistic_regression_model.scores_

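Two basic mistakes:

No intercept
Sorted data without shuffling

Also: a rather small dataset.

The intercept significantly increases the expressive power of logistic regression, especially in problems with only one feature, as here. There is a reason its default setting is True: it is one of those defaults you had better not tamper with unless you know exactly what you are doing. The effect of omitting the intercept is easy to picture in a simple univariate case like this one: it forces the fitted line (here, the log-odds) to pass through the origin (0, 0), so the predicted probability at x = 0 is pinned to 0.5. That is a huge constraint.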
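A small NumPy-only sketch of that constraint, under stated assumptions: the seed, the sample, and the helper names predict_no_intercept and sigmoid are hypothetical and not part of the question. Without an intercept the log-odds are theta * x, so the predicted class depends only on the signs of theta and x, never on the magnitude of theta. That is consistent with the identical columns in the question's scores_: as long as the fitted theta keeps the same sign, changing C cannot change a single prediction.

```python
import numpy as np

rng = np.random.default_rng(0)   # hypothetical seed, for illustration only
x = rng.normal(0, 0.3, 100)

def predict_no_intercept(theta, x):
    # Without an intercept the log-odds are theta * x, so the predicted
    # class is 1 exactly when theta * x > 0.
    return (theta * x > 0).astype(int)

# Three positive values of theta spanning sixteen orders of magnitude.
preds = [predict_no_intercept(theta, x) for theta in (1e-8, 1.0, 1e8)]
same = all(np.array_equal(preds[0], p) for p in preds)
print(same)  # True: every positive theta yields the same labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# At x = 0 the log-odds are 0 for any theta, so the probability is 0.5.
p_at_zero = sigmoid(1e8 * 0.0)
print(p_at_zero)  # 0.5
```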

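Shuffling is especially important with artificial datasets like this one, where the values have been sorted at some point (as you have done here). The reason is that, while ML models can be very good at interpolation, they are very bad at extrapolation (predicting values outside their training range); with sorted data, each of your CV validation folds tries to predict on data that lie outside the range of its respective training folds (which, unsurprisingly, does not work well).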
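The effect of sorting on the folds can be inspected directly. The sketch below assumes, per the scikit-learn documentation, that an integer cv in LogisticRegressionCV means an unshuffled StratifiedKFold; the helper validation_ranges is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def gen_y(x):
    # Same label generator as in the question: P(y = 1) = clip(x + 0.5, 0, 1)
    return int(np.random.uniform(0, 1) < np.clip(x + 0.5, 0, 1))

np.random.seed(6)
x_data = np.sort(np.random.normal(0, 0.3, 100))
y_data = np.array([gen_y(x) for x in x_data])
X = x_data.reshape(-1, 1)

def validation_ranges(shuffle):
    # random_state may only be set when shuffle=True
    cv = StratifiedKFold(n_splits=3, shuffle=shuffle,
                         random_state=0 if shuffle else None)
    return [(x_data[test].min(), x_data[test].max(), x_data[test].mean())
            for _, test in cv.split(X, y_data)]

for shuffle in (False, True):
    print(f"shuffle={shuffle}")
    for fold, (lo, hi, mean) in enumerate(validation_ranges(shuffle)):
        print(f"  fold {fold}: validation x in [{lo:.2f}, {hi:.2f}], mean {mean:.2f}")
```

Without shuffling, the first validation fold takes the smallest x values of each class and the last fold the largest, so each fold is scored on a region of x that its training folds cover poorly.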

So, simply shuffling the data and fitting with fit_intercept = True, we get:

from sklearn.utils import shuffle

x_s, y_s = shuffle(x_data, y_data, random_state=0)

regularized_logistic_regression_model = LogisticRegressionCV(
    Cs = np.array([10**-8, 10**-4, 1, 10**4, 10**8]), fit_intercept = True, cv = 3)

regularized_logistic_regression_model.fit(x_s.reshape(-1, 1), y_s)

print(regularized_logistic_regression_model.C_) 
print(regularized_logistic_regression_model.coef_) 
print(regularized_logistic_regression_model.scores_)

Result:

[10000.]
[[4.57770177]]
{1: array([[0.61764706, 0.61764706, 0.70588235, 0.67647059, 0.67647059],
       [0.60606061, 0.60606061, 0.78787879, 0.84848485, 0.84848485],
       [0.60606061, 0.60606061, 0.57575758, 0.72727273, 0.72727273]])}

These are already far more sensible than the ones you report.

Adding more data (300 samples instead of 100) gives

[1.]
[[3.57243675]]
{1: array([[0.52, 0.52, 0.72, 0.72, 0.72],
       [0.52, 0.52, 0.68, 0.67, 0.67],
       [0.51, 0.51, 0.67, 0.66, 0.66]])}

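Final note: although shuffling is in general a highly recommended practice, here (with data that are artificial and random by definition) you can avoid it if you leave the initial data as they are (i.e. without sorting them):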

x_data = np.random.normal(0, 0.3, 100) # no sorting

I will leave verifying this as an exercise.
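One way to carry out that check, as a sketch only: it reuses the question's generator and the grid of C values from above, but skips np.sort. No expected output is given here, since the numbers depend on the scikit-learn version and the random stream.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def gen_y(x):
    # Same label generator as in the question: P(y = 1) = clip(x + 0.5, 0, 1)
    return int(np.random.uniform(0, 1) < np.clip(x + 0.5, 0, 1))

np.random.seed(6)
x_data = np.random.normal(0, 0.3, 100)   # no sorting, no explicit shuffle
y_data = np.array([gen_y(x) for x in x_data])

Cs = [1e-8, 1e-4, 1, 1e4, 1e8]
model = LogisticRegressionCV(Cs=Cs, fit_intercept=True, cv=3)
model.fit(x_data.reshape(-1, 1), y_data)

print(model.C_)          # selected C
print(model.coef_)       # fitted slope
print(model.scores_[1])  # rows: folds, columns: values of C
```

If the claim holds, no fold should show the perfect 1.0 row seen with sorted data, and the scores should vary across the columns for the different values of C.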
