Keeping track of the underlying observations when using GridSearchCV and make_scorer

Data Mining: python, scikit-learn, hyperparameter-tuning, grid-search

2022-03-13 18:39:13

I'm running a GridSearchCV, and I've defined a custom function (called custom_scorer below) to optimize for. The setup looks like this:

gs = GridSearchCV(estimator=some_classifier,
                  param_grid=some_grid,
                  cv=5,  # for concreteness
                  scoring=make_scorer(custom_scorer))

gs.fit(training_data, training_y)

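This is a binary classification problem. During the grid search, for each combination of hyperparameters, the custom score is computed on each of the 5 left-out folds after training on the other 4 folds.

custom_scorer is a scalar-valued function with 2 inputs: an array $y$ containing the ground truths (i.e. 0s and 1s) and an array $y_{pred}$ containing the predicted probabilities (of being 1, the "positive" class):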

def custom_scorer(y, y_pred):
    """
    (1) y contains ground truths, but only for the left-out fold
    (2) Similarly, y_pred contains predicted probabilities, but only for the left-out fold
    (3) So y, y_pred is each of length ~len(training_y)/5
    """

    scalar_value = ...  # placeholder: the actual metric is not given in the question
    return scalar_value

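But suppose the scalar_value returned by custom_scorer depends not only on $y$ and $y_{pred}$, but also on knowing which observations were assigned to the left-out fold. If all I have when custom_scorer is called is $y$ and $y_{pred}$ (again: the ground truths and predicted probabilities for the left-out fold, respectively), I have no way of knowing which rows belong to that fold. I need a way to keep track of which rows of training_data were assigned to the left-out fold at the time custom_scorer is called, e.g. the indices of those rows.

Any ideas on the simplest way to do this? Let me know if anything needs clarifying. Thanks!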

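1 Answer

First of all: this is a very clear, well-written question. Kudos!

I think the answer is to take the folding out of the CV and do it by hand. You can generate the indices of the training and test data with KFold().split() and iterate over them like this: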

from sklearn.model_selection import KFold, GridSearchCV
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

kf = KFold(n_splits=3)

for train_idx, test_idx in kf.split(iris):
    print(train_idx, test_idx)

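You will get three pairs of arrays: in each pair, the first array holds the indices of the training samples for that fold and the second holds the indices of its test samples. With that, you can cross-validate manually like this: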
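To make the shape of those index arrays concrete, here is a minimal sketch on six dummy rows. The `data` array is made up for illustration and is not part of the original answer; the only thing that matters to KFold is the number of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six dummy rows with two columns each; the values are irrelevant to the split.
data = np.arange(12).reshape(6, 2)

# shuffle=False is the default, so the test folds are contiguous blocks of rows.
kf = KFold(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf.split(data)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row lands in exactly one test fold, and the indices are positional, which is why the loop that follows uses .iloc to select rows.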

kf = KFold(n_splits=3)

x = iris.drop('species', axis=1)
y = iris.species

max_depths = [5, 10, 15]

scores = []

for i, max_depth in enumerate(max_depths):
    rfc = RandomForestClassifier(max_depth=max_depth)
    scores.append({'max_depth': max_depth, 'scores': []})
    for train_idx, test_idx in kf.split(iris):
        rfc.fit(x.iloc[train_idx], y.iloc[train_idx])
        # custom_scorer must accept the two index arrays as extra arguments here
        score = custom_scorer(y.iloc[test_idx], rfc.predict(x.iloc[test_idx]),
                              train_idx, test_idx)
        scores[i]['scores'].append(score)

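So this runs once for each value in max_depths, setting that parameter to the appropriate value in the RandomForestClassifier. It then fits 3 times, once per fold defined by KFold(), and passes several things to the call to custom_scorer():

y.iloc[test_idx]: this is our y_true
rfc.predict(x.iloc[test_idx]): this is our y_pred
train_idx: the indices of our training samples
test_idx: the indices of our test samples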
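The loop above calls custom_scorer with four arguments, but the question only defines a two-argument version. The following is a minimal, self-contained sketch of what an index-aware scorer could look like. Everything specific in it is an assumption for illustration: the per-row weights in ROW_WEIGHTS are made up, the weighted-accuracy metric is a stand-in for whatever the real score is, and the data comes from sklearn's bundled load_iris rather than the CSV. The folds are shuffled here because the iris rows are ordered by species, so unshuffled contiguous folds would test each model on a class it never saw during training.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Hypothetical per-row information that the score depends on (one weight per row).
rng = np.random.default_rng(0)
ROW_WEIGHTS = rng.uniform(0.5, 1.5, size=len(y))

def custom_scorer(y_true, y_pred, train_idx, test_idx):
    """Stand-in metric: accuracy weighted by the weights of the left-out rows."""
    weights = ROW_WEIGHTS[test_idx]  # look up row-level data via the fold indices
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    return float(np.sum(weights * correct) / np.sum(weights))

kf = KFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for max_depth in [5, 10, 15]:
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        rfc = RandomForestClassifier(max_depth=max_depth, random_state=0)
        rfc.fit(X[train_idx], y[train_idx])
        fold_scores.append(
            custom_scorer(y[test_idx], rfc.predict(X[test_idx]), train_idx, test_idx)
        )
    results[max_depth] = float(np.mean(fold_scores))  # mean score per candidate

best_depth = max(results, key=results.get)
print(results, best_depth)
```

Averaging the fold scores per candidate and taking the maximum mirrors what GridSearchCV does by default when it picks the best candidate.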

Hope that helps. Out of interest: why do you need to know which observations were left out?
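If you would rather keep GridSearchCV, one alternative worth trying (not part of the answer above) is to skip make_scorer and pass a callable with the signature scorer(estimator, X, y) as scoring, which scikit-learn accepts. When the training data is a pandas DataFrame, the X handed to the scorer is, in my experience, the left-out slice with its original index intact, so X.index identifies the rows of the fold. The sketch below assumes that behaviour; the names index_aware_scorer and seen_test_rows, the synthetic data, and the placeholder metric are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

seen_test_rows = []  # filled as a side effect, only to show what the scorer sees

def index_aware_scorer(estimator, X, y):
    """Callable scorer: receives the fitted estimator and the left-out fold."""
    fold_rows = X.index.to_numpy()            # row labels of the left-out fold
    seen_test_rows.append(fold_rows)
    proba = estimator.predict_proba(X)[:, 1]  # probability of the positive class
    y_true = np.asarray(y)
    # Placeholder metric: mean probability assigned to the true class.
    return float(np.mean(np.where(y_true == 1, proba, 1.0 - proba)))

# Synthetic binary problem with explicit row labels (assumed, for illustration).
X_arr, y_arr = make_classification(n_samples=100, n_features=5, random_state=0)
row_labels = [f"row_{i}" for i in range(len(y_arr))]
X = pd.DataFrame(X_arr, index=row_labels,
                 columns=[f"f{j}" for j in range(X_arr.shape[1])])
y = pd.Series(y_arr, index=row_labels)

gs = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=index_aware_scorer,
    n_jobs=None,  # run in-process so the side-effect list above is populated
)
gs.fit(X, y)
print(gs.best_params_, len(seen_test_rows))
```

Inside the scorer, fold_rows can be used to look up any row-level data kept alongside the training set. Calling predict_proba directly also gives the predicted probabilities the question asks for, independent of how make_scorer would be configured.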
