
How do I stratify data with sklearn train_test_split for multi-label classification?

Tags: data-mining, machine-learning, scikit-learn, multilabel-classification

2021-09-17 10:48:16

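I'm trying to replicate Ahmed Besbes's machine learning program, extended for multi-label classification. Every attempt to stratify the data returns the following error: The least populated class in y has only 1 member, which is too few. The minimum number of labels for any class cannot be less than 2.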

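My dataset has one column containing cleaned, tokenized text. The other 8 columns are classifications based on the content of that text. Note that columns 1-4 have significantly more samples than columns 5-8 (these are more obscure classifications derived from the text).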

Here is a generic example from my code:

from sklearn.model_selection import train_test_split

x = data['cleaned_text']
y = data[['car','truck','ford','chevy','black','white','parked', 'driving']]

x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.1,
                                                    random_state=42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

Output: (6293,) (700,) (6293, 8) (700, 8)

Adding stratify=y to train_test_split returns the error mentioned above. Even when I restrict y to a single column, I still get the error.

How can I stratify the data so that every class is fairly represented in the training set?

3 Answers

Try this:

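from skmultilearn.model_selection import iterative_train_test_split

# Assumption: the function indexes its inputs as 2-D arrays, so the pandas objects are converted first.
X_train, y_train, X_test, y_test = iterative_train_test_split(
    x.to_numpy().reshape(-1, 1), y.to_numpy(), test_size=0.1)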

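Because you are doing multi-label classification, it is very likely that some combinations of labels occur only once, and that is what triggers the sklearn error. You need a dedicated library for multi-label stratified splitting.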

More details on how to use skmultilearn are available in its documentation.
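To see whether unique label combinations are really the cause, it can help to count them. The following is a minimal sketch on a hypothetical toy label matrix (the data is invented; only the column names follow the question). It assumes that a 2-D y is stratified row by row, so each distinct combination of labels acts as one class.

```python
import pandas as pd

# Hypothetical toy label matrix; the column names follow the question.
y = pd.DataFrame(
    {
        "car":   [1, 1, 0, 0, 1, 1],
        "truck": [0, 0, 1, 1, 0, 0],
        "black": [1, 1, 0, 0, 0, 1],
        "white": [0, 0, 1, 1, 1, 0],
    }
)

# Join each row into one string key such as "1010", so every distinct
# combination of labels is treated as a single class.
combo_key = y.astype(str).agg("".join, axis=1)
combo_counts = combo_key.value_counts()

# A combination seen only once cannot appear in both train and test.
singletons = combo_counts[combo_counts < 2]
print(combo_counts)
print("combinations with a single row:", list(singletons.index))
```

On real data, a long list of single-row combinations suggests that stratifying on the full label matrix will keep failing until those rows are handled separately.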

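The error you are getting means a stratified split is impossible because one of your classes has only one sample. Each class needs at least two samples so that one can go into the training split and one into the test split. Check your class breakdown to find the culprit.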
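Once the breakdown is known, one possible workaround that stays within sklearn is to merge the rare label combinations into a shared bucket and stratify on that derived key. This is a sketch of a different technique from the ones proposed in the answers, not a replacement for true multi-label stratification. The helper name stratified_split_by_combo, the "rare" bucket label, and the toy data are all hypothetical. It will still fail if the merged bucket ends up with a single row.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split_by_combo(x, y, test_size=0.1, min_count=2, random_state=42):
    """Stratify on label combinations, pooling combinations rarer than min_count."""
    # One string key per row, e.g. "110" for car=1, truck=1, parked=0.
    key = y.astype(str).agg("".join, axis=1)
    # Number of rows sharing each row's combination.
    counts = key.map(key.value_counts())
    # Pool every too-rare combination into one shared bucket.
    key = key.where(counts >= min_count, "rare")
    return train_test_split(
        x, y, test_size=test_size, random_state=random_state, stratify=key
    )


# Hypothetical data: three frequent combinations and two that occur once.
rows = [[1, 0, 0]] * 50 + [[0, 1, 0]] * 40 + [[1, 1, 0]] * 8 + [[1, 0, 1]] + [[0, 1, 1]]
y = pd.DataFrame(rows, columns=["car", "truck", "parked"])
x = pd.Series([f"text {i}" for i in range(len(y))], name="cleaned_text")

x_train, x_test, y_train, y_test = stratified_split_by_combo(x, y)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

The trade-off is that rows in the pooled bucket are no longer balanced per label, so the rarest labels may still be unevenly split.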

There is a separate class for stratified splitting, and train_test_split is not what people usually recommend for this. It can be done as follows:

import numpy as np
from sklearn.model_selection import StratifiedKFold


train_all = []
evaluate_all = []
skf = StratifiedKFold(n_splits=cv_total, random_state=1234, shuffle=True)
for train_index, evaluate_index in skf.split(train_df.index.values, train_df.coverage_class):
    train_all.append(train_index)
    evaluate_all.append(evaluate_index)
    print(train_index.shape, evaluate_index.shape)  # fold sizes may differ slightly between folds; that is expected

# Build the train/validation arrays for one fold (cv_index is 1-based)
def get_cv_data(cv_index):
    train_index = train_all[cv_index-1]
    evaluate_index = evaluate_all[cv_index-1]
    x_train = np.array(train_df.images[train_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    y_train = np.array(train_df.masks[train_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    x_valid = np.array(train_df.images[evaluate_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    y_valid = np.array(train_df.masks[evaluate_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    return x_train,y_train,x_valid,y_valid

# Training loop
for cv_index in range(cv_total):
    x_train, y_train, x_valid, y_valid =  get_cv_data(cv_index+1)
    history = model.fit(x_train, y_train,
                        validation_data=(x_valid, y_valid),
                        epochs=epochs)

This is a simple snippet showing how to use StratifiedKFold in your code. Just replace the parameters and hyperparameters with your own.
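Since the snippet above depends on names defined elsewhere (train_df, upsample, model and so on), here is a smaller self-contained sketch of the same StratifiedKFold pattern on hypothetical toy data. Note that StratifiedKFold expects a single-label target (binary or multiclass), so for the 8-column y in the question it would have to be given one derived column rather than the full label matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical single-label target: 12 samples of class 0 and 6 of class 1.
y = np.array([0] * 12 + [1] * 6)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)

fold_class_counts = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many samples of each class landed in this validation fold.
    counts = np.bincount(y[valid_idx], minlength=2).tolist()
    fold_class_counts.append(counts)
    print(f"fold {fold}: train={len(train_idx)} valid={len(valid_idx)} "
          f"valid class counts={counts}")
```

Each validation fold here holds 4 samples of class 0 and 2 of class 1, which preserves the 2:1 class ratio of the full target.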
