Train-test split for extreme multi-label classification

data-mining, multilabel-classification

2022-02-22 13:48:55

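I have an extreme multi-label dataset with thousands of labels, each of which occurs at least 10 times.

What is the best way to split the data in a stratified manner?

I tried the iterative_train_test_split function from scikit-multilearn, but it did not work.

Sometimes the kernel crashes, and sometimes I get odd errors such as KeyError: 'key of type tuple not found and not a MultiIndex'.

In case it makes any difference, I am working on a Mac with an M1 processor.

Thanks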

2 Answers

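Problems usually arise when a label used for stratification occurs in only one data entry, because a single row cannot be shared between the train and test sets. So, before stratifying, drop all rows whose label is unique. You can use the collections.Counter class to find them. Once those rows are removed, I assume the data frame you are working with can be stratified easily, for example: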

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = iris.target
# these labels will not cause any problems
X['cat'] = np.random.choice(['label1','label2','label3','label4'],len(X))

# but these ones will, because they are unique
X.loc[37, 'cat'] = 'label5'
X.loc[137, 'cat'] = 'label6'

# this call raises a ValueError if uncommented, because label5 and label6 each occur only once
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
#                                                    random_state=43, stratify = X.cat)

# let's drop rows with unique labels
from collections import Counter
unique_labels = [lab for lab, count in Counter(X.cat).items() if count == 1]

print(f"unique labels to be dropped: {unique_labels}")

# drop rows with unique labels
X = X[~X.cat.isin(unique_labels)]
y = y[X.index]

# now datasets X and y can be used in train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=43, stratify = X.cat)
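
The example above stratifies on a single label per row. For the multi-label case in the question, where each row carries a set of labels, a rough sketch of a greedy split is given below. It is only loosely inspired by iterative stratification and is not the scikit-multilearn implementation; the function name greedy_multilabel_split, its parameters, and the toy data are all made up for illustration. It uses only the standard library and assumes each row's labels are given as a Python set.

```python
import random
from collections import Counter


def greedy_multilabel_split(label_sets, test_size=0.2, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) for rows of label sets.

    Rows are visited rarest-label-first; a row goes to the test split while
    its rarest label still needs test examples and the test budget allows it.
    """
    rng = random.Random(seed)
    n_rows = len(label_sets)

    # How often each label occurs, and how many test rows each label should get.
    label_counts = Counter(lab for labels in label_sets for lab in labels)
    test_quota = {lab: round(cnt * test_size) for lab, cnt in label_counts.items()}
    test_budget = round(n_rows * test_size)

    def rarest(i):
        # Rarest label of row i (ties broken by name), or None if unlabeled.
        return min(label_sets[i],
                   key=lambda lab: (label_counts[lab], str(lab)),
                   default=None)

    # Shuffle first, then stable-sort so rows with rare labels come first
    # and unlabeled rows come last.
    order = list(range(n_rows))
    rng.shuffle(order)
    order.sort(key=lambda i: label_counts[rarest(i)] if label_sets[i] else float("inf"))

    train_idx, test_idx = [], []
    for i in order:
        key_label = rarest(i)
        needs_test = key_label is None or test_quota[key_label] > 0
        if test_budget > 0 and needs_test:
            test_idx.append(i)
            test_budget -= 1
            for lab in label_sets[i]:
                test_quota[lab] -= 1  # every label on this row is now one closer to its quota
        else:
            train_idx.append(i)
    return sorted(train_idx), sorted(test_idx)


# Toy data (invented): label "a" on 50 rows, "b" on 20, "c" on 10, 30 rows unlabeled.
label_sets = []
for i in range(100):
    labels = set()
    if i % 2 == 0:
        labels.add("a")
    if i % 5 == 0:
        labels.add("b")
    if i % 10 == 3:
        labels.add("c")
    label_sets.append(labels)

train_idx, test_idx = greedy_multilabel_split(label_sets, test_size=0.2)
test_counts = Counter(lab for i in test_idx for lab in label_sets[i])
print(len(train_idx), len(test_idx), dict(test_counts))
```

On this toy data every label ends up with 20% of its rows in the test split. On real data with heavily overlapping labels the greedy rule gives no such guarantee, so it is worth checking the per-label counts of the result before relying on it.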

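I suspect that 1000 target labels will cause problems for most standard desktop ML algorithm implementations. Depending on your data and use case, some publicly available deep-learning networks may be able to handle this.

For a desktop solution, I suggest breaking your problem down into 1000 separate binary classification tasks, one per target label.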
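
A minimal sketch of that one-binary-task-per-label decomposition, assuming scikit-learn is available, is shown below. The synthetic dataset, the choice of LogisticRegression as the base model, and the use of 5 labels instead of 1000 are assumptions made only to keep the example small; the split here is a plain slice, not a stratified one.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: Y is a 0/1 indicator matrix, one column per label.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=5, random_state=0
)

# OneVsRestClassifier fits one independent binary classifier per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X[:240], Y[:240])

# Predictions come back as an indicator matrix with the same label columns.
Y_pred = clf.predict(X[240:])
print(Y_pred.shape, len(clf.estimators_))
```

Because each label gets its own model, the per-label fits are independent of one another, at the cost of ignoring correlations between labels.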
