Is it possible to do a stratified train-test split based on two columns?

data mining  python  scikit-learn  dataset  pandas

2021-10-09 19:24:13

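Consider a dataframe with two columns, text and label. I can create a stratified train-test split very easily using sklearn.model_selection.train_test_split. The only thing I have to do is pass the column I want to stratify on (in this case label).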
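For reference, the single-column case described above might look like the following minimal sketch. The toy data and the 80/20 label ratio are made up for illustration; only train_test_split and its stratify, test_size and random_state parameters come from scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 80% label 0 and 20% label 1
df = pd.DataFrame({
    "text": [f"comment {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

# Stratify on the label column so both parts keep the 80/20 ratio
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=0
)

# Relative label frequencies in each part
print(train_df["label"].value_counts(normalize=True))
print(test_df["label"].value_counts(normalize=True))
```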

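Now, consider a dataframe with three columns: text, subreddit, and label. I want to do a stratified train-test split on the label column, but I also want to make sure the split is not biased with respect to the subreddit column. For example, the test set might otherwise end up with a larger share of comments from subreddit X than the training set.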

How can I do this in Python?

1 Answer

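One option is to pass an array containing both variables to the stratify parameter, which also accepts multidimensional arrays. Here is the description from the scikit-learn documentation: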

stratify : array-like, default=None

If not None, data is split in a stratified fashion, using this as the class labels.

Here is an example:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

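# create dummy data with an unbalanced feature value distribution:
# after the reshape, the first 250 rows hold values from 0-2 and the
# last 250 rows hold values from 0-9, so subreddits 0-2 are over-represented
X = pd.DataFrame(
    np.concatenate((np.random.randint(0, 3, 500), np.random.randint(0, 10, 500)), axis=0).reshape((500, 2)),
    columns=["text", "subreddit"],
)
y = pd.DataFrame(np.random.randint(0, 2, 500).reshape((500, 1)), columns=["label"])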

# split stratified to target variable and subreddit col
X_train, X_test, y_train, y_test = train_test_split(
    X, pd.concat([X["subreddit"], y], axis=1), stratify=pd.concat([X["subreddit"], y], axis=1))

# remove subreddit cols from target variable arrays
y_train = y_train.drop(["subreddit"], axis=1)
y_test = y_test.drop(["subreddit"], axis=1)
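If the data already lives in a single dataframe, the same idea can probably be written without concatenating and dropping columns. The sketch below is a variation on the example above, not part of the original answer: the column values, sizes and seeds are invented, and it assumes each (subreddit, label) pair occurs often enough to be split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 comments, 5 subreddits, binary label
df = pd.DataFrame({
    "text": [f"comment {i}" for i in range(1000)],
    "subreddit": rng.integers(0, 5, 1000),
    "label": rng.integers(0, 2, 1000),
})

# Passing both columns makes every (subreddit, label) pair its own stratum
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df[["subreddit", "label"]], random_state=0
)

# Share of each (subreddit, label) pair in the two parts
train_share = train_df.groupby(["subreddit", "label"]).size() / len(train_df)
test_share = test_df.groupby(["subreddit", "label"]).size() / len(test_df)
print((train_share - test_share).abs().max())
```

Splitting the whole dataframe keeps text, subreddit and label together in train_df and test_df, so nothing has to be removed afterwards.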

As you can see, the split is also stratified by subreddit:

Share of training data per subreddit:

X_train.groupby("subreddit").count()/len(X_train)

gives

               text
subreddit
0          0.232000
1          0.232000
2          0.213333
3          0.034667
4          0.037333
5          0.045333
6          0.056000
7          0.056000
8          0.048000
9          0.045333

Share of test data per subreddit:

X_test.groupby("subreddit").count()/len(X_test)

gives shares that closely match those of the training set.
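To compare the two parts side by side rather than in separate outputs, a small helper along these lines may be convenient. compare_shares is a hypothetical name introduced here, and the skewed toy data is invented; the sketch only relies on pandas value_counts(normalize=True).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def compare_shares(train, test, column):
    """Return the relative frequency of each value of `column` in both parts."""
    return pd.DataFrame({
        "train": train[column].value_counts(normalize=True),
        "test": test[column].value_counts(normalize=True),
    }).sort_index()


rng = np.random.default_rng(1)

# Hypothetical data with a skewed subreddit distribution
df = pd.DataFrame({
    "subreddit": rng.choice([0, 1, 2, 3], size=800, p=[0.5, 0.3, 0.15, 0.05]),
    "label": rng.integers(0, 2, 800),
})

train_df, test_df = train_test_split(
    df, stratify=df[["subreddit", "label"]], random_state=1
)

shares = compare_shares(train_df, test_df, "subreddit")
print(shares)
```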

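Of course, this only works if you have enough data to stratify on subreddit and the target variable at the same time: every combination of subreddit and label has to occur at least twice. Otherwise scikit-learn raises a ValueError.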
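One way to see this limitation in practice is to count the members of each combination before splitting. In the sketch below the data is invented so that one pair occurs only once, and the fallback to stratifying on label alone is an assumption made for illustration, not something the answer prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data in which the pair ("c", 1) occurs only once
df = pd.DataFrame({
    "text": [f"comment {i}" for i in range(41)],
    "subreddit": ["a"] * 20 + ["b"] * 20 + ["c"],
    "label": ([0] * 10 + [1] * 10) * 2 + [1],
})

# Count the members of each (subreddit, label) combination
combo_counts = df.groupby(["subreddit", "label"]).size()
print(combo_counts[combo_counts < 2])

failed = False
try:
    train_df, test_df = train_test_split(
        df, test_size=0.25, stratify=df[["subreddit", "label"]], random_state=0
    )
except ValueError as err:
    failed = True
    print("two-column stratification failed:", err)
    # Assumed fallback: stratify on the label only
    train_df, test_df = train_test_split(
        df, test_size=0.25, stratify=df["label"], random_state=0
    )
```

Other ways of handling rare combinations, such as merging small subreddits into one group before splitting, are possible as well; which one fits depends on the data.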
