Data mining - What should I do when my test and validation scores are good, but my submission score is terrible? - 吾爱随笔录

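I understand this is a very broad question, and if anyone thinks it is not appropriate here, that is completely fine with me. But not understanding this is driving me crazy...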

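Here is the situation: I am building a machine learning model to predict tweet topics, and I am taking part in this competition. This is what I did to make sure I was not overfitting: I set aside 10% of the training data, which I call the validation set, and used the remaining 90% to build my model. That 90% is in turn split into a training set and a test set. So basically I have two datasets to test my model on: the test set and the validation set.

All the results were great! Both the test set and the validation set gave me good results. I also ran a Stratified K-Fold, which showed good results as well. However, the submission set came back with 73% accuracy. What is going on? Why do I get good results on the test and validation sets, but not on the submission? Is there any explanation? Is there any data leakage happening here? I would find any leakage very strange, because the validation set was not used at all. But idk what is happening...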
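For reference, the Stratified K-Fold check mentioned above could look roughly like the following. This is only a minimal sketch on made-up toy data, not the code I actually ran: the toy tweets, the labels, `n_splits=3`, and the default `CountVectorizer` settings are all assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six short "tweets" per class
tweets = [
    "the match was great today", "what a goal in the match",
    "our team won the game", "the game went to extra time",
    "great save by the keeper", "the coach praised the team",
    "new phone has a great camera", "the laptop battery died again",
    "this app update is so slow", "my phone screen cracked today",
    "the camera app keeps crashing", "bought a new laptop charger",
]
labels = ["sport"] * 6 + ["tech"] * 6

# Keeping the vectorizer inside the pipeline means its vocabulary is
# refit on the training part of every fold, so no fold sees test text
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(
        n_estimators=20, class_weight="balanced", random_state=0)),
])

# Each fold keeps roughly the same class proportions as the full data
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, tweets, labels, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Note that a cross-validation score like this only measures performance on data drawn from the same file as the training data, so it cannot by itself show whether the submission set looks different.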

Here is part of what I did, which might be causing some leakage (I simplified it a bit):

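import string

import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# load training data
train_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Train.csv')

# leave 10% for validation
# (.iloc slices are end-exclusive; the label-based .loc[:35685] and .loc[35685:]
# both include row 35685, so that row ended up in both sets)
train = train_set.iloc[:35685][["Tweet_ID", "tweet", "type"]]
# keep "type" as well, otherwise the validation set cannot be scored
validation = train_set.iloc[35685:][["Tweet_ID", "tweet", "type"]]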

# load the test set
submission_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Test.csv')

# load submission file
submission_file = pd.read_csv('gender-based-violence-tweet-classification-challenge/SampleSubmission.csv')

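# Build the stopword set once; rebuilding the list on every call is slow
STOPWORDS = set(stopwords.words("english"))

def preprocess_text(text):
    # Drop punctuation characters, then join the rest back into a string
    nopunc = "".join(char for char in text if char not in string.punctuation)

    # Remove stopwords and return a list of tokens. A callable analyzer must
    # return the sequence of features; if it returns a string, CountVectorizer
    # iterates over it character by character and counts single characters
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]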


X = train["tweet"]
y = train["type"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

pipe = Pipeline([
    ("vect", CountVectorizer(analyzer=preprocess_text)),
    ("clf", RandomForestClassifier(class_weight='balanced'))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
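
One kind of leak that a random split would not reveal is the same tweet text appearing on both sides of the split, for example through retweets or copied text. Whether this dataset actually contains such duplicates is an assumption, not something I have verified. A minimal sketch of the check, using only the standard library; `normalize` and `shared_tweets` are hypothetical helper names and the sample tweets are made up:

```python
def normalize(tweet):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(tweet.lower().split())


def shared_tweets(train_tweets, other_tweets):
    """Return the normalized texts that occur in both collections."""
    seen_in_train = {normalize(t) for t in train_tweets}
    seen_in_other = {normalize(t) for t in other_tweets}
    return sorted(seen_in_train & seen_in_other)


# Made-up example: the first tweet of each list differs only in case and spacing
train_sample = ["Hello   World", "Another tweet"]
held_out_sample = ["hello world", "Something new"]

overlap = shared_tweets(train_sample, held_out_sample)
print(len(overlap), "shared tweet(s):", overlap)
```

With the real data this could be called as `shared_tweets(X_train, X_test)` or `shared_tweets(train["tweet"], validation["tweet"])`. A non-empty result would mean the local scores are partly measured on text the model has already seen, which could make them look better than the submission score.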