数据挖掘 - 为什么我的时间序列数据在训练/测试拆分之前没有被洗牌，而在拆分之前它被洗牌时性能更差？ - 吾爱随笔录

我们在时间序列数据上运行 RandomForest 模型。该模型是实时运行的，并且每次添加新行时都会重新调整。由于它是时间序列数据，我们将 shuffle 设置为 false，同时拆分为训练和测试数据集。

我们观察到，当 shuffle 为 True 和 shuffle 为 false 时，分数会发生巨大变化

正在使用的代码如下

# Set shuffle = 'True' or 'False'
df = pandas.read_csv('data.csv', index_col=0)
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True)

count = 0 
predictions = [] 

for idx in X_test.index.values:
    # Train the model on training data
    # print(count, X_train.shape, y_train.shape)
    rf = RandomForestRegressor(n_estimators = 600, max_depth = 7, random_state = 12345)
    rf.fit(X_train, y_train)

    predictions.append(rf.predict(X_test.loc[X_test.index == idx]))
    # print(len(predictions))

    X_train.loc[len(X_train)] = X_test.loc[idx]
    y_train.loc[len(y_train)] = y_test.loc[idx]
    count+=1

最初，我们认为差异是由于数据的协方差变化造成的。但这不应该对连续拟合产生太大影响

这是预测图的屏幕截图

使用 shuffle = False

随机播放 = True