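I have 3 datasets, each divided into 3 separate classes [Buy/Hold/Sell]. In each dataset, I randomly upsample every class to 10,000 data points.
My question is: should I scale the training set before or after the random upsampling? Does the order distort the final training set in some way?
Below is the function I use to balance each dataset. Note that at this point the data has already been scaled.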
    import pandas as pd
    from sklearn.utils import resample

    def balance_dataset(df):
        # TRAINING_LENGTH is defined elsewhere (the fraction of rows used for training).
        training_set = df[:round(len(df) * TRAINING_LENGTH)]
        # Class labels ordered from least to most frequent.
        label_frequencies = training_set['Label'].value_counts(sort=True, ascending=True)
        # Resample each class with replacement to 10,000 rows.
        highest_occurrence = resample(training_set[training_set['Label'] == label_frequencies.index[2]], n_samples=10000, random_state=0, replace=True)
        middle_occurrence = resample(training_set[training_set['Label'] == label_frequencies.index[1]], n_samples=10000, random_state=0, replace=True)
        lowest_occurrence = resample(training_set[training_set['Label'] == label_frequencies.index[0]], n_samples=10000, random_state=0, replace=True)
        balanced_training_set = pd.concat([highest_occurrence, middle_occurrence, lowest_occurrence])
        # Return the feature columns and the labels separately.
        return balanced_training_set[['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA',
                                      'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT', 'ADX', 'ADX Negative', 'ADX Positive',
                                      'EMA', 'CRA']], balanced_training_set['Label']
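
The two orderings can be compared on a toy example. The sketch below is illustrative only: the synthetic single-feature data, the `upsample` helper, the 1,000-row target, and the choice of `StandardScaler` are all assumptions, not part of the setup above. It fits one scaler on the original training set and another on the upsampled set, then prints the standard deviation each scaler learned.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical imbalanced training set: one feature whose mean differs by class.
train = pd.DataFrame({
    "RSI": np.concatenate([
        rng.normal(30, 5, 100),   # Buy  (rare)
        rng.normal(50, 5, 800),   # Hold (common)
        rng.normal(70, 5, 100),   # Sell (rare)
    ]),
    "Label": ["Buy"] * 100 + ["Hold"] * 800 + ["Sell"] * 100,
})

def upsample(df, n_samples=1000):
    """Hypothetical helper: resample every class with replacement to n_samples rows."""
    parts = [
        resample(group, n_samples=n_samples, replace=True, random_state=0)
        for _, group in df.groupby("Label")
    ]
    return pd.concat(parts)

# Option A: fit the scaler on the original (imbalanced) training set.
scaler_before = StandardScaler().fit(train[["RSI"]])

# Option B: upsample first, then fit the scaler on the balanced set.
balanced = upsample(train)
scaler_after = StandardScaler().fit(balanced[["RSI"]])

print("std learned before upsampling:", scaler_before.scale_[0])
print("std learned after upsampling: ", scaler_after.scale_[0])
```

In this toy setup the scaler fitted after upsampling learns a larger standard deviation, because the rare classes sit far from the overall mean and carry more weight once the classes are balanced. So the order does change the scaling statistics here; whether that difference matters for a given model is a separate question.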