This is a solution that builds on @NoahWeber's and @etiennedm's answers. It is based on juxtaposing two splits: 1) a repeated k-fold split (to obtain training customers and testing customers), and 2) a time-series split within each k-fold split.
This strategy uses a custom CV split iterator based on dates for the time-series split (whereas the usual CV split iterators are based on sample size / number of folds).
An implementation within the sklearn ecosystem is provided.
Let us restate the problem.
Say you have 10 periods and 3 customers, indexed as follows:
```python
example_data = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    'customer': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'date': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})
```
We do a repeated k-fold with 2 folds and 2 iterations (4 splits in total), and within each k-fold split we split again with a time-series split, such that each time-series split has 2 folds:
kfold split 1: training customers are [0, 1] and testing customers are [2]
kfold split 1, time series split 1: train indices are [0, 1, 2, 3, 10, 11, 12, 13] and test indices are [24, 25, 26]
kfold split 1, time series split 2: train indices are [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16] and test indices are [27, 28, 29]
kfold split 2: training customers are [2] and testing customers are [0, 1]
kfold split 2, time series split 1: train indices are [20, 21, 22, 23] and test indices are [4, 5, 6, 7, 15, 16, 17]
kfold split 2, time series split 2: train indices are [20, 21, 22, 23, 24, 25, 26] and test indices are [7, 8, 9, 17, 18, 19]
kfold split 3: training customers are [0, 2] and testing customers are [1]
kfold split 3, time series split 1: train indices are [0, 1, 2, 3, 20, 21, 22, 23] and test indices are [14, 15, 16]
kfold split 3, time series split 2: train indices are [0, 1, 2, 3, 4, 5, 6, 20, 21, 22, 23, 24, 25, 26] and test indices are [17, 18, 19]
kfold split 4: training customers are [1] and testing customers are [0, 2]
kfold split 4, time series split 1: train indices are [10, 11, 12, 13] and test indices are [4, 5, 6, 24, 25, 26]
kfold split 4, time series split 2: train indices are [10, 11, 12, 13, 14, 15, 16] and test indices are [7, 8, 9, 27, 28, 29]
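The customer-level folds above come from a plain repeated k-fold over the customer ids. A minimal sketch of that outer loop (the exact fold composition depends on `random_state`, so the printed folds may not match the listing above exactly):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

customers = np.array([0, 1, 2])
# 2 folds x 2 repeats = 4 customer-level splits in total
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for i, (train_idx, test_idx) in enumerate(rkf.split(customers), start=1):
    print(f"kfold split {i}: training customers {customers[train_idx]}, "
          f"testing customers {customers[test_idx]}")
```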
Usually, cross-validation iterators, such as those in sklearn, are based on the number of folds, i.e., the sample size in each fold. Unfortunately, these are not suited to our k-fold / time-series split with real data: nothing guarantees that the data is perfectly distributed over time and across groups (as we assumed in the previous example).
For example, the 4th observation in the consumer training sample (say customers 0 and 1 in kfold split 1 of the example) may occur after the 4th observation in the test sample (say customer 2). This violates condition 1.
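This violation can be checked mechanically. A small sketch (the helper name is my own, not part of the implementation below) that verifies condition 1 on a single split, given the dates column and the train/test indices:

```python
import pandas as pd

def split_respects_time_order(dates, train_indices, test_indices):
    # Condition 1 holds iff every training date precedes every testing date
    return dates[train_indices].max() < dates[test_indices].min()

dates = pd.Series(pd.date_range('2020-01-01', periods=10, freq='D'))
print(split_respects_time_order(dates, [0, 1, 2, 3], [5, 6]))   # True: train ends before test starts
print(split_respects_time_order(dates, [0, 1, 6, 7], [4, 5]))   # False: train indices 6 and 7 fall after the test window
```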
Here is a CV split strategy based on the dates of the folds (rather than on sample size or the number of folds). Say you have the previous data, but with random dates. Define an initial_training_rolling_months and a rolling_window_months, e.g. 6 months and 1 month:
kfold split 1: training customers are [0, 1] and testing customers are [2]
kfold split 1, time series split 1: the training sample is the first 6 months of customers [0, 1], and the test sample is the month starting right after the training sample, for customer [2]
kfold split 1, time series split 2: the training sample is the first 7 months of customers [0, 1], and the test sample is the month starting right after the training sample, for customer [2]
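The date arithmetic behind this schedule can be sketched with pd.DateOffset, which is what the implementation below relies on (the start date here is made up):

```python
import pandas as pd

start_training_date = pd.Timestamp('2019-01-01')
initial_training_rolling_months, rolling_window_months = 6, 1

# Initial training window covers the first 6 months; each split then
# extends training by one month and tests on the following month.
end_training_date = start_training_date + pd.DateOffset(months=initial_training_rolling_months)
for split_i in (1, 2):
    start_testing_date = end_training_date
    end_testing_date = start_testing_date + pd.DateOffset(months=rolling_window_months)
    print(f"time series split {split_i}: train [{start_training_date.date()}, {end_training_date.date()}), "
          f"test [{start_testing_date.date()}, {end_testing_date.date()})")
    end_training_date += pd.DateOffset(months=rolling_window_months)
```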
Below is a suggested implementation for building such a time-series split iterator.
The returned iterator is a list of tuples that you can use like any other cross-validation iterator.
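Any iterable of (train_indices, test_indices) pairs is accepted as the cv argument by sklearn's model-selection utilities, which is what makes this approach plug-and-play. A quick self-contained demonstration (the data and index pairs are made up):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.random((30, 3))
y = np.array([0, 1] * 15)

# Two hand-written (train_indices, test_indices) pairs used directly as cv
custom_cv = [
    (list(range(0, 20)), list(range(20, 25))),
    (list(range(0, 25)), list(range(25, 30))),
]
scores = cross_val_score(LogisticRegression(), X, y, cv=custom_cv)
print(scores)  # one score per (train, test) pair
```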
With simple generated data, as in our previous example, to debug the fold generation; note that customer 1 (resp. 2) data begins at index 366 (resp. 732):
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

df = generate_happy_case_dataframe()
grouped_ts_validation_iterator = build_grouped_ts_validation_iterator(df)
gridsearch = GridSearchCV(estimator=RandomForestClassifier(), cv=grouped_ts_validation_iterator, param_grid={})
gridsearch.fit(df[['feat0', 'feat1', 'feat2', 'feat3', 'feat4']].values, df['label'].values)
gridsearch.predict([[0.1, 0.2, 0.1, 0.4, 0.1]])
```
With randomly generated data, as in @etiennedm's example (to debug the splits, I included simple cases, such as when the test sample begins before or after the training sample):
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

df = generate_fake_random_dataframe()
grouped_ts_validation_iterator = build_grouped_ts_validation_iterator(df)
gridsearch = GridSearchCV(estimator=RandomForestClassifier(), cv=grouped_ts_validation_iterator, param_grid={})
gridsearch.fit(df[['feat0', 'feat1', 'feat2', 'feat3', 'feat4']].values, df['label'].values)
gridsearch.predict([[0.1, 0.2, 0.1, 0.4, 0.1]])
```
The implementation:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import RepeatedKFold


def generate_fake_random_dataframe(start=pd.to_datetime('2015-01-01'), end=pd.to_datetime('2018-01-01')):
    fake_date = generate_fake_dates(start, end, 500)
    df = pd.DataFrame(data=np.random.random((500, 5)), columns=['feat' + str(i) for i in range(5)])
    df['customer_id'] = np.random.randint(0, 5, 500)
    df['label'] = np.random.randint(0, 3, 500)
    df['dates'] = fake_date
    df = df.reset_index()  # important since df.index will be used as split index
    return df


def generate_fake_dates(start, end, n):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.DatetimeIndex((10**9 * np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))


def generate_happy_case_dataframe(start=pd.to_datetime('2019-01-01'), end=pd.to_datetime('2020-01-01')):
    dates = pd.date_range(start, end)
    length_year = len(dates)
    length_df = length_year * 3
    df = pd.DataFrame(data=np.random.random((length_df, 5)), columns=['feat' + str(i) for i in range(5)])
    df['label'] = np.random.randint(0, 3, length_df)
    df['dates'] = list(dates) * 3
    df['customer_id'] = [0] * length_year + [1] * length_year + [2] * length_year
    return df


def build_grouped_ts_validation_iterator(df, kfold_n_split=2, kfold_n_repeats=5, initial_training_rolling_months=6, rolling_window_months=1):
    rkf = RepeatedKFold(n_splits=kfold_n_split, n_repeats=kfold_n_repeats, random_state=42)
    CV_iterator = list()
    # note: rkf.split yields positional indices into the unique array; these are
    # usable directly as customer ids because ids are assumed to be 0..n_customers-1
    for train_customers_ids, test_customers_ids in rkf.split(df['customer_id'].unique()):
        print("rkf training/testing with customers : " + str(train_customers_ids) + "/" + str(test_customers_ids))
        this_k_fold_ts_split = split_with_dates_for_validation(df=df,
                                                               train_customers_ids=train_customers_ids,
                                                               test_customers_ids=test_customers_ids,
                                                               initial_training_rolling_months=initial_training_rolling_months,
                                                               rolling_window_months=rolling_window_months)
        print("In this k fold, there are", len(this_k_fold_ts_split), 'time series splits')
        for split_i, split in enumerate(this_k_fold_ts_split):
            print("for this ts split number", str(split_i))
            print("train ids is len", len(split[0]), 'and are:', split[0])
            print("test ids is len", len(split[1]), 'and are:', split[1])
        CV_iterator.extend(this_k_fold_ts_split)
        print('***')
    return tuple(CV_iterator)


def split_with_dates_for_validation(df, train_customers_ids, test_customers_ids, initial_training_rolling_months=6, rolling_window_months=1):
    start_train_df_date, end_train_df_date, start_test_df_date, end_test_df_date = \
        fetch_extremas_train_test_df_dates(df, train_customers_ids, test_customers_ids)
    start_training_date, end_training_date, start_testing_date, end_testing_date = \
        initialize_training_dates(start_train_df_date, start_test_df_date, initial_training_rolling_months, rolling_window_months)
    ts_splits = list()
    while not stop_time_series_split_decision(end_train_df_date, end_test_df_date, end_training_date, end_testing_date, rolling_window_months):
        # The while implies that if the testing sample is less than one month, then the process stops
        this_ts_split_training_indices = fetch_this_split_training_indices(df, train_customers_ids, start_training_date, end_training_date)
        this_ts_split_testing_indices = fetch_this_split_testing_indices(df, test_customers_ids, start_testing_date, end_testing_date)
        if this_ts_split_testing_indices:
            # If testing data is not empty, i.e. something to learn
            ts_splits.append((this_ts_split_training_indices, this_ts_split_testing_indices))
        start_training_date, end_training_date, start_testing_date, end_testing_date = \
            update_testing_training_dates(start_training_date, end_training_date, start_testing_date, end_testing_date, rolling_window_months)
    return ts_splits


def fetch_extremas_train_test_df_dates(df, train_customers_ids, test_customers_ids):
    train_df, test_df = df.loc[df['customer_id'].isin(train_customers_ids)], df.loc[df['customer_id'].isin(test_customers_ids)]
    start_train_df_date, end_train_df_date = min(train_df['dates']), max(train_df['dates'])
    start_test_df_date, end_test_df_date = min(test_df['dates']), max(test_df['dates'])
    return start_train_df_date, end_train_df_date, start_test_df_date, end_test_df_date


def initialize_training_dates(start_train_df_date, start_test_df_date, initial_training_rolling_months, rolling_window_months):
    start_training_date = start_train_df_date
    # cover the case where test consumers begin long after (initial_training_rolling_months after) train consumers
    if start_training_date + pd.DateOffset(months=initial_training_rolling_months) < start_test_df_date:
        start_training_date = start_test_df_date - pd.DateOffset(months=initial_training_rolling_months)
    end_training_date = start_training_date + pd.DateOffset(months=initial_training_rolling_months)
    start_testing_date = end_training_date
    end_testing_date = start_testing_date + pd.DateOffset(months=rolling_window_months)
    return start_training_date, end_training_date, start_testing_date, end_testing_date


def stop_time_series_split_decision(end_train_df_date, end_test_df_date, end_training_date, end_testing_date, rolling_window_months):
    no_more_training_data_stopping_condition = end_training_date + pd.DateOffset(months=rolling_window_months) > end_train_df_date
    no_more_testing_data_stopping_condition = end_testing_date + pd.DateOffset(months=rolling_window_months) > end_test_df_date
    stopping_condition = no_more_training_data_stopping_condition or no_more_testing_data_stopping_condition
    return stopping_condition


def update_testing_training_dates(start_training_date, end_training_date, start_testing_date, end_testing_date, rolling_window_months):
    # training start stays anchored; training end and the testing window roll forward
    end_training_date += pd.DateOffset(months=rolling_window_months)
    start_testing_date += pd.DateOffset(months=rolling_window_months)
    end_testing_date += pd.DateOffset(months=rolling_window_months)
    return start_training_date, end_training_date, start_testing_date, end_testing_date


def fetch_this_split_training_indices(df, train_customers_ids, start_training_date, end_training_date):
    train_df = df.loc[df['customer_id'].isin(train_customers_ids)]
    in_training_period_df = train_df.loc[(train_df['dates'] >= start_training_date) & (train_df['dates'] < end_training_date)]
    this_ts_split_training_indices = in_training_period_df.index.to_list()
    return this_ts_split_training_indices


def fetch_this_split_testing_indices(df, test_customers_ids, start_testing_date, end_testing_date):
    test_df = df.loc[df['customer_id'].isin(test_customers_ids)]
    in_testing_period_df = test_df.loc[(test_df['dates'] >= start_testing_date) & (test_df['dates'] < end_testing_date)]
    this_ts_split_testing_indices = in_testing_period_df.index.to_list()
    return this_ts_split_testing_indices
```