Grouped time series forecasting with scikit-hts

data-mining python time-series lstm forecasting arima
2022-02-12 11:22:23

I am trying to forecast sales for multiple time series that I took from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, which gives 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and yearly seasonalities.

In total there are: 365.2 days * 5 years * 10 stores * 50 items = 913,000 records.

From what I have understood so far about hierarchical and grouped time series, the whole dataframe should be structured as a grouped time series rather than a strictly hierarchical one, since the aggregation can be done interchangeably at the store level or at the item level.
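To illustrate (a minimal sketch for the 2-store / 2-item subset, mirroring the hierarchy dict built in the full code below): in a grouped structure, both the store nodes and the item nodes sit directly under the root, since either one is a valid way to disaggregate the total.

# Sketch: the total can be split by store OR by item, and the leaves
# are the store-item combinations
hierarchy = {
    'total': ['store_1', 'store_2', 'item_1', 'item_2'],
    'store_1': ['store_1_item_1', 'store_1_item_2'],
    'store_2': ['store_2_item_1', 'store_2_item_2'],
}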

I would like to find a way to forecast all 500 time series (for store1_item1, store1_item2, ..., store10_item50) for the next year (from 2015-01-01 to 2015-12-31) with the scikit-hts library and its AutoArimaModel function, which is a wrapper around pmdarima's auto_arima function.

To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the yearly seasonality, while auto_arima deals with the weekly seasonality.
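As a quick sanity check on what the featurizer emits (a minimal sketch; with k=1, pmdarima's FourierFeaturizer returns one sine/cosine pair, so each of the 4 series in my reduced train set contributes 2 exogenous columns, i.e. 8 in total):

import numpy as np
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer

# Two years of dummy daily data
y = pd.Series(np.random.rand(730))

# m=365.2 -> yearly period for daily data, k=1 -> one sin/cos harmonic pair
_, exog = FourierFeaturizer(365.2, 1).fit_transform(y)

print(exog.shape)  # (730, 2): one sine and one cosine column per series
# The generated column names contain 'FOURIER', which the exogenous
# mapping further down relies on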

My problem is that I get an error at the prediction step.

Here is the error message:

ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).

I think there is something wrong with the exogenous dictionary, but I do not know how to fix it, as this is the first time I am using scikit-hts. I followed the official scikit-hts documentation for this.
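To make the mismatch visible, here is a hedged diagnostic sketch (it reuses the exogenous dict and exog_test_df from the full code below): each node is fitted with only its own slice of Fourier columns, so a node fitted with 4 columns would reject a test frame carrying all 8.

# Diagnostic sketch: how many exog columns each node was fitted with,
# versus how many columns the test exog frame provides
for node, cols in exogenous.items():
    print(node, '->', len(cols), 'exogenous columns at fit time')
print('exog_test_df provides', exog_test_df.shape[1], 'columns')
# e.g. a store-level node fitted with 4 columns that receives all 8 test
# columns would raise: Required (365, 4), got (365, 8)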

EDIT: ______________________________________________________________

I have not seen a similar error reported on GitHub. After applying a suggested fix locally, I can get some results. However, even though the code now runs without errors, some of the forecasts are negative, as raised in the comments below this post, and we even get disproportionately large positive values.
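As a stop-gap for the sign issue only (it does not fix the underlying model), the reconciled forecasts could be floored at zero:

# Workaround sketch: daily sales cannot be negative, so clip at 0
predictions = predictions.clip(lower=0)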

Here are the plots for all combinations of store and item. As you can see, this only seems to work for one combination.

df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_1_item_1]

df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_1_item_2]

df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_2_item_1]

df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_2_item_2]

_____________________________________________________________________

Full code:

# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor


# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)

# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']


# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]

# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]


# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)

# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
    
store_item_ts.columns = col_names

# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(axis=1)

# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)


# Build Fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)

# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()

for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns= [f'store_{i}_item_{j}_'+ x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)

# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis= 1)


# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()

for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns= [f'store_{i}_item_{j}_'+ x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)


# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names

# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}  
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}

# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}

# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}

# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)

# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)

# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)

# Make predictions
# Set the exogenous_df param 
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)

Other approaches I have thought of, and which I have already successfully implemented for a single series (e.g. for store 1 and item 1):

  • TBATS, applied to each series independently inside a loop over all 500 time series

  • auto_arima (SARIMAX) with per-series exogenous features (= Fourier terms handling both the weekly and yearly seasonalities), inside a loop over all 500 time series (see the sketch right after this list)
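A minimal sketch of that second approach, assuming the same train_data layout as above (the number of Fourier harmonics k=4 and the loop bounds are illustrative):

import pmdarima as pm
from pmdarima.preprocessing import FourierFeaturizer

models = {}
for store in range(1, 11):
    for item in range(1, 51):
        y = train_data.query(f'store == {store} and item == {item}').sales
        # Fourier terms capture the yearly seasonality as exogenous features
        _, exog = FourierFeaturizer(365.2, 4).fit_transform(y)
        # m=7 lets auto_arima model the weekly seasonality itself;
        # recent pmdarima takes exogenous data via X= (older versions: exogenous=)
        models[f'store_{store}_item_{item}'] = pm.auto_arima(
            y, X=exog, m=7, seasonal=True, suppress_warnings=True)
        # forecasting then needs matching out-of-sample Fourier terms:
        # models[...].predict(n_periods=365, X=future_exog)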

What do you think of these approaches? Do you have any other suggestions on how to scale ARIMA up to multiple time series?

I would also like to try LSTMs, but I am new to data science and deep learning and do not know how to prepare the data. Should I keep the data in its raw (long) format and apply one-hot encoding to the train_data['store'] and train_data['item'] columns, or should I start from the df I ended up with above?
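For the first option, this is roughly the framing I have in mind (purely a sketch; the window length is an arbitrary assumption, and the window is flattened here for brevity, whereas an LSTM would reshape it to (samples, window, features)):

import numpy as np
import pandas as pd

# Group key per row, computed before the identifiers are one-hot encoded
keys = (train_data['store'].astype(str) + '_' + train_data['item'].astype(str)).to_numpy()

# One-hot encode the series identifiers in the long-format data
encoded = pd.get_dummies(train_data, columns=['store', 'item'])

window = 28  # arbitrary look-back length
X, y = [], []
for _, g in encoded.groupby(keys):
    sales = g['sales'].to_numpy(dtype=float)
    onehot = g.drop(columns='sales').iloc[0].to_numpy(dtype=float)
    for t in range(window, len(sales)):
        # past `window` days of sales + static one-hot series identifier
        X.append(np.concatenate([sales[t - window:t], onehot]))
        y.append(sales[t])
X, y = np.asarray(X), np.asarray(y)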
