Grouped time series forecasting with scikit-hts

data-mining python time-series lstm forecasting arima
2022-02-12 11:22:23

I am trying to forecast sales for multiple time series that I took from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, which gives 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and yearly seasonalities.

In total there are: 365.2 days * 5 years * 10 stores * 50 items = 913,000 records.

From what I have understood so far about hierarchical and grouped time series, the whole dataframe should be structured as a grouped time series rather than a strictly hierarchical one, since the aggregation can be done interchangeably at the store level or at the item level.
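To illustrate (a minimal sketch for the 2-store / 2-item subset, mirroring the hierarchy dict built in the full code below): in a grouped structure, both the store nodes and the item nodes sit directly under the root, since either one is a valid way to disaggregate the total.

# Sketch: the total can be split by store OR by item, and the leaves
# are the store-item combinations
hierarchy = {
    'total': ['store_1', 'store_2', 'item_1', 'item_2'],
    'store_1': ['store_1_item_1', 'store_1_item_2'],
    'store_2': ['store_2_item_1', 'store_2_item_2'],
}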

I would like to find a way to forecast all 500 time series (for store1_item1, store1_item2, ..., store10_item50) for the next year (from 2015-01-01 to 2015-12-31) with the scikit-hts library and its AutoArimaModel function, which is a wrapper around pmdarima's auto_arima function.

To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the yearly seasonality, while auto_arima deals with the weekly seasonality.
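As a quick sanity check on what the featurizer emits (a minimal sketch; with k=1, pmdarima's FourierFeaturizer returns one sine/cosine pair, so each of the 4 series in my reduced train set contributes 2 exogenous columns, i.e. 8 in total):

import numpy as np
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer

# Two years of dummy daily data
y = pd.Series(np.random.rand(730))

# m=365.2 -> yearly period for daily data, k=1 -> one sin/cos harmonic pair
_, exog = FourierFeaturizer(365.2, 1).fit_transform(y)

print(exog.shape)  # (730, 2): one sine and one cosine column per series
# The generated column names contain 'FOURIER', which the exogenous
# mapping further down relies on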

My problem is that I get an error at the prediction step.

Here is the error message:

ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).

I think there is something wrong with the exogenous dictionary, but I do not know how to fix it, as this is the first time I am using scikit-hts. I followed the official scikit-hts documentation for this.
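To make the mismatch visible, here is a hedged diagnostic sketch (it reuses the exogenous dict and exog_test_df from the full code below): each node is fitted with only its own slice of Fourier columns, so a node fitted with 4 columns would reject a test frame carrying all 8.

# Diagnostic sketch: how many exog columns each node was fitted with,
# versus how many columns the test exog frame provides
for node, cols in exogenous.items():
    print(node, '->', len(cols), 'exogenous columns at fit time')
print('exog_test_df provides', exog_test_df.shape[1], 'columns')
# e.g. a store-level node fitted with 4 columns that receives all 8 test
# columns would raise: Required (365, 4), got (365, 8)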

EDIT: ______________________________________________________________

I have not seen a similar error reported on GitHub. After applying a suggested fix locally, I can get some results. However, even though the code now runs without errors, some of the forecasts are negative, as raised in the comments below this post, and we even get disproportionately large positive values.
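As a stop-gap for the sign issue only (it does not fix the underlying model), the reconciled forecasts could be floored at zero:

# Workaround sketch: daily sales cannot be negative, so clip at 0
predictions = predictions.clip(lower=0)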

Here are the plots for all combinations of store and item. As you can see, this only seems to work for one combination.

df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_1_item_1]

df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_1_item_2]

df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_2_item_1]

df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()

[plot: 2014 actuals vs. 2015 forecast for store_2_item_2]

_____________________________________________________________________

Full code:

# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor


# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)

# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']


# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]

# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]


# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)

# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
    
store_item_ts.columns = col_names

# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(axis=1)

# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)


# Build Fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)

# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()

for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns= [f'store_{i}_item_{j}_'+ x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)

# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis= 1)


# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()

for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns= [f'store_{i}_item_{j}_'+ x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)


# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names

# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}  
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}

# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}

# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}

# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)

# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)

# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)

# Make predictions
# Set the exogenous_df param 
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)

Other approaches I have thought of, and which I have already successfully implemented for a single series (e.g. for store 1 and item 1):

  • TBATS, applied to each series independently inside a loop over all 500 time series

  • auto_arima (SARIMAX) with per-series exogenous features (= Fourier terms handling both the weekly and yearly seasonalities), inside a loop over all 500 time series (see the sketch right after this list)
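A minimal sketch of that second approach, assuming the same train_data layout as above (the number of Fourier harmonics k=4 and the loop bounds are illustrative):

import pmdarima as pm
from pmdarima.preprocessing import FourierFeaturizer

models = {}
for store in range(1, 11):
    for item in range(1, 51):
        y = train_data.query(f'store == {store} and item == {item}').sales
        # Fourier terms capture the yearly seasonality as exogenous features
        _, exog = FourierFeaturizer(365.2, 4).fit_transform(y)
        # m=7 lets auto_arima model the weekly seasonality itself;
        # recent pmdarima takes exogenous data via X= (older versions: exogenous=)
        models[f'store_{store}_item_{item}'] = pm.auto_arima(
            y, X=exog, m=7, seasonal=True, suppress_warnings=True)
        # forecasting then needs matching out-of-sample Fourier terms:
        # models[...].predict(n_periods=365, X=future_exog)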

What do you think of these approaches? Do you have any other suggestions on how to scale ARIMA up to multiple time series?

I would also like to try LSTMs, but I am new to data science and deep learning and do not know how to prepare the data. Should I keep the data in its raw (long) format and apply one-hot encoding to the train_data['store'] and train_data['item'] columns, or should I start from the df I ended up with above?
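For the first option, this is roughly the framing I have in mind (purely a sketch; the window length is an arbitrary assumption, and the window is flattened here for brevity, whereas an LSTM would reshape it to (samples, window, features)):

import numpy as np
import pandas as pd

# Group key per row, computed before the identifiers are one-hot encoded
keys = (train_data['store'].astype(str) + '_' + train_data['item'].astype(str)).to_numpy()

# One-hot encode the series identifiers in the long-format data
encoded = pd.get_dummies(train_data, columns=['store', 'item'])

window = 28  # arbitrary look-back length
X, y = [], []
for _, g in encoded.groupby(keys):
    sales = g['sales'].to_numpy(dtype=float)
    onehot = g.drop(columns='sales').iloc[0].to_numpy(dtype=float)
    for t in range(window, len(sales)):
        # past `window` days of sales + static one-hot series identifier
        X.append(np.concatenate([sales[t - window:t], onehot]))
        y.append(sales[t])
X, y = np.asarray(X), np.asarray(y)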
