What is a better way to cross-validate a model with time-dependent predictors?

data-mining machine-learning predictive-modeling time-series cross-validation
2021-10-08 13:38:25

I have been given a dataset with different predictors about a store, and the idea is to predict the number of daily shoppers. The predictors are: weekday, time of day (morning, afternoon, evening), week number, month, weather (humidity, dew point, temperature), and holiday. The outcome variable is the number of visitors.

I want to build a regression model that predicts the number of visitors using traditional machine learning algorithms (random forest, SVM, etc.).

My main concern is how to validate this model with CV, since some of the predictors are time-dependent, so ordinary CV cannot be applied here. In this question a way of doing it is proposed, but my problem is that I only have data from June 2015 onwards.

My initial idea was:

  • Train on the data from June 2015 to December 2015; test on January 2016.
  • Train on June 2015 to January 2016; test on February 2016.

Each time, after evaluating the error on the test month, that month is added to the training data, and at the end the average performance is computed.
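In code, the scheme I have in mind looks roughly like the following sketch (the column names date and visitors, the feature list and the random-forest settings are just placeholders, not my actual data):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def expanding_monthly_cv(df, feature_cols, target_col='visitors',
                         first_test_month='2016-01'):
    """Expanding-window CV over calendar months: for each test month, train on
    all earlier months, test on that month, and return the per-month RMSEs."""
    df = df.sort_values('date')
    months = df['date'].dt.to_period('M')  # 'date' is assumed to be datetime64
    test_months = [m for m in months.unique() if m >= pd.Period(first_test_month)]

    rmses = []
    for m in test_months:
        train, test = df[months < m], df[months == m]
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(train[feature_cols], train[target_col])
        pred = model.predict(test[feature_cols])
        rmses.append(np.sqrt(mean_squared_error(test[target_col], pred)))
    return pd.Series(rmses, index=test_months)

# average performance over all held-out months:
# expanding_monthly_cv(df, ['weekday', 'week', 'humidity', 'dewpoint', 'temperature', 'holiday']).mean()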

My questions:

  1. Is this approach reasonable?

  2. If so, should I get rid of the month variable? Note that, in the first scheme for example, I am testing on data from a month that is different from every month used for training: I train on June 2015 through December 2015 but test on January 2016. Seasonality may be something I am missing.

  3. How are such models generally validated?

2 Answers

Your approach in 1) is correct: use the first n ordered observations to predict observation n+1. If you feel that the amount of data you have is small, you will have to settle on an appropriate forecasting window and on less flexible models.

Don't forget feature engineering and data preparation. The seasonality you mention can be removed if you identify it correctly.
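As an illustration of the kind of feature engineering meant here (only a sketch, assuming a datetime 'date' column as in the question), the seasonal cycle can either be modelled out or handed to the model explicitly, for example with sine/cosine encodings of month and weekday:

import numpy as np

def add_cyclic_features(df):
    """Encode month and weekday as sine/cosine pairs so the seasonal cycle
    wraps around (December sits next to January, Sunday next to Monday)."""
    df = df.copy()
    month = df['date'].dt.month          # 1..12
    weekday = df['date'].dt.dayofweek    # 0..6
    df['month_sin'] = np.sin(2 * np.pi * month / 12)
    df['month_cos'] = np.cos(2 * np.pi * month / 12)
    df['weekday_sin'] = np.sin(2 * np.pi * weekday / 7)
    df['weekday_cos'] = np.cos(2 * np.pi * weekday / 7)
    return df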

Validation of the model then proceeds as usual, with a squared loss function.
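Concretely, "as usual with a squared loss" just means scoring each held-out period with MSE or RMSE and averaging the results over the folds, e.g.:

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared loss."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))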

One way to handle time-series cross-validation is to look at the following Python code, from here:

import numpy as np

def performTimeSeriesCV(X_train, y_train, number_folds, algorithm, parameters):
    """
    Given X_train and y_train (the test set is excluded from the cross-validation),
    the number of folds, the ML algorithm to implement and the parameters to test,
    the function splits X_train and y_train into number_folds folds, then trains
    on an expanding window of folds and tests on the following one:
    - Train on fold 1, test on fold 2
    - Train on folds 1-2, test on fold 3
    - Train on folds 1-2-3, test on fold 4
    ...
    Returns the mean of the test accuracies.
    """
    print('Parameters -------------------------------->', parameters)
    print('Size train set:', X_train.shape)

    # k is the size of each fold. It is computed by dividing the number of
    # rows in X_train by number_folds, floored and coerced to int
    k = int(np.floor(float(X_train.shape[0]) / number_folds))
    print('Size of each fold:', k)

    # initialize the accuracies array to zero. It is important to stress that
    # in time-series CV with n folds only n-1 folds are tested, as the first
    # one is always needed for training
    accuracies = np.zeros(number_folds - 1)

    # loop from the first 2 folds to the total number of folds
    for i in range(2, number_folds + 1):
        print('')

        # split is the fraction at which to divide the first i folds into
        # train and test. For example, when i = 2 we take the first 2 folds
        # out of the total available and split them in half (train on the
        # first, test on the second), so split = 1/2 = 0.5 = 50%. When i = 3
        # we take the first 3 folds and split them at 2/3 = 0.66 = 66%
        # (train on the first 2, test on the following one).
        split = float(i - 1) / i

        # example with i = 4 (first 4 folds):
        #      Splitting the first 4 chunks at 3/4
        print('Splitting the first ' + str(i) + ' chunks at ' + str(i - 1) + '/' + str(i))

        # as we loop over the folds, X and y grow: they hold the data of the
        # first i folds, which is then split into train and test.
        # If k = 300, with i starting from 2, the loop gives:
        # i = 2 -> X = X_train[:600],  y = y_train[:600]
        # i = 3 -> X = X_train[:900],  y = y_train[:900]
        # ...
        X = X_train[:(k * i)]
        y = y_train[:(k * i)]
        print('Size of train + test:', X.shape)  # the size is k*i rows

        # X and y contain both the folds used to train and the fold to test.
        # index is the row at which to split, according to the split
        # fraction set above
        index = int(np.floor(X.shape[0] * split))

        # folds used to train the model
        X_trainFolds = X[:index]
        y_trainFolds = y[:index]

        # fold used to test the model
        X_testFold = X[index:]
        y_testFold = y[index:]

        # i starts from 2, so the zeroth element of accuracies is filled at i = 2.
        # performClassification() is a function that takes care of fitting and
        # scoring the model; it is only an example and you can replace it with
        # whatever ML approach you need.
        accuracies[i - 2] = performClassification(X_trainFolds, y_trainFolds,
                                                  X_testFold, y_testFold,
                                                  algorithm, parameters)

        # example with i = 4:
        #      Accuracy on fold 4: 0.85423
        print('Accuracy on fold ' + str(i) + ':', accuracies[i - 2])

    # the function returns the mean accuracy over the n-1 tested folds
    return accuracies.mean()
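performClassification() is not defined above; as a stand-in for the regression problem in the question (the random-forest choice and the use of score(), i.e. R², are my own assumptions), something like this could be plugged in:

from sklearn.ensemble import RandomForestRegressor

def performClassification(X_trainFolds, y_trainFolds, X_testFold, y_testFold,
                          algorithm, parameters):
    """Hypothetical fit/score routine: fit a random forest on the training
    folds and return its R^2 on the held-out fold (algorithm is ignored here)."""
    model = RandomForestRegressor(**parameters)
    model.fit(X_trainFolds, y_trainFolds)
    return model.score(X_testFold, y_testFold)

# example call: 5 expanding folds, i.e. 4 held-out evaluations are averaged
# performTimeSeriesCV(X_train, y_train, 5, 'rf', {'n_estimators': 200, 'random_state': 0})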

If, on the other hand, you prefer R, you can explore the time-slice method in the caret package with the following code:

library(caret)
library(ggplot2)   # the economics data set ships with ggplot2
data(economics)

# rolling-origin resampling: each training slice holds 36 consecutive
# observations, each test slice the following 12; fixedWindow = TRUE keeps
# the training window at a constant length instead of letting it expand
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = TRUE)

# partial least squares model, tuned and validated with the time slices above
plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics,
                    method = "pls",
                    preProc = c("center", "scale"),
                    trControl = myTimeControl)