One way to handle cross-validation for time series is the following Python code, taken from here:
import numpy as np


def performTimeSeriesCV(X_train, y_train, number_folds, algorithm, parameters):
    """
    Given X_train and y_train (the test set is excluded from the cross-validation),
    the number of folds, the ML algorithm to use and the parameters to test,
    this function splits X_train and y_train into number_folds folds. It then
    trains on the earlier folds and tests accuracy on the next one:
    - Train on fold 1, test on 2
    - Train on folds 1-2, test on 3
    - Train on folds 1-2-3, test on 4
    ....
    Returns the mean of the test accuracies.
    """
    print('Parameters --------------------------------> ', parameters)
    print('Size train set: ', X_train.shape)

    # k is the size of each fold: the number of rows in X_train divided by
    # number_folds, rounded down to an integer.
    k = X_train.shape[0] // number_folds
    print('Size of each fold: ', k)

    # Initialize the accuracies array to zero. In time series CV, n folds give
    # only n-1 test scores, because the first fold is always needed for training.
    accuracies = np.zeros(number_folds - 1)

    # Loop from the first 2 folds to the total number of folds.
    for i in range(2, number_folds + 1):
        print('')

        # split is the fraction at which the first i folds are divided into
        # train and test. When i = 2 we take the first 2 folds and train on the
        # first, test on the second, so split = 1/2 = 50%. When i = 3 we take
        # the first 3 folds, train on the first 2 and test on the third, so
        # split = 2/3 = 66%.
        split = float(i - 1) / i

        # Example with i = 4 (first 4 folds):
        #      Splitting the first 4 chunks at 3/4
        print('Splitting the first ' + str(i) + ' chunks at ' + str(i - 1) + '/' + str(i))

        # X and y grow at every iteration as more folds are included.
        # If k = 300, with i starting from 2:
        # i = 2 -> X = X_train[:600], y = y_train[:600]
        # i = 3 -> X = X_train[:900], y = y_train[:900]
        # ....
        X = X_train[:(k * i)]
        y = y_train[:(k * i)]
        print('Size of train + test: ', X.shape)  # k*i rows

        # X and y contain both the folds to train on and the fold to test on.
        # index is the row at which to split them. It equals X.shape[0] * split,
        # computed with integers here to avoid floating-point rounding.
        index = k * (i - 1)

        # folds used to train the model
        X_trainFolds = X[:index]
        y_trainFolds = y[:index]

        # fold used to test the model (starts at index, so no row is skipped)
        X_testFold = X[index:]
        y_testFold = y[index:]

        # i starts from 2, so the first element of accuracies is at i-2.
        # performClassification() is a function that handles a classification
        # problem. It is only an example; replace it with whatever ML approach
        # you need.
        accuracies[i - 2] = performClassification(X_trainFolds, y_trainFolds,
                                                  X_testFold, y_testFold,
                                                  algorithm, parameters)

        # Example with i = 4:
        #      Accuracy on fold 4 : 0.85423
        print('Accuracy on fold ' + str(i) + ': ', accuracies[i - 2])

    # the function returns the mean of the accuracy on the n-1 folds
    return accuracies.mean()
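The function above calls performClassification() without defining it. As a rough illustration only, here is one way such a helper could look. It assumes that `algorithm` is an estimator class following the scikit-learn style `fit`/`score` convention and that `parameters` is a dict of constructor arguments; neither assumption comes from the original code. `MajorityClassifier` is a hypothetical toy estimator included only so the sketch runs without any third-party library.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common training label."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        # Accuracy: fraction of test labels equal to the remembered label.
        return sum(1 for label in y if label == self.label_) / len(y)


def performClassification(X_train, y_train, X_test, y_test, algorithm, parameters):
    """Fit algorithm(**parameters) on the training folds and return test accuracy.

    Assumption: `algorithm` is a class exposing fit(X, y) and score(X, y),
    and `parameters` is a dict passed to its constructor.
    """
    model = algorithm(**parameters)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Tiny usage example with plain lists.
X_train, y_train = [[0], [1], [2], [3]], [1, 1, 0, 1]
X_test, y_test = [[4], [5]], [1, 0]
accuracy = performClassification(X_train, y_train, X_test, y_test,
                                 MajorityClassifier, {})
print(accuracy)
```

With a helper shaped like this, any estimator class with the same two methods could be passed as `algorithm`.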
On the other hand, if you prefer R, you can explore the timeslice method in the caret package and use the following code:
library(caret)
library(ggplot2)
data(economics)
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = TRUE)
plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics,
                    method = "pls",
                    preProc = c("center", "scale"),
                    trControl = myTimeControl)
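To make the timeslice settings more concrete, the sketch below enumerates the train and test windows that a rolling-origin scheme with an initial window, a horizon and a fixed window would produce. It is written in Python to match the first example and reflects my understanding of how caret builds its slices; it is not caret's own code, and the function name `time_slices` is hypothetical. Indices are 0-based here, whereas R uses 1-based indices.

```python
def time_slices(n_samples, initial_window, horizon, fixed_window=True):
    """Return a list of (train_indices, test_indices) pairs, 0-based.

    Each training window ends where its test window of `horizon` rows begins.
    With fixed_window=True the training window keeps a constant length and
    slides forward; otherwise it always starts at row 0 and grows.
    """
    slices = []
    # The last training window must still leave `horizon` rows for testing.
    for stop in range(initial_window, n_samples - horizon + 1):
        start = stop - initial_window if fixed_window else 0
        train = list(range(start, stop))
        test = list(range(stop, stop + horizon))
        slices.append((train, test))
    return slices


# Example: 60 observations with the same window settings as the R code.
slices = time_slices(60, initial_window=36, horizon=12, fixed_window=True)
print(len(slices))                         # 13
first_train, first_test = slices[0]
print(first_train[0], first_train[-1])     # 0 35
print(first_test[0], first_test[-1])       # 36 47
```

Setting `fixed_window=False` gives the expanding-window behaviour of the Python function earlier in this answer, where every training set starts from the first observation.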