How does H2O handle time-series cross-validation?

machine learning validation  time series  cross-validation
2022-04-13 08:44:21

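I have read "How does cross-validation work in h2o.r?". However, for time-series datasets, does H2O support the type of CV described in "Using k-fold cross-validation for time-series model selection", specifically something like this: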

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
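
For reference, the expanding-window scheme listed above can be generated in a few lines. This is only a minimal sketch: `expanding_window_folds` is a hypothetical helper written for illustration, not part of H2O, and it assumes the data has already been cut into time-ordered blocks numbered from 1.

```python
def expanding_window_folds(n_blocks):
    """Yield (train_blocks, test_block) pairs for walk-forward validation.

    Blocks are numbered 1..n_blocks in time order. Hypothetical helper,
    not an H2O function.
    """
    for k in range(1, n_blocks):
        # Train on every block up to k, test on the block that follows.
        yield list(range(1, k + 1)), k + 1


for i, (train, test) in enumerate(expanding_window_folds(6), start=1):
    print(f"fold {i} : training {train}, test [{test}]")
```

Each test block comes strictly after every block used for training, which is the property ordinary k-fold CV does not guarantee.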
3 Answers

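H2O algorithms have an option to use k-fold cross-validation. H2O does not yet support time-series (also known as "walk-forward" or "rolling") cross-validation, but there is an open ticket to implement it.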

If you would like to try it, there is an example, referenced here, of how to implement time-series CV manually with the h2o R package.
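
To make the idea of a manual loop concrete, here is a library-agnostic sketch in Python. It is not the referenced R example: `walk_forward_mae`, `fit_mean`, and `predict_mean` are hypothetical names, and the "model" is a toy that forecasts the mean of the training window. Any real model, H2O or otherwise, would be plugged in through the `fit` and `predict` callables.

```python
def walk_forward_mae(values, n_blocks, fit, predict):
    """Average the mean absolute error over expanding-window folds.

    values  : time-ordered list of numbers
    fit     : callable, fit(train) -> model
    predict : callable, predict(model, n) -> list of n forecasts
    Rows left over when len(values) is not divisible by n_blocks are ignored.
    """
    block = len(values) // n_blocks
    fold_maes = []
    for k in range(1, n_blocks):
        train = values[:k * block]                # everything before the test block
        test = values[k * block:(k + 1) * block]  # the next block in time
        model = fit(train)
        preds = predict(model, len(test))
        fold_maes.append(sum(abs(p - t) for p, t in zip(preds, test)) / len(test))
    return sum(fold_maes) / len(fold_maes)


def fit_mean(train):
    # Toy model: remember the mean of the training window.
    return sum(train) / len(train)


def predict_mean(model, n):
    # Forecast the stored mean for every future step.
    return [model] * n


series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(walk_forward_mae(series, n_blocks=4, fit=fit_mean, predict=predict_mean))
```

The model is refit from scratch on each fold, so no information from a test block leaks into the fit that is scored on it.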

I implemented it with scikit-learn's TimeSeriesSplit, as follows:

```python
import numpy as np
from h2o.estimators import H2ORandomForestEstimator
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# X is assumed to be an H2OFrame that is already sorted by time.
tscv = TimeSeriesSplit(n_splits=5)

Ycol = 'NumberOfSales'
Xcols = [c for c in X.names if c != Ycol]  # every column except the response
EVar, MAEar, MSEar, RMSEar, R2ar = [], [], [], [], []

# Split on row positions only; the splitter does not need the data itself.
for train_index, test_index in tscv.split(np.arange(X.nrows)):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Indices in each split are contiguous; +1 keeps the last row of the slice.
    train = X[int(train_index.min()):int(train_index.max()) + 1, :]
    test = X[int(test_index.min()):int(test_index.max()) + 1, :]
    print(train.nrows, test.nrows)  # Just to double check...
    # nfolds=0 turns off H2O's built-in k-fold CV; a fresh model is fit per split.
    forest = H2ORandomForestEstimator(nfolds=0)
    forest.train(x=Xcols, y=Ycol,
                 training_frame=train, validation_frame=test)
    y_true = test[Ycol].as_data_frame()
    y_pred = forest.predict(test[Xcols]).as_data_frame()
    mse = mean_squared_error(y_true, y_pred)
    EVar.append(explained_variance_score(y_true, y_pred))
    MAEar.append(mean_absolute_error(y_true, y_pred))
    MSEar.append(mse)
    RMSEar.append(np.sqrt(mse))
    R2ar.append(r2_score(y_true, y_pred))

# Average each metric over the splits.
EV = np.mean(EVar)
MAE = np.mean(MAEar)
MSE = np.mean(MSEar)
RMSE = np.mean(RMSEar)
R2 = np.mean(R2ar)
```

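Here is another way to cross-validate a time series that is worth sharing, especially since the question asks whether H2O can support time-series CV. With the help of the fold_column parameter, the existing h2o implementation can support a variant of time-series CV like the one shown below. Note that each fold also trains on blocks that come after the test block, so this is blocked k-fold rather than walk-forward validation.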

fold 1 : training [4 5 6 7 8 9], test [1 2 3]
fold 2 : training [1 2 3 7 8 9], test [4 5 6]
fold 3 : training [1 2 3 4 5 6], test [7 8 9]
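
The block layout above comes from giving every time-ordered row a fold id such that each id covers one contiguous stretch of rows; a column of such ids is what fold_column expects. A minimal Python sketch of that assignment follows. `blocked_fold_ids` is a hypothetical helper for illustration, not an H2O function, and the ids are 0-based.

```python
def blocked_fold_ids(n_rows, n_folds):
    """Assign each time-ordered row to one of n_folds contiguous blocks.

    Returns a list of 0-based fold ids, one per row. Hypothetical helper.
    """
    return [i * n_folds // n_rows for i in range(n_rows)]


fold_ids = blocked_fold_ids(9, 3)
for k in range(3):
    # Rows are reported 1-based to match the listing above.
    train = [i + 1 for i, f in enumerate(fold_ids) if f != k]
    test = [i + 1 for i, f in enumerate(fold_ids) if f == k]
    print(f"fold {k + 1} : training {train}, test {test}")
```

Because the ids are non-decreasing along the time axis, each held-out fold is a single window of consecutive observations rather than a random sample.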

Solution:

library(h2o)
h2o.init()

airquality$Year <- rep(2017,nrow(airquality))
airquality$Date <- as.Date(with(airquality,paste(Year,Month,Day,sep="-")),"%Y-%m-%d")

# Date is already a Date column, so the rows can be ordered on it directly
df <- as.h2o(airquality[order(airquality$Date),])

df <- h2o.na_omit(df)

# Number of folds
NFOLDS <- 10

# Assign fold number sequentially to a window in data
# (plain sort() is used so that no pipe package has to be loaded)
fold_numbers <- as.h2o(sort((1:nrow(df)) %% NFOLDS))

# This will assign fold number randomly
#fold_numbers <- h2o.kfold_column(df, nfolds = NFOLDS)

names(fold_numbers) <- "fold_numbers"

# set the predictor names and the response column name
predictors <- c("Solar.R", "Wind", "Temp", "Month", "Day")
response <- "Ozone"

# append the fold_numbers column to the dataset
df <- h2o.cbind(df, fold_numbers)

# try using the fold_column parameter:
airquality_gbm <- h2o.gbm(x = predictors, y = response, training_frame = df,
                    fold_column="fold_numbers", seed = 4)

# print the cross-validated RMSE (without xval = TRUE, h2o.rmse reports the training RMSE)
print(h2o.rmse(airquality_gbm, xval = TRUE))

Parts of the code are borrowed from link1 and link2.