XGBoost vs. Random Forest for time series regression forecasting

Cross Validated | regression | machine-learning | time-series | random-forest | boosting
2022-04-03 07:11:33

I am using the R implementations of XGBoost and random forest to generate 1-day-ahead revenue forecasts. I have roughly 200 rows and 50 predictors. (More data accumulates over time, so the number of rows keeps growing.)

The XGBoost model with the parameters below is about 6% worse in mean squared error than an off-the-shelf random forest model. In turn, the random forest model is more accurate than an autoregressive time series forecasting model. (I have not tried ARIMAX yet.)

For xgboost I tried changing eta to 0.02 and num_rounds to 8,000, but it now takes a very long time to run. Is there some kind of guide for improving the predictive accuracy of an xgboost model? Am I using the multicore functionality correctly?

I feel like I am fumbling in the dark for marginal gains. If it helps, I am running a Core i7 with 12 GB of RAM on Windows 7 Professional. Thanks for your help!

library(randomForest)
library(Matrix)
library(xgboost)

# Random forest with default settings
rf.mod  <- randomForest(act ~ ., data = train)
rf.pred <- predict(rf.mod, newdata = test)
#####################################
# XGBoost: build sparse model matrices (column 1 of train is the response, act)
train_x <- sparse.model.matrix(~., data = train[, 2:ncol(train)])
train_y <- train$act
test_x  <- sparse.model.matrix(~., data = test)

xgtrain <- xgb.DMatrix(data = train_x, label = train_y)
xgtest  <- xgb.DMatrix(data = test_x)

num_rounds <- 1000 

# Custom evaluation metric (NormalizedGini() is assumed to be defined elsewhere,
# e.g. from the MLmetrics package)
evalgini <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- NormalizedGini(as.numeric(labels), as.numeric(preds))
  return(list(metric = "Gini", value = err))
}

param <- list("objective" = "reg:linear",
              "eta" = 0.2,
              "min_child_weight" = 5,
              "subsample" = .8,
              "colsample_bytree" = .8,
              "scale_pos_weight" = 1.0,
              "max_depth" = 8)

xg.mod  <- xgb.train(params = param, data = xgtrain, feval = evalgini,
                     nround = num_rounds, print.every.n = num_rounds, maximize = TRUE)
xg.pred <- predict(xg.mod, xgtest)
1 Answer

The easiest way to handle "tuning" the num_rounds parameter is to let XGBoost do it for you. Set the early_stopping_rounds parameter (early.stop.round in the older R API used below) to n in xgb.train, and the model will stop training once the error has not improved for n rounds.

See this example from the Liberty Mutual Kaggle Competition.

As shown in the code below, you also need to use the watchlist parameter to enable early stopping.


# The readr library is the best way to read and write CSV files in R
library(readr)
library(xgboost)
library(data.table)
library(Matrix)
library(caret)

# The competition datafiles are in the directory ../input
# Read competition data files:
train <- read_csv("../input/train.csv")
test <- read_csv("../input/test.csv")


# keep copy of ID variables for test and train data
train_Id <- train$Id
test_Id <- test$Id

# response variable from training data
train_y <- train$Hazard

# predictor variables from training
train_x <- subset(train, select = -c(Id, Hazard))
train_x <- sparse.model.matrix(~., data = train_x)

# predictor variables from test
test_x <- subset(test, select = -c(Id))
test_x <- sparse.model.matrix(~., data = test_x)

# Set xgboost parameters
param <- list("objective" = "reg:linear",
              "eta" = 0.05,
              "min_child_weight" = 10,
              "subsample" = .8,
              "colsample_bytree" = .8,
              "scale_pos_weight" = 1.0,
              "max_depth" = 5)

# Using 5000 rows for early stopping. 
offset <- 5000
num_rounds <- 1000

# Set xgboost test and training and validation datasets
xgtest <- xgb.DMatrix(data = test_x)
xgtrain <- xgb.DMatrix(data = train_x[(offset + 1):nrow(train_x),], label= train_y[(offset + 1):nrow(train_x)])
xgval <-  xgb.DMatrix(data = train_x[1:offset,], label= train_y[1:offset])

# setup watchlist to enable train and validation, validation must be first for early stopping
watchlist <- list(val=xgval, train=xgtrain)
# to train with watchlist, use xgb.train, which contains more advanced features

# this will use default evaluation metric = rmse which we want to minimise
bst1 <- xgb.train(params = param, data = xgtrain, nround=num_rounds, print.every.n = 20, watchlist=watchlist, early.stop.round = 50, maximize = FALSE)
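Applied to the setup in your question, the same pattern would look roughly like the sketch below. This is untested and makes a few assumptions: your train data frame is ordered oldest-to-newest, act is the response in column 1, the 30-row validation split and the nthread value of 4 are purely illustrative, and the older early.stop.round / print.every.n argument names are used to match the example above.

library(xgboost)
library(Matrix)

# Hold out the most recent rows as a validation set
# (assumption: train is in chronological order and ~30 rows is enough)
n       <- nrow(train)
val_idx <- (n - 29):n
trn_idx <- 1:(n - 30)

train_x <- sparse.model.matrix(~., data = train[, 2:ncol(train)])
train_y <- train$act

xgtrain <- xgb.DMatrix(data = train_x[trn_idx, ], label = train_y[trn_idx])
xgval   <- xgb.DMatrix(data = train_x[val_idx, ], label = train_y[val_idx])

# Validation set must come first in the watchlist for early stopping
watchlist <- list(val = xgval, train = xgtrain)

# nthread controls how many cores xgboost uses (assumption: 4 cores available)
param$nthread <- 4

xg.mod <- xgb.train(params    = param,       # same param list as in the question
                    data      = xgtrain,
                    nround    = 10000,       # generous upper bound; early stopping picks the effective number
                    watchlist = watchlist,
                    early.stop.round = 50,   # stop after 50 rounds with no improvement on val
                    print.every.n    = 20,
                    maximize  = FALSE)       # default metric is RMSE, which we want to minimise

xg.pred <- predict(xg.mod, xgb.DMatrix(data = sparse.model.matrix(~., data = test)))

With early stopping in place you can lower eta (e.g. to the 0.02 you tried) and let the validation error decide the number of rounds, instead of fixing num_rounds by hand.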