Using Latin hypercube sampling (LHS) to select hyperparameters from a large matrix/grid of parameter combinations

data-mining r xgboost hyperparameter
2022-02-17 02:34:53

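I have a matrix in which each row corresponds to one combination of hyperparameters for an XGBoost model. There are seven XGBoost parameters to tune (shown below: nrounds/iterations, max_depth, eta, gamma, colsample_bytree, min_child_weight and subsample). I reviewed the literature to decide on the range and step size for each parameter. With these ranges and steps, the parameter space contains 62,500 combinations. I am using the R caret::train function to find the best hyperparameter combination for my dataset, but 62,500 model fits is far too many. I have read about Latin hypercube sampling (LHS), and I think it is what I need: use LHS for an initial selection of hyperparameters and so cut down the number of fits. However, I have not managed to apply the method to my data. My goal is to use LHS to generate a manageable number of hyperparameter combinations (about 500) and then use caret::train to pick the best one. I would appreciate help implementing LHS on my parameter space.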

nrounds <- seq(from = 200, to = 1000, by = 200) 
maxdepth <- seq(from = 2, to = 10, by = 2)
eta <- c(0.01, 0.05, 0.1, 0.2, 0.3)
gamma <- seq(from = 0, to = 0.4, by = 0.1)
colsample_bytree <- seq(from = 0.4, to = 1, by = 0.2)
min_child_weight <- seq(from = 1, to = 5, by = 1)
subsample <- seq(from = 0.6, to = 1, by = 0.1)
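# Name the columns so they match the tuning parameters that caret's "xgbTree" method expects
dataGrid <- expand.grid(nrounds = nrounds, max_depth = maxdepth, eta = eta, gamma = gamma,
                        colsample_bytree = colsample_bytree,
                        min_child_weight = min_child_weight, subsample = subsample)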

2 Answers

library(tidymodels)

# trans = NULL keeps each range on its natural (untransformed) scale
xgboost_set <- parameters(list(
  learn_rate(range = c(0.01, 0.3), trans = NULL),    # eta in xgboost
  trees(range = c(200L, 1000L), trans = NULL),       # number of trees; for a boosted ensemble this equals the number of boosting iterations
  loss_reduction(range = c(0, 0.4), trans = NULL),   # this corresponds to gamma in xgboost
  tree_depth(range = c(2L, 10L), trans = NULL),      # max_depth in xgboost
  min_n(range = c(1L, 5L), trans = NULL),            # assumed to be the same as min_child_weight in boosted trees
  sample_prop(range = c(0.4, 1), trans = NULL)       # assumed to be the same as subsample in boosted trees
))

# regularization_factor(range = c(0, 0.4), trans = NULL),

# Build three grids of 200 points each so the designs can be compared
set.seed(463)
me_grid <- grid_max_entropy(xgboost_set, size = 200) %>% mutate(type = "max entropy")
ls_grid <- grid_latin_hypercube(xgboost_set, size = 200) %>% mutate(type = "latin hypercube")
rn_grid <- grid_random(xgboost_set, size = 200) %>% mutate(type = "random")
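
Since the question tunes with caret::train, a grid like the ones above would still need its columns renamed to the names caret uses for its "xgbTree" method. The following is only a rough sketch of that step in base R. It assumes the grid columns carry the dials default names (learn_rate, trees, and so on), it uses a small made-up stand-in data frame rather than a real grid_latin_hypercube() result, and the names to_caret and caret_grid as well as the fixed value 0.8 for colsample_bytree are purely illustrative.

```r
# Stand-in for a grid produced by grid_latin_hypercube(); the column names are
# assumed to be the dials defaults, and the values are made up for illustration.
ls_grid <- data.frame(
  learn_rate     = c(0.05, 0.21),
  trees          = c(350L, 820L),
  loss_reduction = c(0.12, 0.33),
  tree_depth     = c(4L, 9L),
  min_n          = c(2L, 5L),
  sample_prop    = c(0.55, 0.90),
  type           = "latin hypercube"
)

# Hypothetical lookup: dials column name -> caret "xgbTree" tuning parameter name.
to_caret <- c(
  trees          = "nrounds",
  tree_depth     = "max_depth",
  learn_rate     = "eta",
  loss_reduction = "gamma",
  min_n          = "min_child_weight",
  sample_prop    = "subsample"
)

# Keep only the tuning columns (this drops the "type" label) and rename them.
caret_grid <- ls_grid[, names(to_caret)]
names(caret_grid) <- unname(to_caret)

# colsample_bytree is not in the parameter set above, so it is held fixed here
# (an arbitrary value chosen only for this sketch).
caret_grid$colsample_bytree <- 0.8

print(caret_grid)
```

A data frame shaped like caret_grid could then be supplied through the tuneGrid argument of caret::train with method = "xgbTree".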

Thanks.

The dials package from tidymodels has a grid_latin_hypercube function that can be used for this: https://dials.tidymodels.org/reference/grid_max_entropy.html
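
For the discrete levels listed in the question, a Latin hypercube design can also be sketched in base R without dials. This is a minimal illustration rather than what grid_latin_hypercube does internally: each parameter gets its own stratified uniform column (one draw per stratum, strata shuffled), and each draw is then mapped onto that parameter's list of levels. The names levels_list, lhs_column, lhs_raw and lhs_grid are hypothetical, and n = 500 follows the target stated in the question.

```r
set.seed(463)

# Candidate levels copied from the question, named after caret's "xgbTree" parameters.
levels_list <- list(
  nrounds          = seq(from = 200, to = 1000, by = 200),
  max_depth        = seq(from = 2, to = 10, by = 2),
  eta              = c(0.01, 0.05, 0.1, 0.2, 0.3),
  gamma            = seq(from = 0, to = 0.4, by = 0.1),
  colsample_bytree = seq(from = 0.4, to = 1, by = 0.2),
  min_child_weight = seq(from = 1, to = 5, by = 1),
  subsample        = seq(from = 0.6, to = 1, by = 0.1)
)

n <- 500  # number of combinations to draw

# One Latin hypercube column: split (0, 1) into n equal strata, draw one point
# inside each stratum, and visit the strata in random order.
lhs_column <- function(n) (sample(n) - runif(n)) / n

# Map every uniform draw onto one of the discrete levels of its parameter.
lhs_raw <- as.data.frame(lapply(levels_list, function(lv) {
  u <- lhs_column(n)
  lv[ceiling(u * length(lv))]
}))

# Different rows can land on the same combination, so drop exact duplicates.
lhs_grid <- unique(lhs_raw)

print(nrow(lhs_grid))
print(head(lhs_grid))
```

Because 500 is divisible by the number of levels of every parameter, each level appears equally often in each column of lhs_raw, which is the marginal balance that LHS is meant to give. lhs_grid could then be tried as the tuneGrid argument of caret::train with method = "xgbTree".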