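Data mining - How to minimize the features of a trained model? - 吾爱随笔录

How to minimize the features of a trained model?

Tags: data mining, recommender systems, xgboost, optimization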

2022-03-01 05:13:47

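I have a real technical process that is described by a complex model (xgboost): the current quality of the product (y) depends on the current temperature (x1), pressure (x2), and so on. I want to solve an optimization task: which feature values, as small as possible, can be chosen so that the quality of the product reaches its maximum? It looks like a simple optimization task: minimize ||y - y0||^2, where y is the model equation of the process and y0 is the maximum value or some value close to it. But it is not possible to get weight coefficients out of xgboost, so I cannot use skopt, and even if I could get the coefficients, the real equation would be very complicated. The only approach I have right now is to enumerate all possible values of all possible features, run predictions on them, and pick the combination for which y reaches the maximum or comes close to it. Can you give me some advice?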
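The exhaustive approach described in the question can be sketched as follows. This is only an illustration under stated assumptions: `predict_quality` is a hypothetical toy stand-in for the trained model's prediction, and the two feature ranges and grid resolutions are made up for the example.

```python
import itertools

# Hypothetical stand-in for the trained model's prediction; in practice this
# would wrap a call to the XGBoost model's predict method.
def predict_quality(temperature, pressure):
    # Toy response surface with a single peak at (32.0, 2.0).
    return 100.0 - (temperature - 32.0) ** 2 - 10.0 * (pressure - 2.0) ** 2

def linspace(low, high, steps):
    """Return `steps` evenly spaced values from low to high, inclusive."""
    return [low + (high - low) * i / (steps - 1) for i in range(steps)]

def grid_search(predict, grids):
    """Evaluate `predict` on every grid combination and return the best one."""
    best_point, best_value = None, float("-inf")
    for point in itertools.product(*grids):
        value = predict(*point)
        if value > best_value:
            best_point, best_value = point, value
    return best_point, best_value

# Assumed ranges and resolutions for two continuous features.
temperature_grid = linspace(25.0, 40.0, 16)  # step of 1.0
pressure_grid = linspace(0.05, 4.0, 80)      # step of 0.05

best_point, best_value = grid_search(predict_quality, [temperature_grid, pressure_grid])
print(best_point, best_value)
```

The number of model evaluations is the product of the grid sizes (16 * 80 = 1280 here), so the cost multiplies with every additional feature, which is why this approach quickly becomes impractical.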

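1 Answer

There are several algorithms that can help you do this in a smart way.

These algorithms are usually used to tune a model's hyperparameters, so that is the context in which you will find them in tutorials and examples. In your case you are looking for a good set of feature values rather than a good set of hyperparameters, but the principle is the same.

My suggestions: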

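1) SMAC. This is based on Bayesian optimization. It is an iterative procedure in which a surrogate function is built and maximized:

The function to optimize (your XGBoost model) is evaluated at the points, in the hyperspace of the features, where the optimizer thinks a maximum can be found (or, in the first iteration, at the points given by the user);
The result is added to the set of all evaluated points, and that set is used to build the surrogate function;
The surrogate function is maximized, and the coordinates of that maximum are taken as the location where the original function is also expected to have its maximum.

These three steps can be repeated as many times as needed, each time starting again from the first step.

It works with both continuous and categorical features, and you can also impose constraints between features.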
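To make the three-step loop concrete, here is a minimal, dependency-free sketch of surrogate-based optimization. It is not how SMAC is implemented: the surrogate here is a simple inverse-distance-weighted mean with a distance-based exploration bonus, chosen only to keep the example short. The names `predict_quality`, `BOUNDS`, `surrogate` and the `exploration` weight are all hypothetical.

```python
import math
import random

BOUNDS = [(25.0, 40.0), (0.05, 4.0)]  # assumed ranges for two continuous features

def predict_quality(x):
    # Hypothetical stand-in for the trained model's prediction.
    temperature, pressure = x
    return 100.0 - (temperature - 32.0) ** 2 - 10.0 * (pressure - 2.0) ** 2

def random_point(rng):
    return tuple(rng.uniform(low, high) for low, high in BOUNDS)

def scaled_distance(a, b):
    # Euclidean distance after scaling each feature range to [0, 1].
    return math.sqrt(sum(((ai - bi) / (high - low)) ** 2
                         for ai, bi, (low, high) in zip(a, b, BOUNDS)))

def surrogate(x, observations, exploration):
    """Inverse-distance-weighted mean of observed values plus an exploration bonus."""
    weighted_sum, weight_total, nearest = 0.0, 0.0, float("inf")
    for point, value in observations:
        d = scaled_distance(x, point)
        if d < 1e-12:
            return value  # x coincides with an already evaluated point
        nearest = min(nearest, d)
        weight = 1.0 / d ** 2
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total + exploration * nearest

def optimize(objective, n_initial=5, n_iterations=40, n_candidates=500, seed=0):
    rng = random.Random(seed)
    # First iteration: evaluate the objective at given (here: random) points.
    observations = [(p, objective(p)) for p in (random_point(rng) for _ in range(n_initial))]
    for _ in range(n_iterations):
        # Maximize the surrogate, approximately, over random candidate points.
        candidates = [random_point(rng) for _ in range(n_candidates)]
        proposal = max(candidates, key=lambda c: surrogate(c, observations, 50.0))
        # Evaluate the real objective there and add the result to the evaluated set.
        observations.append((proposal, objective(proposal)))
    return max(observations, key=lambda obs: obs[1]), observations

(best_point, best_value), history = optimize(predict_quality)
print(best_point, best_value)
```

The real objective is called only `n_initial + n_iterations` times; all other work happens on the cheap surrogate. That is the point of the approach when each evaluation of the true function is expensive or when the search space is too large to enumerate.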

Here is an example for your case, in Python (code not tested):

# Written against the older SMAC3 0.x API; newer SMAC releases have reorganized these modules.
import numpy as np
from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter, UniformIntegerHyperparameter
from smac.scenario.scenario import Scenario
from smac.facade.smac_facade import SMAC

# a continuous feature that you know has to lie in the [25, 40] range
cont_feat = UniformFloatHyperparameter("a_cont_feature", 25., 40., default_value=35.)

# another continuous feature, [0.05, 4] range
cont_feat2 = UniformFloatHyperparameter("another_cont_feature", 0.05, 4., default_value=1.)

# a binary feature
bin_feat = UniformIntegerHyperparameter("a_bin_feature", 0, 1, default_value=1)

# the configuration space in which to search for the maximum
cs = ConfigurationSpace()
cs.add_hyperparameters([cont_feat, cont_feat2, bin_feat])

# Scenario object
scenario = Scenario({"run_obj": "quality",    # we optimize quality
                     "runcount-limit": 1000,  # maximum function evaluations
                     "cs": cs,                # the configuration space
                     "cutoff_time": None
                     })

# Be careful here! The features must be in the same order as the columns the XGB model was trained on
FEATURE_ORDER = ["a_cont_feature", "another_cont_feature", "a_bin_feature"]

# here we include the XGBoost model
def f_to_opt(cfg):
    # `model` is assumed to be your trained model with a scikit-learn style predict()
    # (e.g. xgboost.XGBRegressor); build one row, keeping features whose value is 0
    features = np.array([[cfg[name] for name in FEATURE_ORDER]])
    prediction = model.predict(features)[0]

    # SMAC minimizes the returned cost, so negate the prediction to maximize quality
    return -float(prediction)


smac = SMAC(scenario=scenario, rng=np.random.RandomState(42),
            tae_runner=f_to_opt)
opt_feat_set = smac.optimize()

# the set of features which maximizes the output
print(opt_feat_set)

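2) dlib optimization. This converges much faster than the previous method. As a disclaimer, I have to say that this algorithm is, in principle, only applicable to functions that satisfy certain conditions, and an XGBoost model, viewed as a function, does not satisfy them. In practice, however, the procedure turns out to work for less well-behaved functions too, at least in the cases I have tried. So you may want to give it a try as well.

Example code: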

import dlib
import numpy as np

# here we include the XGBoost model. Note that we cannot use categorical/integer/binary features
def f_to_opt(cont_feat, cont_feat2):
    # `model` is assumed to be your trained model; predict() expects a 2D array (one row per sample)
    return float(model.predict(np.array([[cont_feat, cont_feat2]]))[0])


x, y = dlib.find_max_global(f_to_opt,
                            [25, 0.05],  # Lower bound constraints on cont_feat and cont_feat2 respectively
                            [40, 4],     # Upper bound constraints on cont_feat and cont_feat2 respectively
                            1000)        # The number of times find_max_global() will call f_to_opt
