@DennisSoemers has a great solution. I'll add two similar, more explicit solutions based on *Feature Engineering and Selection: A Practical Approach for Predictive Models* by Max Kuhn and Kjell Johnson. Kuhn uses the term `resample` to describe a `fold` of the dataset, but the dominant term on StackExchange seems to be `fold`, so I will use the term `fold` below.
Option 1 - Nested search

If computing power is not a limiting factor, a nested validation approach is recommended, in which there are 3 levels of nesting:

1) the external folds, each with a different feature subset

2) the internal folds, each with a hyperparameter search

3) the internal folds of each hyperparameter search, each evaluating a different hyperparameter set.

Here is the algorithm:
-> Split data into train and test sets.
-> For each external fold of train set:
-> Select feature subset.
-> Split into external train and test sets.
-> For each internal fold of external train set:
-> Split into internal train and test sets.
-> Perform hyperparameter tuning on the internal train set. Note that this
step is another level of nesting in which the internal train set is split
into multiple folds and different hyperparameter sets are trained and tested on
different folds.
-> Examine the performance of the best hyperparameter tuned model
from each of the inner test folds. If performance is consistent, redo
the internal hyperparameter tuning step on the entire external train set.
-> Test the model with the best hyperparameter set on the external test set.
-> Choose the feature set with the best external test score.
-> Retrain the model on all of the training data using the best feature set
and best hyperparameters for that feature set.
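The algorithm above can be sketched in Python. This is a minimal illustration, not the book's implementation: it assumes scikit-learn, uses random feature subsets for the external folds (one of several subset-selection techniques), and a toy `C` grid for logistic regression. All dataset and grid choices are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Toy data; split into train and test sets.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical random feature subsets, one per external search.
rng = np.random.default_rng(0)
subsets = [sorted(rng.choice(10, size=5, replace=False)) for _ in range(3)]

results = {}
for subset in subsets:
    Xs = X_train[:, subset]
    outer = KFold(n_splits=3, shuffle=True, random_state=0)
    scores = []
    for tr, te in outer.split(Xs):
        # Inner hyperparameter search; GridSearchCV itself runs another
        # level of nesting (3-fold CV) over the inner train set.
        search = GridSearchCV(LogisticRegression(max_iter=1000),
                              {"C": [0.1, 1.0, 10.0]}, cv=3)
        search.fit(Xs[tr], y_train[tr])
        # Score the best tuned model on the external test fold.
        scores.append(search.score(Xs[te], y_train[te]))
    results[tuple(subset)] = np.mean(scores)

# Choose the feature set with the best external test score, then
# retrain on all training data with that subset.
best_subset = max(results, key=results.get)
final = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": [0.1, 1.0, 10.0]}, cv=3)
final.fit(X_train[:, list(best_subset)], y_train)
test_score = final.score(X_test[:, list(best_subset)], y_test)
print(best_subset, test_score)
```

Note that the "if performance is consistent" check from the algorithm is omitted here for brevity; in practice you would inspect the spread of the inner-fold scores before retraining.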
The image from Chapter 11.2: Simple Filters suggests that the `-> Select feature subset` step is random, but there are other techniques, which are outlined in Chapter 11 of the book.
To clarify the `-> Perform hyperparameter tuning` step, you can read about the recommended approach of nested cross validation. The idea is to test the robustness of a training process by repeatedly performing the training and testing process on different folds of the data and looking at the average of the test results.
Option 2 - Separate hyperparameter and feature selection searches
-> Split data into hyperparameter_train, feature_selection_train, and test sets.
-> Select a reasonable subset of features using expert knowledge.
-> Perform nested cross validation with the initial features and the
hyperparameter_train set to find the best hyperparameters as outlined in option 1.
-> Use the best hyperparameters and the feature_selection_train set to find
the best set of features. Again, this process could be nested cross
validation or not, depending on the computational cost that it would take
and the cost that is tolerable.
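The first step of Option 2, the three-way split, can be done with two calls to scikit-learn's `train_test_split`; the split proportions below are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# First carve off the final test set, then divide the remainder into
# the hyperparameter_train and feature_selection_train sets.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_hp, X_fs, y_hp, y_fs = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_hp), len(X_fs), len(X_test))  # 200 200 100
```

The hyperparameter search then runs only on `X_hp` and the feature search only on `X_fs`, so neither search leaks information into the other or into the final test set.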
Here is how Kuhn and Johnson put this process:

When combining global search methods with models that have tuning parameters, we recommend that, when possible, the feature set first be winnowed down using expert knowledge about the problem. Next, it is important to identify a reasonable range of tuning parameter values. If a sufficient number of samples are available, a proportion of them can be split off and used to find a range of potentially good parameter values using all of the features. The tuning parameter values may not be the perfect choice for feature subsets, but they should be reasonably effective for finding an optimal subset.

Chapter 12.5: Global Search Methods