How to pass parameters between stages of a sklearn Pipeline?

data-mining  Python  scikit-learn  hyperparameter-tuning  grid-search  pipeline
2021-10-11 14:13:39

I am working on a deep neural model for text classification with Keras. To fine-tune some of the hyperparameters I am using the Keras Wrappers for the Scikit-Learn API, so I built a sklearn Pipeline for it:
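For reference, the snippets below assume the usual imports (standalone Keras shown here; on TensorFlow 2.x the same classes live under tensorflow.keras):

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV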

def create_model(optimizer="adam", nbr_features=100):
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(nbr_features,)))
    ...
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=["accuracy"])
    return model

estimator = Pipeline([("tfidf", TfidfVectorizer()),
                      ('norm', StandardScaler(with_mean=False)),
                      ("km", KerasClassifier(build_fn=create_model, verbose=1))])
grid_params = {
     'tfidf__max_df': (0.1, 0.25, 0.5, 0.75, 1.0),
     'tfidf__max_features': (100, 500, 1000, 5000,),
      ... }

gs = GridSearchCV(estimator,
                  grid_params,
                  ...)

I want to pass the max_features parameter of the tfidf stage to the km stage as nbr_features. Is there any hack/workaround to make that possible?

1 Answer

I figured out how to do it by monkey-patching the ParameterGrid.__iter__ and GridSearchCV._run_search methods.

ParameterGrid.__iter__ iterates over all possible combinations of hyperparameters (dicts of parameter name: value). So I modified what it yields (params, one configuration of hyperparameters) by adding a 'km__nbr_features' entry equal to 'tfidf__max_features':

params["km__nbr_features"] = params['tfidf__max_features']

Important note: 'km__nbr_features' must be absent from grid_params for the trick to work.
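For example, a grid like the following (hypothetical values) plays nicely with the patch, because it leaves 'km__nbr_features' out:

grid_params = {
    'tfidf__max_df': (0.25, 0.5, 1.0),
    'tfidf__max_features': (100, 500, 1000),
    'km__optimizer': ('adam', 'rmsprop'),  # other km__ parameters are fine; just never km__nbr_features
}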

Here is some code:

from sklearn.model_selection import GridSearchCV, ParameterGrid
import numpy as np
from itertools import product

def patch_params(params):
    # Mirror tfidf__max_features into the KerasClassifier's nbr_features
    params["km__nbr_features"] = params['tfidf__max_features']
    return params

def monkey_iter__(self):
    """Iterate over the points in the grid.

    Returns
    -------
    params : iterator over dict of string to any
        Yields dictionaries mapping each estimator parameter to one of its
        allowed values.
    """
    for p in self.param_grid:
        # Always sort the keys of a dictionary, for reproducibility
        items = sorted(p.items())
        if not items:
            yield {}
        else:
            keys, values = zip(*items)
            for v in product(*values):
                params = dict(zip(keys, v))
                yield patch_params(params)


# replacing address of "__iter__" with "monkey_iter__"
ParameterGrid.__iter__  = monkey_iter__

def monkey_run_search(self, evaluate_candidates):
    """Search all candidates in param_grid"""
    evaluate_candidates(ParameterGrid(self.param_grid))

# replacing address of "_run_search" with "monkey_run_search"
GridSearchCV._run_search = monkey_run_search
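For a quick sanity check (a minimal sketch, assuming the patches above have been applied), iterating a small ParameterGrid now shows the injected key on every candidate:

# every candidate yielded by the patched __iter__ carries km__nbr_features
toy_grid = {'tfidf__max_df': (0.5, 1.0), 'tfidf__max_features': (100, 500)}
for candidate in ParameterGrid(toy_grid):
    print(candidate)
# e.g. {'tfidf__max_df': 0.5, 'tfidf__max_features': 100, 'km__nbr_features': 100}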

Then I just performed the grid search as usual:

def create_model(optimizer="adam", nbr_features=100):
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(nbr_features,)))
    ...
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=["accuracy"])
    return model

estimator = Pipeline([("tfidf", TfidfVectorizer()),
                      ('norm', StandardScaler(with_mean=False)),
                      ("km", KerasClassifier(build_fn=create_model, verbose=1))])
grid_params = {
     'tfidf__max_df': (0.1, 0.25, 0.5, 0.75, 1.0),
     'tfidf__max_features': (100, 500, 1000, 5000,),
      ... }

# Performing Grid Search
gs = GridSearchCV(estimator,
                  grid_params,
                  ...)
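From here fitting works as usual; a minimal sketch, where texts and labels are placeholders for your own training data:

# texts: list of raw documents, labels: binary targets (placeholders)
gs.fit(texts, labels)
print(gs.best_params_)   # the best candidate also reports the injected km__nbr_features
print(gs.best_score_)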

Update: if you use RandomizedSearchCV, you have to monkey-patch ParameterGrid.__getitem__ instead:

def monkey_getitem__(self, ind):
    """Get the parameters that would be ``ind``th in iteration
    Parameters
    ----------
    ind : int
        The iteration index
    Returns
    -------
    params : dict of string to any
        Equal to list(self)[ind]
    """
    # This is used to make discrete sampling without replacement memory
    # efficient.
    for sub_grid in self.param_grid:
        # XXX: could memoize information used here
        if not sub_grid:
            if ind == 0:
                return {}
            else:
                ind -= 1
                continue

        # Reverse so most frequent cycling parameter comes first
        keys, values_lists = zip(*sorted(sub_grid.items())[::-1])
        sizes = [len(v_list) for v_list in values_lists]
        total = np.prod(sizes)

        if ind >= total:
            # Try the next grid
            ind -= total
        else:
            out = {}
            for key, v_list, n in zip(keys, values_lists, sizes):
                ind, offset = divmod(ind, n)
                out[key] = v_list[offset]
            return patch_params(out)

    raise IndexError('ParameterGrid index out of range')

ParameterGrid.__getitem__ = monkey_getitem__
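Same sanity check as before, this time through indexing (a minimal sketch, assuming the __getitem__ patch above is in place):

# the patched __getitem__ injects km__nbr_features as well
toy_grid = ParameterGrid({'tfidf__max_features': [100, 500, 1000]})
print(toy_grid[2])
# e.g. {'tfidf__max_features': 1000, 'km__nbr_features': 1000}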