
Custom functions and pipelines

data-mining machine-learning python cross-validation transformer pipeline

2022-03-11 10:22:29

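I'm not very used to working with pipelines, so I'd like to know how to use a custom function inside one.

Situation: I want to fill some missing values with a group-level average, where the groups are defined by other features (the function below uses the median). That is why I use this custom function: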

import numpy as np

def replaceNullFromGroup(From, To, variable, by):

    # 1. Create aggregation from train dataset
    From_grp = From.groupby(by)[variable].median().reset_index()

    # 2. Merge dataframes (the left join keeps every row of To)
    To_merged = To.merge(From_grp, on=by, suffixes=('_test', '_train'), how="left")

    # 3. Pair each original column with its aggregated counterpart
    to_cols = [col for col in To_merged.columns if col.endswith('_test')]
    from_cols = [col for col in To_merged.columns if col.endswith('_train')]
    dict_cols = dict(zip(to_cols, from_cols))

    # 4. Replace null values
    for to_col, from_col in dict_cols.items():
        To_merged[to_col] = np.where(To_merged[to_col].isnull(),
                                     To_merged[from_col],
                                     To_merged[to_col])

    # 5. Clean up dataframe: drop every helper column, not only the
    #    last one left over from the loop
    To_merged.drop(columns=from_cols, inplace=True)
    To_merged.columns = To_merged.columns.str.replace('_test', '')
    return To_merged

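Meaning of the arguments:

From: the dataframe I take the information from (the training dataset)
To: the dataframe whose missing values I will fill (the training and test datasets)
variable: the variable with missing values
by: the variable I use for grouping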

Can I use this function in a pipeline, so that I avoid data leakage when using cross-validation?

Thank you very much.

1 Answer

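To include this logic in a pipeline, you have to create a custom transformer. You need to ask yourself:

[INIT] Does my logic have any parameters?
- The variable you want to impute and the category you want the imputation to be based on.
[FIT] Which part of the logic computes what the transformation will be?
- Computing the per-group statistic (a mean or a median) and storing it so it can be used later in the transformation.
[TRANSFORM] Given the parameters (from INIT) and the fitted state (from FIT), which part of the logic transforms the data?
- Looking up the stored statistic for each row's group (like accessing a specific key in a dictionary) and filling the missing values with it.

Here is an example: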

from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variable, by):
        # Storing the passed parameters as attributes makes them
        # available in the other methods of the class.
        # Note: __init__ must not return anything.
        self.variable = variable
        self.by = by

    def fit(self, X, y=None):
        # self.map_ is the lookup table (like a dict) from each group
        # to the value used for imputation. The trailing underscore is
        # the scikit-learn convention for attributes learned in fit.
        self.map_ = X.groupby(self.by)[self.variable].mean()
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's dataframe is left untouched
        X = X.copy()
        # Where the variable is missing, replace it with the value that
        # the "by" column maps to in the table created in fit (self.map_)
        X[self.variable] = X[self.variable].fillna(X[self.by].map(self.map_))
        return X
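To see the fit/transform split in action, here is a small sketch on toy data. The class name GroupMeanImputer, the column names and the values are made up for illustration. It follows the same group-mean idea as the transformer above and only shows that the values used to fill the test frame come from the frame passed to fit.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical name: fills `variable` with the mean of its `by` group."""

    def __init__(self, variable, by):
        self.variable = variable
        self.by = by

    def fit(self, X, y=None):
        # One mean per group, learned only from the data passed to fit
        self.map_ = X.groupby(self.by)[self.variable].mean()
        return self

    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's dataframe
        X[self.variable] = X[self.variable].fillna(X[self.by].map(self.map_))
        return X


# Toy data (assumed for illustration): group "a" has mean 15, group "b" has mean 100
train = pd.DataFrame({"city": ["a", "a", "b", "b"],
                      "income": [10.0, 20.0, 100.0, np.nan]})
test = pd.DataFrame({"city": ["a", "b"],
                     "income": [np.nan, np.nan]})

imputer = GroupMeanImputer(variable="income", by="city").fit(train)
filled = imputer.transform(test)
print(filled)
```

The test frame has no known incomes at all, so every filled value can only have come from the training frame.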

Now it can be included in any pipeline:

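# Minimal example; you could also include this imputer in a
# ColumnTransformer to apply it several times.
# Assumes X_train, y_train and X_test already exist and that every
# column is numeric once the missing values are filled.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[('myImputer', CustomImputer('variabletofill',
                                                       'based_on_variable')),
                           ('model', LinearRegression())])

y_pred = pipeline.fit(X_train, y_train).predict(X_test)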
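Since the question is about cross-validation, here is a hedged sketch of that case. cross_val_score clones the pipeline and refits it on the training part of every fold, so the group means are recomputed without the held-out rows. The class name GroupMeanImputer, the synthetic data and the column names are assumptions for illustration, and the data is built so that every group appears in every training fold.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline


class GroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical name: fills `variable` with the mean of its `by` group."""

    def __init__(self, variable, by):
        self.variable = variable
        self.by = by

    def fit(self, X, y=None):
        self.map_ = X.groupby(self.by)[self.variable].mean()
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X[self.variable] = X[self.variable].fillna(X[self.by].map(self.map_))
        return X


# Synthetic data (assumed): 3 numeric groups, 20 rows each
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({"group": np.tile([0, 1, 2], n // 3),
                  "value": rng.normal(size=n)})
y = 2.0 * X["value"] + X["group"] + rng.normal(scale=0.1, size=n)

# Knock out every 7th value after the target has been built
X.loc[X.index % 7 == 0, "value"] = np.nan

pipeline = Pipeline(steps=[("imputer", GroupMeanImputer("value", "group")),
                           ("model", LinearRegression())])

# Each of the 5 fits learns its group means from the 4 training folds only
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print(scores)
```

Calling the original function once on the full dataset before cross-validating would let the held-out rows influence the group statistics. Keeping the imputer as a pipeline step avoids that.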

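As you can see, the mapping is computed from the training data only and is then reused to impute missing values in whatever data comes later, so it is safe against data leakage. This is a good article explaining how to create custom transformers.

Hope this helps.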
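One caveat with a training-only mapping: a group that shows up at prediction time but was never seen during fit has no entry in the map, so its missing values stay missing. A possible workaround, sketched below, is to also learn the overall training mean and use it as a fallback. This is an assumption on my side, not part of the transformer above, and the class name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMeanImputerWithFallback(BaseEstimator, TransformerMixin):
    """Hypothetical variant: group mean first, overall training mean otherwise."""

    def __init__(self, variable, by):
        self.variable = variable
        self.by = by

    def fit(self, X, y=None):
        # Both statistics are learned from the training data only
        self.map_ = X.groupby(self.by)[self.variable].mean()
        self.fallback_ = X[self.variable].mean()
        return self

    def transform(self, X, y=None):
        X = X.copy()
        group_fill = X[self.by].map(self.map_)  # NaN for unseen groups
        X[self.variable] = (X[self.variable]
                            .fillna(group_fill)
                            .fillna(self.fallback_))
        return X


# Toy data (assumed): city "c" never appears in the training frame
train = pd.DataFrame({"city": ["a", "a", "b"],
                      "income": [10.0, 20.0, 60.0]})
test = pd.DataFrame({"city": ["a", "c"],
                     "income": [np.nan, np.nan]})

imputer = GroupMeanImputerWithFallback("income", "city").fit(train)
filled = imputer.transform(test)
print(filled)
```

Whether the overall mean is a sensible fallback depends on the data. The point is only that the fallback is also learned in fit, so it stays leakage-free.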
