One-hot label encoding in scikit-learn: converting back to a DataFrame

data-mining scikit-learn preprocessing

2022-03-12 03:38:13

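I have a DataFrame with 4 features and 1 target. Of the 4 features, 3 are categorical and 1 is numerical.

I created X, a new DataFrame holding the 3 categorical features. I applied one-hot label encoding to it, but the result is now a numpy array. Why?

Should I convert it back to a DataFrame? Why or why not?

What is the best practice for now merging X with my 1 numerical feature?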

3 Answers

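Should I convert it back to a DataFrame? Why or why not?

If you have a specific requirement, such as saving the data to a file or running operations that work better on a DataFrame, then converting it back to a DataFrame is a good option. Otherwise a numpy array should be fine, since the scikit-learn algorithms accept numpy arrays as input anyway.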

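What is the best practice for now merging X with my 1 numerical feature?

I can share my own experience and what exactly I did.

Set the categorical features aside, drop them from the data, and move the remaining features into a numpy array.

Convert the categorical features to a one-hot encoding.

Concatenate the one-hot encoded numpy array with the remaining features, and use the resulting array for model training.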
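The three steps above can be sketched as follows. The column names and values are made up for illustration, and `np.hstack` is one reasonable way to do the concatenation; the answer does not say which function was actually used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 3 categorical features, 1 numerical feature, 1 target.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "L", "M"],
    "shape": ["round", "square", "round", "square"],
    "weight": [1.5, 2.0, 0.7, 3.1],
    "target": [0, 1, 0, 1],
})
categorical_cols = ["color", "size", "shape"]

# Step 1: set the categorical features aside and move the
# remaining (numerical) features into a numpy array.
X_cat = df[categorical_cols]
X_num = df.drop(columns=categorical_cols + ["target"]).to_numpy()
y = df["target"].to_numpy()

# Step 2: one-hot encode the categorical features
# (the default output is sparse, so densify it).
X_cat_encoded = OneHotEncoder().fit_transform(X_cat).toarray()

# Step 3: concatenate column-wise; this array is what the model is trained on.
X_train = np.hstack([X_num, X_cat_encoded])
print(X_train.shape)
```

Both arrays must have the same number of rows in the same order, which holds here because they are derived from the same DataFrame without reordering.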

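Convert to a DataFrame and merge
Create a new df using the (encoded) array and the column with continuous values as input
Perform the one-hot encoding directly on the DataFrame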
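For the last option, one way to encode without leaving pandas is `pd.get_dummies`. The answer does not name a function, so this choice is an assumption, and the columns below are hypothetical.

```python
import pandas as pd

# Hypothetical data: two categorical columns and one continuous column.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
    "weight": [1.5, 2.0, 0.7],
})

# get_dummies replaces the listed columns with indicator columns
# and leaves every other column (here: weight) untouched.
encoded_df = pd.get_dummies(df, columns=["color", "size"])
print(encoded_df.columns.tolist())
```

Note that `get_dummies` does not remember the categories it saw, so for a train/test workflow a fitted scikit-learn encoder is usually the safer choice.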

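Should I convert it back to a DataFrame? Why or why not?

Most sklearn transformers, such as LabelBinarizer, output numpy arrays (this is one of scikit-learn's design principles), so it is easier to work with ndarrays inside a pipeline. Unless you absolutely need some pandas functionality, sticking with ndarrays is a good idea.

What is the best practice for now merging X with my 1 numerical feature?

I suggest using Pipeline together with FeatureUnion. FeatureUnion applies each pipeline to the same input independently and concatenates the results of all pipelines side by side. See the sample code below.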

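    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import LabelBinarizer


    # Selects (or drops) columns of a pandas DataFrame inside a pipeline.
    class DataFrameSelector(TransformerMixin, BaseEstimator):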
        def __init__(self, include=None, exclude=None):
            self.include = include
            self.exclude = exclude

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            """
            Return only the columns listed in `include` if it is set;
            otherwise return all columns except those listed in `exclude`.
            """
            if self.include:
                return X[self.include].copy()
            else:
                return X.drop(self.exclude, axis=1)

    # Wrapper around LabelBinarizer: its fit and transform methods accept only
    # one argument, so it cannot be used directly as a Pipeline step.
    class LblBinarizer(TransformerMixin, BaseEstimator):
        def __init__(self):
            self.binarizer = LabelBinarizer()

        def fit(self, X, y=None):
            self.binarizer.fit(X)
            return self  # fit should return the estimator itself, not the wrapped one

        def transform(self,X,y=None):
            return self.binarizer.transform(X)


    cat_pipeline = Pipeline(
    [
        # DataFrameSelector is the class defined above, so no module prefix is needed.
        ("select categorical features", DataFrameSelector(include=["ocean_proximity"])),
        ("Binarize categorical features", LblBinarizer())
    ])

    num_pipeline = Pipeline(
    [
        ("select numerical features", DataFrameSelector(exclude=["ocean_proximity"]))
    ])

    full_pipeline = FeatureUnion(transformer_list=[
        ("num pipeline", num_pipeline),
        ("cat pipeline", cat_pipeline)
    ])

    prepared_data = full_pipeline.fit_transform(housing_features)
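As a side note that is not part of the answer above: recent scikit-learn versions (0.20 and later) ship `ColumnTransformer`, which covers the same select-and-merge pattern without custom selector classes. The sketch below swaps in that technique; the tiny `housing_features` frame and the `median_income` column are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the housing_features DataFrame.
housing_features = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "NEAR OCEAN"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Numerical column is passed through unchanged.
        ("num", "passthrough", ["median_income"]),
        # Categorical column is one-hot encoded.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
    ],
    sparse_threshold=0.0,  # always return a dense numpy array
)

# Columns come out in the order the transformers are listed:
# median_income first, then one indicator column per category.
prepared_data = preprocessor.fit_transform(housing_features)
print(prepared_data.shape)
```

The output is still a plain numpy array, in line with the point made earlier in this answer.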