
Difference between indicator column and categorical identity column in TensorFlow

data-mining, machine-learning, tensorflow

2021-09-16 08:35:08

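I am learning TensorFlow and came across the different feature columns it uses. Two of these types are categorical_identity_column and indicator_column. Both are defined in the same way, and as far as I understand, both convert a categorical column into a one-hot encoded column.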

So my question is: what is the difference between the two? When should I use one and when the other?

3 Answers

indicator_column encodes the input as a multi-hot representation, not a one-hot encoding.

This example makes it clearer:

name = indicator_column(categorical_column_with_vocabulary_list(
    'name', ['bob', 'george', 'wanda']))
columns = [name, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)

dense_tensor == [[1, 0, 0]]  # If "name" bytes_list is ["bob"]
dense_tensor == [[1, 0, 1]]  # If "name" bytes_list is ["bob", "wanda"]
dense_tensor == [[2, 0, 0]]  # If "name" bytes_list is ["bob", "bob"]

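The last two examples show what multi-hot encoding means. For instance, if the input is ["bob", "wanda"], the encoding is [[1, 0, 1]]; a repeated value such as ["bob", "bob"] is counted, giving [[2, 0, 0]].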
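To make the counting behaviour concrete, here is a minimal pure-Python sketch that reproduces the three outputs above without TensorFlow. The helper name multi_hot is hypothetical and is not part of any TensorFlow API; it only mimics the result, and it assumes that values outside the vocabulary are simply ignored.

```python
from collections import Counter


def multi_hot(values, vocabulary):
    """Return a per-token count vector over `vocabulary` (hypothetical helper)."""
    counts = Counter(values)
    # Assumption of this sketch: tokens not in the vocabulary are ignored.
    return [counts[token] for token in vocabulary]


vocabulary = ["bob", "george", "wanda"]

print(multi_hot(["bob"], vocabulary))           # [1, 0, 0]
print(multi_hot(["bob", "wanda"], vocabulary))  # [1, 0, 1]
print(multi_hot(["bob", "bob"], vocabulary))    # [2, 0, 0]
```

A one-hot vector is the special case in which the input list holds exactly one value.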

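You would use categorical_column_with_* to feed a _CategoricalColumn into a linear model; such a column returns identity values (integer ids), usually looked up through a vocabulary.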

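indicator_column, on the other hand, is the multi-hot representation of a given categorical column. You would use it, for example, to feed the feature into a DNN; it produces an _IndicatorColumn. embedding_column is similar, but you would use it when your input is sparse.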
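The three roles described above can be sketched in plain Python. This is an illustration only, not TensorFlow's implementation: the function names to_ids, indicator and embedding are hypothetical, the embedding table holds made-up values (in a real model it would be learned), and averaging the selected rows is an assumption chosen to mirror the "mean" combiner that embedding_column uses by default.

```python
def to_ids(values, vocabulary):
    """Categorical step: map each string to its integer id in the vocabulary."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index[v] for v in values if v in index]


def indicator(ids, num_buckets):
    """Indicator step: dense vector of length num_buckets holding per-id counts."""
    dense = [0] * num_buckets
    for i in ids:
        dense[i] += 1
    return dense


def embedding(ids, table):
    """Embedding step: average the table rows selected by the ids."""
    dim = len(table[0])
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


vocabulary = ["bob", "george", "wanda"]
# Hypothetical 2-dimensional embedding table, one row per vocabulary entry.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ids = to_ids(["bob", "wanda"], vocabulary)
print(ids)                                # [0, 2]
print(indicator(ids, len(vocabulary)))    # [1, 0, 1]
print(embedding(ids, table))              # [1.0, 0.5]
```

The indicator vector grows with the vocabulary size, while the embedding vector keeps a fixed, usually much smaller, dimension. That is why an embedding tends to be preferred for large, sparse vocabularies.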

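Regarding the question in the comments above (asked by Ankit Seth), the documentation says the following about deep models (as opposed to "wide", i.e. linear, models):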

tf.estimator.DNNClassifier and tf.estimator.DNNRegressor: only accept dense columns. Other column types must be wrapped in either an indicator_column or an embedding_column.

And if you try to pass a categorical column directly to a deep model, TF throws the following error:

ValueError: Items of feature_columns must be a _DenseColumn. You can wrap a categorical column with an embedding_column or indicator_column.
