How do Bahdanau and Luong attention use query, value, and key vectors?

Tags: data mining, deep learning, tensorflow, rnn, transformer, attention mechanism

2022-02-15 15:31:02

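In the latest TensorFlow 2.1, the tensorflow.keras.layers submodule contains the AdditiveAttention() and Attention() layers, which implement Bahdanau's and Luong's attention, respectively. (See the documentation for each layer.)

These new layer types require query, value, and key inputs (although the last one is optional). However, query, value, and key vectors are something I have only ever read about in the context of the Transformer architecture.

What do these vectors represent when it comes to Bahdanau's and Luong's attention? For example, if I want to train an RNN model for a common task (say, time series forecasting), what do these inputs represent?

Edit: I am considering using seq2seq for forecasting. The input would be a series of a given length plus a set of exogenous variables. The output would be the series shifted forward by n steps.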

2 Answers

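To answer your specific questions:

The AdditiveAttention() and Attention() layers are loosely (but not exactly) based on Bahdanau's and Luong's attention, respectively.
They use the post-2018 semantics of queries, values, and keys. To map those semantics back to Bahdanau's or Luong's paper, you can think of the "query" as the last decoder hidden state. The "values" are the set of encoder outputs, i.e. all the hidden states of the encoder. The "query" "attends" to all the "values".
If you step through the library code, you will see that the query is first expanded over the time axis, and then dense layers determine the weights w1 and w2. These weights are applied to the expanded query and to the values, the two results are added together, and finally another weight "v" is applied. A softmax over the resulting scores gives the attention weights, which are then multiplied with the "values" and summed to return the context. This is Bahdanau's additive logic.
However, when analysing the tf.keras.layers.Attention code on GitHub to better understand how to use it, the first line I came across was: "This class is suitable for Dense or CNN networks, and not for RNN networks". Since you are using an RNN, I would be cautious about using this layer. In general, all these off-the-shelf layers are mainly meant for self-attention. If you want to build a transformer-like model, where you drop the RNN entirely and use only attention to represent the sequence, you can consider these classes.
If you still want to use it, you can try the following: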


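    import tensorflow as tf

    # Input 1 (query): the last decoder hidden state, stminus1,
    #   shape (batch, 1, units)
    # Input 2 (value): all hidden states of the encoder, lstm_out,
    #   shape (batch, src_len, units)
    # Apply additive (Bahdanau-style) attention; the output is the context,
    #   shape (batch, 1, units)
    context = tf.keras.layers.AdditiveAttention()([stminus1, lstm_out])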

You can now use the context as an additional input to strengthen the prediction.
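As a quick sanity check of the shapes involved, here is a minimal sketch in which random tensors stand in for the decoder state and the encoder outputs. The sizes (batch of 4, 10 source steps, 16 units) are made up for illustration, and I am assuming a TensorFlow release recent enough to support the `return_attention_scores` argument. Query and value must share the same last dimension.

```python
import tensorflow as tf

batch, src_len, units = 4, 10, 16  # illustrative sizes, not from the question

# Stand-ins for real model tensors (random data, for shape checking only).
stminus1 = tf.random.normal((batch, 1, units))        # query: last decoder state
lstm_out = tf.random.normal((batch, src_len, units))  # value: encoder states

attention = tf.keras.layers.AdditiveAttention()
context, weights = attention(
    [stminus1, lstm_out], return_attention_scores=True)

print(context.shape)  # (batch, 1, units): one context vector per query
print(weights.shape)  # (batch, 1, src_len): one weight per encoder step
```

The weights along the last axis sum to 1, so the context is a weighted average of the encoder states.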

However, I strongly recommend writing your own attention layer, which takes fewer than six lines of code. See for example: https://stackoverflow.com/questions/63060083/create-an-lstm-layer-with-attention-in-keras-for-multi-label-text-classification/64853996#64853996
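For illustration, a hand-written layer following the additive logic described above might look like the sketch below. This is not the code from the linked answer. The class name `BahdanauAttention`, the attribute names `w1`, `w2`, `v`, and all tensor sizes are my own assumptions, chosen to mirror the description.

```python
import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention sketch: score = v^T tanh(w1 q + w2 h)."""

    def __init__(self, units):
        super().__init__()
        self.w1 = tf.keras.layers.Dense(units, use_bias=False)  # projects the query
        self.w2 = tf.keras.layers.Dense(units, use_bias=False)  # projects the values
        self.v = tf.keras.layers.Dense(1, use_bias=False)       # reduces to a scalar score

    def call(self, query, values):
        # query: (batch, dec_units) -> (batch, 1, dec_units) to broadcast over time
        query = tf.expand_dims(query, 1)
        # score: (batch, src_len, 1)
        score = self.v(tf.nn.tanh(self.w1(query) + self.w2(values)))
        # softmax over the source time axis
        weights = tf.nn.softmax(score, axis=1)
        # weighted sum of the values: (batch, enc_units)
        context = tf.reduce_sum(weights * values, axis=1)
        return context, weights


# Hypothetical sizes: decoder state of 16 units, 5 encoder steps of 32 units.
attn = BahdanauAttention(units=8)
context, weights = attn(tf.random.normal((2, 16)), tf.random.normal((2, 5, 32)))
```

Because the query and the values pass through separate projections, the decoder and encoder sizes do not have to match in this version.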

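The general formulation of attention with queries, keys, and values corresponds to a retrieval view of attention: you have some queries that you use to retrieve some values based on the keys that correspond to them.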
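The retrieval view can be sketched as a "soft" dictionary lookup. The example below uses dot-product scoring (the Luong-style choice) in plain NumPy; the function name `attend` and the toy keys and values are made up for illustration. Additive attention would only swap the scoring line.

```python
import numpy as np


def attend(queries, keys, values):
    """Soft lookup: each query returns a similarity-weighted average of the values."""
    scores = queries @ keys.T                             # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ values, weights


keys = np.eye(3) * 10.0                    # three well-separated keys
values = np.array([[1.0], [2.0], [3.0]])   # one value per key
query = np.array([[0.0, 10.0, 0.0]])       # most similar to the second key

out, w = attend(query, keys, values)
print(out)  # very close to [[2.0]], the value stored under the second key
```

With less clear-cut similarities, the result is a blend of several values rather than a single retrieved entry.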

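With RNNs, attention is used in sequence-to-sequence models such as machine translation. (Time series forecasting is usually formulated as sequence labeling.) Attention in an RNN decoder is a special case of this retrieval view: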

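You have only one query, which is the current RNN state. (Note that at training time you have access to all the target words, so you can use the full set of queries.) In the original Bahdanau paper, it is $s_{i-1}$ in Equation 6.
The keys and values are the same: they are the encoder states. In the Keras API, if you do not specify the keys, the values are used as keys. In Bahdanau's paper, they are $h_j$ in Equations 5 and 6.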
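For reference, the equations being referred to can be written as follows. This is my transcription of the paper's notation, so check it against the original before relying on the details:

$$
e_{ij} = a(s_{i-1}, h_j) = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right)
$$

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

Here $s_{i-1}$ plays the role of the query, $h_j$ inside the score $e_{ij}$ plays the role of the key, and $h_j$ inside the weighted sum $c_i$ plays the role of the value.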

An RNN decoder implemented in Keras would then look something like this (based on the TensorFlow tutorial):

    import tensorflow as tf


    class Decoder(tf.keras.Model):
      def __init__(self, vocab_size, embedding_dim, dec_units):
        super().__init__()
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(
            self.dec_units, return_sequences=True,
            return_state=True, recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = tf.keras.layers.AdditiveAttention()

      def call(self, x, hidden, enc_output):
        # x is the previous output token: (batch_size, 1)
        # hidden is the previous hidden state: (batch_size, 1, dec_units)
        # enc_output shape == (batch_size, src_length, hidden_size)
        # AdditiveAttention adds query and key, so dec_units must equal hidden_size.

        # context_vector shape == (batch_size, 1, hidden_size)
        context_vector = self.attention([hidden, enc_output])

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        # context_vector already has a time axis, so no expand_dims is needed.
        x = tf.concat([context_vector, x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, dec_units)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        # state shape == (batch_size, dec_units); expand to
        # (batch_size, 1, dec_units) before passing it back in as `hidden`.
        return x, state
