Why do the PyTorch transformer tutorials multiply the inputs by the square root of the number of inputs (ninp)? I know there is a division by sqrt(d_k) inside multi-head self-attention, but why is there a similar factor applied to the encoder's output / the token embeddings? Especially since the original paper doesn't seem to mention it.
Specifically this (https://pytorch.org/tutorials/beginner/translation_transformer.html):
src = self.encoder(src) * math.sqrt(self.ninp)
Or this (https://pytorch.org/tutorials/beginner/transformer_tutorial.html):
import math
from torch import Tensor, nn

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # scale the embeddings by sqrt(emb_size) before positional encoding is added
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
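A quick numeric check of what this multiplication does (my own sketch, not from the tutorial; it assumes the default nn.Embedding initialization with N(0, 1) weights and an arbitrary emb_size of 512):

import math
import torch
from torch import nn

torch.manual_seed(0)
emb_size = 512                               # arbitrary, just for illustration
embedding = nn.Embedding(1000, emb_size)     # default init: weights ~ N(0, 1)
tokens = torch.randint(0, 1000, (8, 32))     # fake batch of token indices

raw = embedding(tokens)                      # std ≈ 1
scaled = raw * math.sqrt(emb_size)           # std ≈ sqrt(emb_size) ≈ 22.6
print(raw.std().item(), scaled.std().item())

So the multiplication blows the embeddings up from roughly unit scale to roughly sqrt(emb_size) scale before the positional encoding is added, and it's the motivation for that I'm trying to understand.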
Note that I'm aware the attention layer uses this equation:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
They give an argument for this in a footnote of the paper (about keeping the variance of the dot products at 1).
Is the embedding scaling related to that comment? If so, how? Is it mentioned anywhere in the original paper?
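To make the footnote argument concrete, here is a small sketch of my own (not from the paper or the tutorial), assuming the components of q and k are i.i.d. N(0, 1) as the footnote does:

import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(100000, d_k)        # components ~ N(0, 1)
k = torch.randn(100000, d_k)
dots = (q * k).sum(dim=-1)          # dot products q·k

print(dots.var().item())                    # ≈ d_k = 64
print((dots / d_k ** 0.5).var().item())     # ≈ 1 after dividing by sqrt(d_k)

What I can't see is whether the sqrt(emb_size) multiplication on the embeddings is meant to play into this same variance bookkeeping, or whether it is there for some other reason.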
Cross-posted:
- https://discuss.pytorch.org/t/why-does-the-transformer-tutorial-have-a-multiplication-by-square-root-of-the-number-of-inputs/126738/6
- https://www.reddit.com/r/learnmachinelearning/comments/okfd7g/why_does_the_embedding_of_tokens_to_the/
- https://www.reddit.com/r/pytorch/comments/op0z2t/why_are_the_embeddings_of_tokens_multiplied_by/
- https://www.reddit.com/r/LanguageTechnology/comments/op13ep/why_are_the_embeddings_of_tokens_multiplied_by/
- https://www.reddit.com/r/deeplearning/comments/opcnwy/why_are_the_embeddings_of_tokens_multiplied_by/
- https://qr.ae/pGuVsC