Why do the PyTorch transformer tutorials multiply the inputs by the square root of the number of inputs (ninp)? I know there is a division by sqrt(d_k) inside multi-head self-attention, but why is there a similar factor applied to the encoder's output / the token embeddings? Especially since the original paper doesn't seem to mention it.
Specifically this (https://pytorch.org/tutorials/beginner/translation_transformer.html):
src = self.encoder(src) * math.sqrt(self.ninp)
Or this (https://pytorch.org/tutorials/beginner/transformer_tutorial.html):
import math
from torch import Tensor, nn

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # scale the embeddings by sqrt(emb_size) before positional encoding is added
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
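A quick numeric check of what this multiplication does (my own sketch, not from the tutorial; it assumes the default nn.Embedding initialization with N(0, 1) weights and an arbitrary emb_size of 512):

import math
import torch
from torch import nn

torch.manual_seed(0)
emb_size = 512                               # arbitrary, just for illustration
embedding = nn.Embedding(1000, emb_size)     # default init: weights ~ N(0, 1)
tokens = torch.randint(0, 1000, (8, 32))     # fake batch of token indices

raw = embedding(tokens)                      # std ≈ 1
scaled = raw * math.sqrt(emb_size)           # std ≈ sqrt(emb_size) ≈ 22.6
print(raw.std().item(), scaled.std().item())

So the multiplication blows the embeddings up from roughly unit scale to roughly sqrt(emb_size) scale before the positional encoding is added, and it's the motivation for that I'm trying to understand.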
Note that I'm aware the attention layer uses this equation:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
They give an argument for this in a footnote of the paper (about keeping the variance of the dot products at 1).
Is the embedding scaling related to that comment? If so, how? Is it mentioned anywhere in the original paper?
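To make the footnote argument concrete, here is a small sketch of my own (not from the paper or the tutorial), assuming the components of q and k are i.i.d. N(0, 1) as the footnote does:

import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(100000, d_k)        # components ~ N(0, 1)
k = torch.randn(100000, d_k)
dots = (q * k).sum(dim=-1)          # dot products q·k

print(dots.var().item())                    # ≈ d_k = 64
print((dots / d_k ** 0.5).var().item())     # ≈ 1 after dividing by sqrt(d_k)

What I can't see is whether the sqrt(emb_size) multiplication on the embeddings is meant to play into this same variance bookkeeping, or whether it is there for some other reason.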
Cross-posted:
- https://discuss.pytorch.org/t/why-does-the-transformer-tutorial-have-a-multiplication-by-square-root-of-the-number-of-inputs/126738/6
- https://www.reddit.com/r/learnmachinelearning/comments/okfd7g/why_does_the_embedding_of_tokens_to_the/
- https://www.reddit.com/r/pytorch/comments/op0z2t/why_are_the_embeddings_of_tokens_multiplied_by/
- https://www.reddit.com/r/LanguageTechnology/comments/op13ep/why_are_the_embeddings_of_tokens_multiplied_by/
- https://www.reddit.com/r/deeplearning/comments/opcnwy/why_are_the_embeddings_of_tokens_multiplied_by/
- https://qr.ae/pGuVsC