I am working with embeddings and want to see how feasible it is to predict a score attached to a given sequence of words. The details of the score are not important.
Input (tokenized sentence): ('the', 'dog', 'ate', 'the', 'apple')
Output (float): 0.25
I have been following this tutorial, which tries to predict part-of-speech tags for input like this. In that case the output of the system is a distribution over all possible tags for every token in the sequence. For example, with three possible POS classes {'DET': 0, 'NN': 1, 'V': 2}, the output for ('the', 'dog', 'ate', 'the', 'apple') might be
tensor([[-0.0858, -2.9355, -3.5374],
[-5.2313, -0.0234, -4.0314],
[-3.9098, -4.1279, -0.0368],
[-0.0187, -4.7809, -4.5960],
[-5.8170, -0.0183, -4.1879]])
Each row corresponds to one token, and the index of the highest value in a row is the best predicted POS tag for that token.
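For concreteness, this is how I read that output (the tensor is copied from above; mapping the indices back to tag names with argmax is my own addition, so the variable names here are mine):

import torch

tag_to_ix = {'DET': 0, 'NN': 1, 'V': 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

tag_scores = torch.tensor([[-0.0858, -2.9355, -3.5374],
                           [-5.2313, -0.0234, -4.0314],
                           [-3.9098, -4.1279, -0.0368],
                           [-0.0187, -4.7809, -4.5960],
                           [-5.8170, -0.0183, -4.1879]])

# argmax over each row picks the most likely tag for that token
predicted = [ix_to_tag[ix.item()] for ix in tag_scores.argmax(dim=1)]
print(predicted)  # ['DET', 'NN', 'V', 'DET', 'NN']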
I understand that example reasonably well, so I wanted to change it into a regression problem. The full code is below, but I am trying to understand its output.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


class LSTMRegressor(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(LSTMRegressor, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # The linear layer that maps from hidden state space to a single output
        self.linear = nn.Linear(hidden_dim, 1)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        regression = F.relu(self.linear(lstm_out.view(len(sentence), -1)))
        return regression


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)
# ================================================
training_data = [
    ("the dog ate the apple".split(), 0.25),
    ("everybody read that book".split(), 0.78)
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
# ================================================

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMRegressor(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix))
loss_function = nn.MSELoss()
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

# See what the results are before training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    regr = model(inputs)
    print(regr)
for epoch in range(100):  # again, normally you would NOT do 100 epochs, it is toy data
    for sentence, target in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        target = torch.tensor(target, dtype=torch.float)

        # Step 3. Run our forward pass.
        score = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(score, target)
        loss.backward()
        optimizer.step()

# See what the results are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    regr = model(inputs)
    print(regr)
The output is:
# Before training
tensor([[0.0000],
[0.0752],
[0.1033],
[0.0088],
[0.1178]])
# After training
tensor([[0.6181],
[0.4987],
[0.3784],
[0.4052],
[0.4311]])
But I do not understand why. I was expecting a single output, yet the tensor has as many rows as the input has tokens. My guess would then be that a hidden state is produced for every step of the input. Is that correct? Does that mean the last item in the tensor (tensor[-1], or is it the first item, tensor[0]?) is the final prediction? Why are all the outputs returned? Or is there a misunderstanding in my forward pass? Should I perhaps feed only the last item of the LSTM output to the linear layer?
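To make that last question concrete, this is the change I have in mind (only a sketch of what I mean by feeding the last item to the linear layer, not something I know to be correct):

    # Sketch of the alternative forward() I am considering: use only the LSTM
    # output of the last time step, so the model returns one value per sentence
    # instead of one value per token.
    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        last_step = lstm_out[-1]                     # shape (1, hidden_dim): output at the final time step
        regression = F.relu(self.linear(last_step))  # shape (1, 1): a single prediction
        return regression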
I would also very much like to know how this extrapolates to bidirectional and multi-layer LSTMs, and even how it works with GRUs (bidirectional or not).
The bounty will go to whoever can explain why one would use the last output versus the last hidden state, or what the difference means from a goal-oriented perspective. In addition, some information about multi-layer architectures and bidirectional RNNs is welcome. For instance, is it common practice to sum or concatenate the outputs and hidden states of a bidirectional LSTM/GRU to get your data into a sensible shape? If so, how do you do that?
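For reference, this is my current understanding of the shapes a bidirectional, multi-layer LSTM returns, plus one way of combining the two directions that I have seen mentioned (the variable names and the concatenation at the end are my own assumptions, which is exactly what I would like to have confirmed or corrected):

import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 5, 1, 6, 6, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (seq_len, batch, 2 * hidden_size): both directions, concatenated per time step
print(h_n.shape)     # (num_layers * 2, batch, hidden_size): final hidden state per layer and direction

# One option I have seen: concatenate the last layer's forward and backward
# final hidden states into a single vector per sequence.
last_forward = h_n[-2]   # (batch, hidden_size)
last_backward = h_n[-1]  # (batch, hidden_size)
combined = torch.cat([last_forward, last_backward], dim=1)  # (batch, 2 * hidden_size)
print(combined.shape)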