How do I train a transformer model to output a sequence?

Tags: data-mining, pytorch, transformer, sequence, sequence-to-sequence

2022-03-03 14:06:05

I am using huggingface to build a model that can identify errors in a given sentence. Say I have a sentence and a corresponding label as follows:

correct_sentence = "we used to play together."
correct_label = [1, 1, 1, 1, 1]

changed_sentence = "we use play to together."
changed_label = [1, 2, 2, 2, 1]

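These labels are further padded with 0 to a uniform length of 512. The sentences are also tokenized and padded (or truncated) to this length. The model is as follows: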

class Camembert(torch.nn.Module):
    """
    The definition of the custom model, last 15 layers of Camembert will be retrained
    and then a fcn to 512 (the size of every label).
    """
    def __init__(self, cam_model):
        super(Camembert, self).__init__()
        self.l1 = cam_model
        total_layers = 199
        for i, param in enumerate(cam_model.parameters()):
            if total_layers - i > hparams["retrain_layers"]:
                param.requires_grad = False
            else:
                pass
        self.l2 = torch.nn.Dropout(hparams["dropout_rate"])
        self.l3 = torch.nn.Linear(768, 512)

    def forward(self, ids, mask):
        _, output = self.l1(ids, attention_mask=mask)
        output = self.l2(output)
        output = self.l3(output)
        return output

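Say batch_size=2; the output layer will then have shape (2, 512), the same as target_label. As far as I understand, this approach amounts to saying there are 512 classes to classify, which is not what I want. The problem appears when I try to compute the loss with torch.nn.CrossEntropyLoss(), which gives me the following error (truncated):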

 File "D:\Anaconda\lib\site-packages\torch\nn\functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: multi-target not supported at C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src\THCUNN/generic/ClassNLLCriterion.cu:15

How should I solve this, and are there any tutorials for similar models?

1 Answer

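I think you should treat this as a binary classification problem. For each word in the changed sentence you will have a binary label: correct or incorrect. I would suggest relabeling so that "correct" words have label 0 and "incorrect" words have label 1. In your example, you would have: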

correct_sentence = "we used to play together"
changed_sentence = "we use play to together"
labels = [0, 1, 1, 1, 0]

Instead of padding with some special value, pad with the "correct" label (which is 0 if you use my suggestion above).

By convention, class labels always start at index 0, so this labeling scheme matches what PyTorch expects for binary classification problems.
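As a rough illustration of the relabeling and padding described above, here is a minimal sketch. The helper name `pad_labels` and its `max_len` and `pad_value` parameters are made up for this example, and it assumes one label per token.

```python
def pad_labels(labels, max_len=512, pad_value=0):
    """Pad (or truncate) a list of 0/1 token labels to exactly max_len entries.

    pad_labels is a hypothetical helper, not part of PyTorch or huggingface.
    """
    labels = list(labels)[:max_len]  # truncate if the sequence is too long
    # Fill the remainder with the "correct" label rather than a special value.
    return labels + [pad_value] * (max_len - len(labels))


# "we use play to together": 0 = correct, 1 = incorrect
labels = pad_labels([0, 1, 1, 1, 0])
```

Every padded position then counts as a "correct" token, which is the choice suggested above.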

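Next, you need to change the activation function on the final Linear layer. Right now your model ends with just a Linear layer, which means the output is unbounded. That does not really make sense for a classification problem, because you know the predicted label should always be in the range [0, C-1], where C is the number of classes.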

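Instead, you should apply an activation function to make your outputs behave more like class labels. For binary classification problems, a good choice for the final activation is torch.nn.Sigmoid. You would modify your model definition like this: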

class Camembert(torch.nn.Module):
    """
    The definition of the custom model, last 15 layers of Camembert will be retrained
    and then a fcn to 512 (the size of every label).
    """
    def __init__(self, cam_model):
        super(Camembert, self).__init__()
        self.l1 = cam_model
        total_layers = 199
        for i, param in enumerate(cam_model.parameters()):
            if total_layers - i > hparams["retrain_layers"]:
                param.requires_grad = False
            else:
                pass
        self.l2 = torch.nn.Dropout(hparams["dropout_rate"])
        self.l3 = torch.nn.Linear(768, 512)
        self.activation = torch.nn.Sigmoid()

    def forward(self, ids, mask):
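        # [1] selects the pooled output, shape (batch_size, 768); integer indexing
        # works whether the model returns a tuple or a ModelOutput object
        output = self.l1(ids, attention_mask=mask)[1]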
        output = self.l2(output)
        output = self.l3(output)
        output = self.activation(output)
        return output

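Your output will now have dimensions (batch_size, 512), since the Linear layer maps the pooled 768-dimensional vector to 512 values. Each of the 512 outputs is a number between 0 and 1. You can think of it as the probability that the corresponding token is "incorrect". If the output is greater than 0.5, the label becomes "incorrect"; otherwise, the label is "correct".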
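The thresholding step can be sketched as follows. The `probs` tensor is made-up stand-in data for the model output (a batch of 2 sequences with only 5 tokens each, for readability), not real predictions.

```python
import torch

# Hypothetical sigmoid outputs; in the real model each row would have 512 entries.
probs = torch.tensor([[0.10, 0.80, 0.65, 0.90, 0.20],
                      [0.05, 0.30, 0.51, 0.49, 0.99]])

# 1 = "incorrect", 0 = "correct"
predicted = (probs > 0.5).long()
```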

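Finally, since you are treating this as a binary classification problem, you need to use binary cross-entropy loss (torch.nn.BCELoss). Note that the output here has shape (batch_size, 512), so the labels must be floats with that same shape; a single unbatched label vector needs unsqueeze(0) to add the batch dimension.

model = Camembert(cam_model)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

input = <tokenized, padded input sequence>
mask = <attention mask for the input sequence>
# BCELoss expects float targets
labels = torch.tensor([0, 1, 1, 1, 0, . . .  , 0], dtype=torch.float)
output = model(input, mask)
# unsqueeze(0) turns the (512,) label vector into (1, 512) to match the output
loss = criterion(output, labels.unsqueeze(0))

optimizer.zero_grad()
loss.backward()
optimizer.step()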
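To check the shapes end to end without downloading a pretrained model, here is a self-contained sketch of one training step. `TinyHead` is a hypothetical stand-in that keeps only the dropout, linear, and sigmoid layers; the encoder's pooled output is replaced by random data, and the dropout rate and learning rate are arbitrary values chosen for illustration.

```python
import torch

SEQ_LEN, HIDDEN = 512, 768


class TinyHead(torch.nn.Module):
    """Stand-in for the classification head only (no pretrained encoder)."""

    def __init__(self):
        super().__init__()
        self.dropout = torch.nn.Dropout(0.1)
        self.linear = torch.nn.Linear(HIDDEN, SEQ_LEN)
        self.activation = torch.nn.Sigmoid()

    def forward(self, pooled):
        return self.activation(self.linear(self.dropout(pooled)))


torch.manual_seed(0)
model = TinyHead()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

pooled = torch.randn(2, HIDDEN)   # fake encoder output, batch_size = 2
labels = torch.zeros(2, SEQ_LEN)  # float targets, same shape as the output
labels[0, 1:4] = 1.0              # tokens 1..3 of the first sentence are "incorrect"

output = model(pooled)            # shape (2, 512), values between 0 and 1
loss = criterion(output, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the output and the labels are both (batch_size, 512) float tensors, BCELoss accepts them directly and no reshaping is needed for batched labels.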
