Data mining - Imbalanced classes in named entity recognition

I am currently working on an NER problem that tries to extract two entities, point of interest (POI) and street, from address strings in Indonesian.

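I used IndoBERT (sarahlintang/IndoBERT) and attached an FC layer on top of the BERT model, using cross-entropy loss to predict the class of each word. However, a typical label sequence for an example sentence in my dataset looks like this: [1, 4, 5, 5, 1, 1, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]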

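Here 2-5 are the POI and street classes, 0 is the pad token, and 1 is the O tag (BIO tagging scheme). However, the model ends up predicting the majority class for every word. I then used ignore_index and class weights in the loss function, as follows: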

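import torch
import torch.nn as nn

# Per-class weights: pad (0) and O (1) are down-weighted relative to the entity tags (2-5)
WEIGHTS = torch.tensor([0.05, 0.2, 1.0, 1.0, 1.0, 1.0])
def loss(predicted, target):
    # predicted is (batch, seq_len, classes); CrossEntropyLoss wants the class axis on dim 1
    predicted = torch.rot90(predicted, -1, (1, 2))
    criterion = nn.CrossEntropyLoss(weight=WEIGHTS, ignore_index=0, reduction='mean')
    return criterion(predicted, target)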
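
One thing worth checking in the loss above: `torch.rot90(predicted, -1, (1, 2))` does produce the `(batch, classes, seq_len)` shape that `nn.CrossEntropyLoss` expects, but a rotation is a transpose followed by a flip, so the sequence axis comes out reversed and each token's logits get scored against another position's label. The sketch below compares it with `permute(0, 2, 1)`, which only swaps the two axes. The random tensors, the sizes, and the name `token_loss` are illustrative assumptions, not part of the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, num_classes = 2, 6, 6  # made-up sizes for the demonstration
logits = torch.randn(batch, seq_len, num_classes)

rotated = torch.rot90(logits, -1, (1, 2))   # what the question's loss does
permuted = logits.permute(0, 2, 1)          # plain axis swap

same_shape = rotated.shape == permuted.shape               # both (batch, classes, seq_len)
same_values = torch.equal(rotated, permuted)               # contents differ
reversed_in_time = torch.equal(rotated, permuted.flip(2))  # rot90 == swap + flip of seq axis

WEIGHTS = torch.tensor([0.05, 0.2, 1.0, 1.0, 1.0, 1.0])

def token_loss(predicted, target):
    """Weighted cross-entropy over tokens; label 0 (pad) is ignored."""
    criterion = nn.CrossEntropyLoss(weight=WEIGHTS, ignore_index=0, reduction="mean")
    # Move classes to dim 1 without touching the token order.
    return criterion(predicted.permute(0, 2, 1), target)

target = torch.randint(0, num_classes, (batch, seq_len))
target[:, 0] = 1  # make sure at least one position is not ignored
loss_value = token_loss(logits, target)
```

If the comparison holds for your tensors too, swapping `rot90` for `permute` (or `transpose(1, 2)`) keeps logits and labels aligned position by position.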

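This fixed the issue above, but the model still does not learn: the loss just fluctuates around the same level. So I would like to know whether there is a way to adjust for this kind of class imbalance, or to stop the model from predicting the [PAD] token's class.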
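
As a rough illustration of both requests, the sketch below derives class weights from label frequencies instead of hand-picking them, and masks the pad logit at prediction time so the pad class can never be chosen. Inverse-frequency weighting is one common heuristic and is not guaranteed to fix the flat loss. The helper names `inverse_frequency_weights` and `predict_without_pad` are hypothetical, and the label tensor is a shortened version of the example sequence above.

```python
import torch

PAD_ID = 0  # label id used for padding in the question

def inverse_frequency_weights(labels, num_classes, pad_id=PAD_ID):
    """Balanced weights n_tokens / (n_seen_classes * count); pad gets weight 0."""
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
    counts[pad_id] = 0.0                      # padding does not count as data
    weights = torch.zeros(num_classes)
    seen = counts > 0                         # skip classes absent from the sample
    weights[seen] = counts[seen].sum() / (seen.sum() * counts[seen])
    return weights

def predict_without_pad(logits, pad_id=PAD_ID):
    """Argmax over classes with the pad class excluded."""
    masked = logits.clone()
    masked[..., pad_id] = float("-inf")
    return masked.argmax(dim=-1)

# Shortened example: 12 real tokens followed by 4 pad positions.
labels = torch.tensor([[1, 4, 5, 5, 1, 1, 2, 3, 3, 3, 1, 1, 0, 0, 0, 0]])
weights = inverse_frequency_weights(labels, num_classes=6)
```

The resulting tensor could be passed as `weight=` to `nn.CrossEntropyLoss` together with `ignore_index=0`. In practice the counts would come from the whole training set rather than from one sentence.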

Here is the code for my model:

import torch.nn as nn
from transformers import AutoModel

class aem(nn.Module):
    def __init__(self, no_class):
        super().__init__()
        self.bert = AutoModel.from_pretrained("sarahlintang/IndoBERT")
        self.drop1 = nn.Dropout(p=0.1)
        # Token-level classifier on top of the BERT hidden states
        self.l1 = nn.Linear(self.bert.config.hidden_size, no_class)
        self.out = nn.GELU()

    def forward(self, inputs, attn):
        hidden = self.bert(inputs, token_type_ids=None, attention_mask=attn, return_dict=True)
        # hidden[0] is the last hidden state: (batch, seq_len, hidden_size)
        L1out = self.out(self.l1(self.drop1(hidden[0])))
        return L1out
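
One detail in the model above may be worth a second look: the final linear output passes through `nn.GELU()` before reaching the loss. `nn.CrossEntropyLoss` expects raw, unnormalized logits, and GELU never outputs values below roughly -0.17, so the head cannot push a class score strongly negative. Whether this contributes to the flat loss is not established here. The sketch below only shows a head that returns the linear output directly. `TokenClassifierHead` is a hypothetical name, a random tensor stands in for the BERT hidden states, and the hidden size of 768 is an assumption.

```python
import torch
import torch.nn as nn

class TokenClassifierHead(nn.Module):
    """Dropout + linear layer that returns raw logits (no activation)."""

    def __init__(self, hidden_size, no_class, p_drop=0.1):
        super().__init__()
        self.drop = nn.Dropout(p=p_drop)
        self.linear = nn.Linear(hidden_size, no_class)

    def forward(self, hidden_states):
        # (batch, seq_len, hidden_size) -> (batch, seq_len, no_class)
        return self.linear(self.drop(hidden_states))

torch.manual_seed(0)
hidden_states = torch.randn(2, 8, 768)  # stand-in for the encoder output
head = TokenClassifierHead(hidden_size=768, no_class=6)
logits = head(hidden_states)

# Lowest value GELU produces on a dense grid of inputs.
gelu_floor = nn.functional.gelu(torch.linspace(-6.0, 6.0, 2001)).min().item()
```

In the original `aem` class, the equivalent change would be returning `self.l1(self.drop1(hidden[0]))` without `self.out`.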