Is there a way to define custom entities in spaCy?

data-mining python nlp spacy
2021-10-02 18:58:03

1) I have just started studying NLP. The basic idea is to extract meaningful information from text, and I am using spaCy for this.

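From what I have read, spaCy has the following entity types:

  • ORG (organization)
  • DATE
  • CARDINAL

and so on, but I would like to add custom entities, for example:

Nokia-3310 should be labeled as Mobile, and XBOX should be labeled as Games.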

2) Are there any already-trained models for spaCy that I can work with?

1 Answer

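As for pretrained models, spaCy provides several models for different languages. You can find them in the official documentation: https://spacy.io/models

Models are available for:

  1. English
  2. German
  3. French
  4. Spanish
  5. Portuguese
  6. Italian
  7. Dutch
  8. Greek
  9. Multi-language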

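If you want the NER component to support additional labels, you can train a model on your own dataset. spaCy supports this, and the official documentation covers it at https://spacy.io/usage/training#ner. Here is an example: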

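# Note: this example is written against the spaCy v2 API
# (nlp.create_pipe, nlp.begin_training, nlp.update(texts, annotations)).
# spaCy v3 changed several of these calls, so it will not run there unmodified.
import random

import spacy
from spacy.util import minibatch, compounding

LABEL = "ANIMAL"
n_iter = 30  # number of training iterations; value chosen for illustration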

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]


nlp = spacy.blank("en")  # create blank Language class
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

ner.add_label(LABEL)  # add new entity label to entity recognizer

optimizer = nlp.begin_training()

move_names = list(ner.move_names)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

with nlp.disable_pipes(*other_pipes):  # only train NER
    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)
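The training data above marks each entity with character offsets, `(start, end, label)`, and counting those by hand is error-prone. As a rough sketch, the offsets for the labels from the question could be generated with plain string search. The helper names `find_entity_spans` and `build_example`, the `TERMS` dictionary, and the sample sentences are all made up for illustration and are not part of spaCy. The sketch also assumes exact, case-sensitive matches and does not check that spans line up with spaCy's token boundaries.

```python
def find_entity_spans(text, phrase, label):
    """Return (start, end, label) for every exact occurrence of phrase in text."""
    spans = []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        spans.append((start, end, label))
        # Continue searching after the current match
        start = text.find(phrase, end)
    return spans


def build_example(text, terms):
    """Build one (text, {"entities": [...]}) tuple in the format used above."""
    entities = []
    for phrase, label in terms.items():
        entities.extend(find_entity_spans(text, phrase, label))
    entities.sort()  # order spans by start offset
    return (text, {"entities": entities})


# Hypothetical term list based on the labels requested in the question
TERMS = {"Nokia-3310": "Mobile", "XBOX": "Games"}

# Made-up sentences; real training needs many more varied examples
SENTENCES = [
    "I still use my Nokia-3310 every day",
    "He plays XBOX after work",
    "Do they bite?",
]

TRAIN_DATA = [build_example(sentence, TERMS) for sentence in SENTENCES]

for text, annotations in TRAIN_DATA:
    print(text, annotations)
```

A list built this way can be passed to the training loop in place of the hand-written `TRAIN_DATA`, with `ner.add_label` called once per label. Sentences without any entity, like the last one, are kept on purpose, just as in the example above.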

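If you want to start from an existing model and add a new custom label, you can read the linked article in their documentation, which describes the process in detail. In practice, it is very similar to the code above.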