Is there a way to define custom entities in spaCy?

data-mining python nlp spacy

2021-10-02 18:58:03

1) I have just started exploring NLP. The basic idea is to extract meaningful information from text, and I am using spaCy for this.

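As far as I can tell, spaCy recognizes the following entity types:

ORG
PERSON
DATE
MONEY
CARDINAL

and so on. But I want to add custom entities, for example:

Nokia-3310 should be labeled Mobile, and XBOX should be labeled Games.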

2) Are there any pre-trained models for spaCy that I can use?

1 Answer

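As for pre-trained models, spaCy provides several models for different languages. You can find them in the official documentation: https://spacy.io/models

The available models are:

English
German
French
Spanish
Portuguese
Italian
Dutch
Greek
Multi-language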

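If you want NER to support additional labels, you can train a model on your own dataset. spaCy supports this, and the official documentation covers it at https://spacy.io/usage/training#ner. Here is an example: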

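import random

import spacy
from spacy.util import minibatch, compounding

# Note: this example uses the spaCy v2 training API
# (create_pipe / begin_training / update(texts, annotations)).
LABEL = "ANIMAL"
n_iter = 30  # number of training iterations (assumed value; tune for your data)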

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]


nlp = spacy.blank("en")  # create blank Language class
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

ner.add_label(LABEL)  # add new entity label to entity recognizer

optimizer = nlp.begin_training()

move_names = list(ner.move_names)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

with nlp.disable_pipes(*other_pipes):  # only train NER
    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

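If you want to start from an existing model and add a new custom label, the documentation linked above describes the process in detail. In practice, it is very similar to the code above.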
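For the labels in the question (Mobile, Games), the training data would use the same character-offset format as TRAIN_DATA above. Counting offsets by hand is error-prone, so here is a minimal sketch that builds them with the standard library only. The helper name `annotate`, the `GAZETTEER` mapping, the label spellings `MOBILE` / `GAMES`, and the sample sentences are all made up for illustration and are not part of spaCy.

```python
import re


def annotate(text, gazetteer):
    """Return (text, {"entities": [(start, end, label), ...]}).

    `gazetteer` maps a surface string to its label. Offsets are character
    positions, matching the tuple format used in TRAIN_DATA.
    """
    entities = []
    for term, label in gazetteer.items():
        # Match the term only when it is not glued to other word characters,
        # so "XBOX" does not match inside "XBOXES". Hyphens in the term are
        # escaped and matched literally.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((match.start(), match.end(), label))
    entities.sort()
    return text, {"entities": entities}


# Hypothetical term list for the two custom labels from the question.
GAZETTEER = {"Nokia-3310": "MOBILE", "XBOX": "GAMES"}

CUSTOM_TRAIN_DATA = [
    annotate("I traded my Nokia-3310 for an XBOX", GAZETTEER),
    annotate("Do they still sell those?", GAZETTEER),  # negative example
]

for text, annotations in CUSTOM_TRAIN_DATA:
    print(text, annotations)
```

The resulting list could be passed to the training loop in place of TRAIN_DATA, after calling `ner.add_label` once per label. This sketch does not check for overlapping matches, so the terms in the gazetteer should not contain one another.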
