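1) I have just started exploring NLP. The basic idea is to extract meaningful information from text, and for that I am using spaCy.
As far as I have researched, spaCy has the following entities:
- ORG
- PERSON
- DATE
- MONEY
- CARDINAL
and so on. But I want to add custom entities, for example:
Nokia-3310 should be labeled as Mobile,
and XBOX should be labeled as Games.
2) Can I find some already-trained models in spaCy to work with?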
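For pretrained models, spaCy offers several models for different languages. The available models are listed in the official documentation: https://spacy.io/models
If you want to support additional labels in NER, you can train a model on your own dataset. This is also covered by spaCy and its official documentation at https://spacy.io/usage/training#ner. Here is an example: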
# Note: this example uses the spaCy v2 training API.
import random

import spacy
from spacy.util import minibatch, compounding

LABEL = "ANIMAL"
# Each example is (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]
n_iter = 30  # number of training iterations (assumed value; not set in the original snippet)

nlp = spacy.blank("en")  # create blank Language class
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label(LABEL)  # add new entity label to entity recognizer
optimizer = nlp.begin_training()
move_names = list(ner.move_names)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train NER
    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)
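If you want to start from an existing model and add a new custom label, you can read the linked article in their documentation, which describes the process in detail. In practice, it is very similar to the code above.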
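For the labels in the question, the training data needs the same shape as above: a text plus a list of (start, end, label) character offsets. Counting offsets by hand is error-prone, so here is a minimal sketch in plain Python that derives them from a small term list. The `TERMS` dictionary, the `annotate` helper, and the sample sentences are all hypothetical and only there for illustration; they are not part of spaCy.

```python
import re

# Hypothetical term list for illustration: surface form -> custom label.
TERMS = {"Nokia-3310": "Mobile", "XBOX": "Games"}


def annotate(text, terms=TERMS):
    """Return (text, {"entities": [(start, end, label), ...]}) for every term match."""
    entities = []
    for term, label in terms.items():
        # re.escape keeps characters such as "-" literal; matching is case-sensitive.
        for match in re.finditer(re.escape(term), text):
            entities.append((match.start(), match.end(), label))
    entities.sort()  # order entities by start offset
    return text, {"entities": entities}


# Made-up sentences; real training needs many more, varied examples.
TRAIN_DATA = [
    annotate("I still use my Nokia-3310 every day"),
    annotate("He plays XBOX after work"),
    annotate("Nothing to label here"),
]

for text, annotations in TRAIN_DATA:
    print(text, annotations)
```

The resulting `TRAIN_DATA` has the same structure as the list in the answer, so it could be dropped into that training loop after calling `ner.add_label` once per label. Note that this sketch does not guard against overlapping matches, which spaCy's NER does not accept, so a real term list would need a check for that.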