Chunking sentences with spaCy

data-mining machine-learning nlp nltk spacy
2021-10-13 05:52:25

I have a lot of sentences (500k) that look like this:

"Penalty missed! Bad penalty by Felipe Brisola  - Riga FC -  shot with right foot is very close to the goal. Felipe Brisola should be disappointed."
"Penalty saved! Damir Kojasevic  - Sutjeska Niksic -  fails to capitalise on this great opportunity,  shot with right foot saved  in the centre of the goal."   
"Penalty saved! Stefan Panic  - Riga FC -  fails to capitalise on this great opportunity,  shot with right foot saved  in the centre of the goal."
"Penalty saved! Georgie Kelly  - Dundalk -  fails to capitalise on this great opportunity,  shot with right foot saved  in the centre of the goal."
"Penalty missed! Still  FC København 1, Crvena Zvezda 1. Marko Marin  - Crvena Zvezda -  hits the bar with a shot with right foot."

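As you can see, they do not follow a strict template. After ending up with 1,500 lines of PHP code (using regular expressions) that still produced inconsistent results, I decided to look at machine learning alternatives.

What I want to achieve is: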

For example, for this sentence:

"Penalty saved! Stefan Panic  - Riga FC -  fails to capitalise on this great opportunity,  shot with right foot saved  in the centre of the goal."

type => penalty
action => saved
reason => shot with right foot saved  in the centre of the goal
person => Stefan Panic

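I stumbled upon spaCy, saw its "named entity recognition" feature, and thought that maybe I could use it for this purpose, especially since I have a large amount of training data.

My question: is spaCy's named entity recognition suitable for this task? If not, what should I learn for this task?

PS: I know a bit of Python, but nothing about ML.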

3 Answers

Named entity recognition (NER) extracts the names of people, organizations, and so on. Example:

"Penalty missed! Bad penalty by <person>Felipe Brisola</person>  - <organization>Riga FC</organization> -  shot with right foot is very close to the goal. <person>Felipe Brisola</person> should be disappointed."

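So it may help with the "person" field, but probably not with the other fields. Note that you could also train an NER-like system to predict the other fields, but that would require a large amount of annotated data, and there is no guarantee it would work well.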
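To give a sense of what "annotated data" means here, the sketch below turns already-known field values (for instance, the output of an existing regex parser) into character-offset spans, which is the usual shape of NER training annotations. The function name `to_annotations` and the label names are hypothetical, chosen only for illustration; it simply takes the first occurrence of each value in the text.

```python
def to_annotations(text, fields):
    """Convert known field values into (start, end, LABEL) character spans.

    `fields` maps a field name to the exact substring it covers in `text`.
    Values that do not occur verbatim in the text are skipped.
    """
    entities = []
    for label, value in fields.items():
        start = text.find(value)  # first occurrence only (an assumption)
        if start == -1:
            continue
        entities.append((start, start + len(value), label.upper()))
    return text, {"entities": sorted(entities)}


sentence = (
    "Penalty saved! Stefan Panic  - Riga FC -  fails to capitalise on this "
    "great opportunity,  shot with right foot saved  in the centre of the goal."
)
known_fields = {
    "action": "saved",
    "person": "Stefan Panic",
    "reason": "shot with right foot saved  in the centre of the goal",
}
text, annotation = to_annotations(sentence, known_fields)
for start, end, label in annotation["entities"]:
    print(label, repr(text[start:end]))
```

Producing spans like these for a few thousand sentences, and checking a sample of them by hand, would be the first step before training any custom model.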

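You can use spaCy's dependency parsing and POS tagging here. This will help with tagging the "action", and with some additional brainstorming you should be able to train your model on the rest of the sentence.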

You can accomplish the above task by writing your own entity extraction patterns: https://spacy.io/usage/rule-based-matching
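A minimal sketch of what such patterns could look like for the sentences in the question, assuming spaCy v3. It uses a blank English pipeline (tokenizer only, no trained model) and the `Matcher` class. The pattern names `EVENT` and `PERSON_TEAM`, the helper `extract`, and the assumption that the player name is a run of title-cased tokens followed by " - team - " are all illustrative choices, not something the original answer specifies. It covers only type, action, and person; the "reason" field would need further rules.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
matcher = Matcher(nlp.vocab)

# "Penalty saved!" / "Penalty missed!" at the start of a comment.
matcher.add("EVENT", [[
    {"LOWER": "penalty"},
    {"LOWER": {"IN": ["saved", "missed"]}},
    {"ORTH": "!"},
]])

# Title-cased name, a dash, the team name, and a closing dash.
# greedy="LONGEST" keeps only the longest of the overlapping matches.
matcher.add("PERSON_TEAM", [[
    {"IS_TITLE": True, "OP": "+"},
    {"ORTH": "-"},
    {"IS_ALPHA": True, "OP": "+"},
    {"ORTH": "-"},
]], greedy="LONGEST")


def extract(text):
    """Return whichever of type/action/person the patterns can find."""
    doc = nlp(" ".join(text.split()))  # collapse the doubled spaces first
    fields = {}
    for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
        label = nlp.vocab.strings[match_id]
        span = doc[start:end]
        if label == "EVENT":
            fields["type"] = span[0].lower_
            fields["action"] = span[1].lower_
        elif label == "PERSON_TEAM" and "person" not in fields:
            first_dash = next(t.i for t in span if t.text == "-")
            fields["person"] = doc[start:first_dash].text
    return fields


print(extract(
    "Penalty saved! Stefan Panic  - Riga FC -  fails to capitalise on this "
    "great opportunity,  shot with right foot saved  in the centre of the goal."
))
```

Patterns like these remain rules, so they share some of the brittleness of regular expressions, but they operate on tokens rather than raw characters and can later be combined with a trained pipeline if one is added.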