Converting paragraphs into sentences

data-mining nlp spacy tokenization information-extraction

2021-10-06 02:51:03

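I am looking for a way to extract sentences from paragraphs of text that contain different kinds of punctuation. I started with SpaCy's Sentencizer.

Sample input, a Python list named abstracts: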

["A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study. Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively. Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp., Cryptococcus ssp., Trichosporon spp., Aspergillus spp., Fusarium spp., Pythium spp., and Sporothrix spp., with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity."]

Code:

from spacy.lang.en import English

nlp = English()
# spaCy 3.x: add the rule-based sentencizer by name
# (spaCy 2.x: nlp.add_pipe(nlp.create_pipe("sentencizer")))
nlp.add_pipe("sentencizer")

# print each sentence of the first few abstracts
for abstract in abstracts[:5]:
    doc = nlp(abstract)
    for sent in doc.sents:
        print(sent)

Output:

A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study.
Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively.
Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp.,
Cryptococcus ssp.,
Trichosporon spp.,
Aspergillus spp.,
Fusarium spp.,
Pythium spp.,
and Sporothrix spp.,
with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity.

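It works for ordinary text, but it fails when a period (.) appears anywhere in a sentence other than at the end. That breaks the sentence apart, as the output above shows. How can we fix this? Are there other proven methods or libraries for this task?

2 Answers

Spacy's Sentencizer is very simple. However, Spacy 3.0 includes SentenceRecognizer, which is basically a trainable sentence tagger and should perform better. There is an issue describing the details of its inception. If you have data that is already segmented into sentences, you can train it.

Another option is NLTK's sent_tokenize, which should give better results than Spacy's Sentencizer. I tested it on your example and it works well.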

from nltk.tokenize import sent_tokenize
sent_tokenize("A total....")

Finally, if sent_tokenize does not handle certain abbreviations well and you have a list of abbreviations to support (such as "spp." in your example), you can use NLTK's PunktSentenceTokenizer:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Punkt looks abbreviations up lowercased and without the trailing period
abbreviation = ['spp']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize("A total ....")

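Nothing in SpaCy handles this out of the box. However, it does let you use custom components.

I see at least three ways to solve your problem.

NLTK

NLTK lets you add known abbreviations as exceptions. See this StackOverflow post.

Using regular expressions

Since your problem is that some periods should not mark the start of a new sentence, you can customize a basic regex to include that behavior. Here is a StackOverflow answer that can help you get started.

Post-processing

You can also use SpaCy's default segmentation and then merge sentences whenever one ends with a known abbreviation. It is not very elegant, but it will work.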
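As a rough sketch of the regex idea (this is not the code from the linked answer): split wherever sentence-ending punctuation is followed by whitespace and a capital letter, but skip the split when the word just before it is a known abbreviation. The function name split_sentences and the abbreviation set are assumptions made for illustration, and only the standard-library re module is used.

```python
import re

# A candidate boundary: whitespace that follows ., ! or ? and precedes a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Hypothetical abbreviation list; extend it for your own corpus.
ABBREVIATIONS = frozenset({"spp.", "ssp."})


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, ignoring periods that end a known abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        words = text[start:match.start()].split()
        # Do not split if the word before the boundary is an abbreviation.
        if words and words[-1].lower() in abbreviations:
            continue
        sentences.append(text[start:match.start()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Note that a period followed directly by a comma, as in "Candida spp., Cryptococcus ssp.," in your abstract, is never treated as a boundary by this pattern. The abbreviation list only matters when the abbreviation is followed by whitespace and a capitalized word, and in that case a genuine sentence end right after an abbreviation will be missed.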
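A minimal sketch of that merge step, assuming the sentences have already been converted to plain strings (for example with [sent.text for sent in doc.sents]). The helper name merge_on_abbreviations and the abbreviation set are hypothetical. Trailing commas are stripped before the check because the Sentencizer output in the question ends fragments with "spp.,".

```python
def merge_on_abbreviations(sentences, abbreviations):
    """Re-join fragments that were split right after a known abbreviation."""
    merged = []
    join_next = False
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue
        if join_next and merged:
            # The previous fragment ended with an abbreviation: glue this one on.
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
        # Last whitespace-separated token, ignoring a trailing comma, semicolon or colon.
        last_token = merged[-1].split()[-1].rstrip(",;:")
        join_next = last_token.lower() in abbreviations
    return merged


fragments = [
    "Activity was highlighted against Candida spp.,",
    "Cryptococcus ssp.,",
    "and Sporothrix spp.,",
    "with low MIC values.",
    "In conclusion, the activity is broad.",
]
for sentence in merge_on_abbreviations(fragments, {"spp.", "ssp."}):
    print(sentence)
```

This only repairs over-splitting. It cannot separate two sentences that the segmenter left joined, such as the last line of the output in the question.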
