I am looking for a way to extract sentences from paragraphs of text that contain many different kinds of punctuation. I started with spaCy's Sentencizer.
Example input, a Python list named abstracts:
["A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study. Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively. Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp., Cryptococcus ssp., Trichosporon spp., Aspergillus spp., Fusarium spp., Pythium spp., and Sporothrix spp., with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity."]
Code:
from spacy.lang.en import English

nlp = English()
# spaCy v2 API; in spaCy v3 this becomes nlp.add_pipe("sentencizer")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

# split each abstract into sentences and print them
for doc in abstracts[:5]:
    do = nlp(doc)
    for sent in do.sents:
        print(sent)
Output:
A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study.
Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively.
Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp.,
Cryptococcus ssp.,
Trichosporon spp.,
Aspergillus spp.,
Fusarium spp.,
Pythium spp.,
and Sporothrix spp.,
with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity.
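This works well for ordinary text, but it fails when a period (.) appears anywhere other than at the end of a sentence: the sentence gets broken apart, as the output above shows. How can this be fixed? Are there other proven methods or libraries for this task?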
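
For comparison, here is a rough sketch of the behaviour I am after, written in plain Python without spaCy. It treats a period, exclamation mark, or question mark as a sentence boundary only when it is followed by whitespace and then an uppercase letter or an opening bracket, and it skips the boundary when the preceding token is a known abbreviation. The names ABBREVIATIONS, BOUNDARY, and split_sentences are made up for illustration, and the abbreviation list is an assumption that would need extending for a real corpus.

```python
import re

# Hypothetical abbreviation list, for illustration only; extend it as needed.
ABBREVIATIONS = {"spp.", "ssp.", "sp.", "e.g.", "i.e.", "al.", "vs.", "fig."}

# Candidate boundary: whitespace preceded by . ! or ? and followed by
# an uppercase letter or an opening bracket.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\[(])")


def split_sentences(text):
    """Split text into sentences, ignoring periods that end known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            # The period belongs to an abbreviation, so keep scanning.
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = (
        "Activity was highlighted against Candida spp., Cryptococcus ssp., "
        "and Sporothrix spp., with MIC values lower than 64 mug/mL. "
        "In conclusion, Eb has a broad spectrum of activity."
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

On the sample above this keeps the "spp.," list inside one sentence, because a period followed by a comma is never treated as a boundary. It is still a heuristic: a sentence that really does end in an abbreviation and is followed by a capitalised word would be merged with the next one.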