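From what I understand of the API documentation, the multi-word expression tokenizer `nltk.tokenize.mwe` merges tokens in an already tokenized string, based on a lexicon of multi-word expressions that you supply.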
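As a rough illustration of that lexicon-based merging, here is a minimal sketch using `MWETokenizer`. The lexicon entry and the sentence are picked only for illustration, and a plain `split()` stands in for a real tokenizer:

```python
from nltk.tokenize import MWETokenizer

# The lexicon of multi-word expressions; each entry is a tuple of tokens.
tokenizer = MWETokenizer([("Information", "Technology")], separator="_")

# The input must already be tokenized; split() is used here only for brevity.
tokens = "He studies Information Technology".split()
merged = tokenizer.tokenize(tokens)
print(merged)
```

Only expressions listed in the lexicon are merged, so this approach cannot discover phrases you did not anticipate.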
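One thing you can do instead is tokenize the text, tag every word with its part-of-speech (PoS) tag, and then define regular expressions over the PoS tags to extract interesting key phrases.
For instance, here is an example adapted from the NLTK Book, Chapter 7, and this blog post: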
# Requires the NLTK tokenizer and tagger data (see nltk.download).
import nltk
from nltk import pos_tag, tokenize

def extract_phrases(my_tree, phrase):
    # Recursively collect every subtree whose label matches `phrase`.
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))
    for child in my_tree:
        # Leaves are (word, tag) tuples; only subtrees are descended into.
        if isinstance(child, nltk.Tree):
            my_phrases.extend(extract_phrases(child, phrase))
    return my_phrases

def main():
    sentences = ["The little yellow dog barked at the cat",
                 "He studies Information Technology"]
    # An NP is an optional determiner, any adjectives and a noun,
    # or a run of proper nouns.
    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)
    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print(phrase, "_".join([word for word, _tag in phrase.leaves()]))

if __name__ == "__main__":
    main()
You define a grammar based on regular expressions over PoS tags:
grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
cp = nltk.RegexpParser(grammar)
Then you apply it to a tokenized and tagged sentence, which produces a Tree:
sentence = pos_tag(tokenize.word_tokenize(x))
tree = cp.parse(sentence)
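To see what the chunker returns without depending on the tagger models, you can feed it hand-tagged tokens. This is only a sketch: the `(word, tag)` pairs below are written by hand as a stand-in for `pos_tag` output, not produced by the tagger.

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
cp = nltk.RegexpParser(grammar)

# Hand-written (word, tag) pairs standing in for pos_tag output (assumption).
tagged = [("He", "PRP"), ("studies", "VBZ"),
          ("Information", "NNP"), ("Technology", "NNP")]
tree = cp.parse(tagged)

# Top-level children are either plain (word, tag) tuples or chunk subtrees.
for child in tree:
    if isinstance(child, nltk.Tree):
        print(child.label(), child.leaves())
    else:
        print(child)
```

Tokens that match no chunk rule stay in the tree as plain tuples, while each matched span becomes a subtree labeled `NP`.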
Then you use `extract_phrases(my_tree, phrase)` to walk the Tree recursively and extract the sub-trees labeled NP. The example above extracts the following noun phrases:
Noun phrases:
(NP The/DT little/JJ yellow/JJ dog/NN) The_little_yellow_dog
(NP the/DT cat/NN) the_cat
Noun phrases:
(NP Information/NNP Technology/NNP) Information_Technology
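If you would rather not write the recursion yourself, a possible alternative is the `subtrees` method of `nltk.Tree`, which accepts a filter function. The sketch below again uses hand-tagged tokens as an assumption, so that it runs without the tagger models:

```python
import nltk

cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>|<NNP>*}")

# Hand-written (word, tag) pairs standing in for pos_tag output (assumption).
tagged = [("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = cp.parse(tagged)

# subtrees() walks the whole tree; the filter keeps only NP chunks.
phrases = ["_".join(word for word, _tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(phrases)
```

For a flat chunk grammar like this one, the result should match what `extract_phrases` collects.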
Burton DeWilde has a great blog post covering more ways to extract interesting keyphrases: Intro to Automatic Keyphrase Extraction