Is there a good stemmer for German?

data-mining nlp nltk stemming
2021-10-11 09:09:43

What I tried:

# -*- coding: utf-8 -*-

from nltk.stem.snowball import GermanStemmer
st = GermanStemmer()

token_groups = [(["experte", "Experte", "Experten", "Expertin", "Expertinnen"], []),
                (["geh", "gehe", "gehst", "geht", "gehen", "gehend"], []),
                (["gebäude", "Gebäude", "Gebäudes"], []),
                (["schön", "schöner", "schönsten"], ["schon"])]
header = "{:<15} [best expected: n/n| best variants: 1/n | overlap: m]: ...".format("name")
print(header)
print('-' * len(header))
for token_group, different_tokens in token_groups:
    stemmed_tokens = [st.stem(token) for token in token_group]
    different_tokens = [st.stem(token) for token in different_tokens]
    nb_expected = sum(1 for token in stemmed_tokens if token == token_group[0])
    nb_variants = len(set(stemmed_tokens))
    overlap = set(stemmed_tokens).intersection(set(different_tokens))
    print("{:<15} [as expected: {}/{}| variants: {}/{} | overlap: {}]: {}".format(token_group[0], nb_expected, len(token_group), nb_variants, len(token_group), len(overlap), stemmed_tokens))

What I got:

experte  [as expected: 0/5| variants: 3/5 | overlap: 0]: ['expert', 'expert', 'expert', 'expertin', 'expertinn']
geh      [as expected: 3/6| variants: 4/6 | overlap: 0]: ['geh', 'geh', 'gehst', 'geht', 'geh', 'gehend']
gebäude  [as expected: 0/3| variants: 1/3 | overlap: 0]: ['gebaud', 'gebaud', 'gebaud']
schön    [as expected: 0/3| variants: 1/3 | overlap: 1]: ['schon', 'schon', 'schon']

The two main problems are:

  • Clashes: schön != schon
  • Stems that don't work, e.g. [experte, Expertin, Expertinnen], [ich gehe, du gehst, er geht]

A less severe problem is that of matching my expectations: the analysis would be easier if the stemmer actually reduced words to their base form, not just to a stem.

More examples

Clashes (reproduced in the quick check after these examples)

  • input -> output != word it clashes with
  • mittels -> mittel != "Das Mittel"

Unmet expectations

  • input -> output / expected

  • Mädchen -> madch / Mädchen

  • Behaarung -> behaar / Behaarung
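
For reference, both clashes can be reproduced directly with the Snowball stemmer used above (a minimal check, reusing the GermanStemmer from the snippet):

from nltk.stem.snowball import GermanStemmer
st = GermanStemmer()

# the adverb "schon" and the adjective "schön" collapse onto the same stem
print(st.stem("schön"), st.stem("schon"))      # -> schon schon
# the preposition "mittels" collides with the noun "Mittel"
print(st.stem("mittels"), st.stem("Mittel"))   # -> mittel mittel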
3 Answers

A big problem and a very good question!

I have used spaCy in the past; it has a German module which, as far as I know, does not support stemming, but does support lemmatization.

Looking at the spaCy output below, I honestly don't think it will solve your problem, but I wanted to make you aware of the option anyway.

spaCy lemmatization:

#pip install spacy
#python -m spacy download de

import spacy
nlp = spacy.load('de_core_news_sm')

mywords = "Das ist schon sehr schön mit den Expertinnen und Experten"

for t in nlp.tokenizer(mywords):
    print("Tokenized: %s | Lemma: %s" %(t, t.lemma_))

Result:

Tokenized: Das | Lemma: der
Tokenized: ist | Lemma: sein
Tokenized: schon | Lemma: schon
Tokenized: sehr | Lemma: sehr
Tokenized: schön | Lemma: schön
Tokenized: mit | Lemma: mit
Tokenized: den | Lemma: der
Tokenized: Expertinnen | Lemma: Expertinnen
Tokenized: und | Lemma: und
Tokenized: Experten | Lemma: Experte
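
One thing to keep in mind: the snippet above only runs the tokenizer. In spaCy 3.x the lemmatizer is a pipeline component, so lemmas are only reliably filled in when the whole pipeline is run. A minimal variant of the same check (assuming de_core_news_sm is installed):

import spacy

nlp = spacy.load('de_core_news_sm')

# run the full pipeline (tagger + lemmatizer), not just the tokenizer
doc = nlp("Das ist schon sehr schön mit den Expertinnen und Experten")
for t in doc:
    print("Tokenized: %s | Lemma: %s" % (t.text, t.lemma_))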

This question is almost 2 years old, but I imagine many people are still struggling with the same problem.

Many people use TreeTagger for this. TreeTagger does POS tagging and lemmatization, but you have to install TreeTagger manually (which is easy to do), and if you are working with Python you also need to install the Python wrapper:

import treetaggerwrapper
import nltk
from pprint import pprint

tree_tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')

sent = "Das ist schon sehr schön mit den Expertinnen und Experten."

words = nltk.word_tokenize(sent)
tags = tree_tagger.tag_text(words,tagonly=True) #don't use the TreeTagger's tokenization!
nice_tags = treetaggerwrapper.make_tags(tags)
pprint(nice_tags)

This gives the following result:

[Tag(word='Das', pos='PDS', lemma='die'),
 Tag(word='ist', pos='VAFIN', lemma='sein'),
 Tag(word='schon', pos='ADV', lemma='schon'),
 Tag(word='sehr', pos='ADV', lemma='sehr'),
 Tag(word='schön', pos='ADJD', lemma='schön'),
 Tag(word='mit', pos='APPR', lemma='mit'),
 Tag(word='den', pos='ART', lemma='die'),
 Tag(word='Expertinnen', pos='NN', lemma='Expertin'),
 Tag(word='und', pos='KON', lemma='und'),
 Tag(word='Experten', pos='NN', lemma='Experte'),
 Tag(word='.', pos='$.', lemma='.')]
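
Since make_tags returns named tuples, the lemmas can be pulled out directly for further analysis (a small usage note based on the nice_tags list above):

# each Tag is a namedtuple with word, pos and lemma fields
lemmas = [tag.lemma for tag in nice_tags]
print(lemmas)
# ['die', 'sein', 'schon', 'sehr', 'schön', 'mit', 'die', 'Expertin', 'und', 'Experte', '.']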

As an alternative, you can use HanTa ( https://github.com/wartaal/HanTa ):

#pip install HanTa
from HanTa import HanoverTagger as ht
import nltk
from pprint import pprint

tagger = ht.HanoverTagger('morphmodel_ger.pgz')

sent = "Das ist schon sehr schön mit den Expertinnen und Experten."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words,taglevel= 1)
pprint(lemmata)

For the given sentence, this produces essentially the same result as before:

[('Das', 'das', 'PDS'),
 ('ist', 'sein', 'VAFIN'),
 ('schon', 'schon', 'ADV'),
 ('sehr', 'sehr', 'ADV'),
 ('schön', 'schön', 'ADJD'),
 ('mit', 'mit', 'APPR'),
 ('den', 'den', 'ART'),
 ('Expertinnen', 'Expertin', 'NN'),
 ('und', 'und', 'KON'),
 ('Experten', 'Experte', 'NN'),
 ('.', '--', '$.')]

Finally, you could use a tool called GermaLemma; however, it requires you to first run another tool for POS tagging.
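
A rough sketch of how GermaLemma might be combined with STTS POS tags such as the ones TreeTagger produces above (the germalemma package and its find_lemma lookup are assumed here; it only covers nouns, verbs, adjectives and adverbs):

#pip install germalemma
from germalemma import GermaLemma

lemmatizer = GermaLemma()

# find_lemma needs the word plus an STTS POS tag and only handles
# nouns (N*), verbs (V*), adjectives (ADJ*) and adverbs (ADV)
print(lemmatizer.find_lemma('Expertinnen', 'NN'))   # hoped-for result: Expertin
print(lemmatizer.find_lemma('schönsten', 'ADJA'))   # hoped-for result: schön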

NLTK has an implementation of a stemmer specifically for German, called Cistem. I believe it was added in NLTK version 3.4.
While the results for your examples only look slightly better, this stemmer is at least more consistent than the Snowball stemmer, and many of your examples are reduced to similar stems.

Umlauts are still removed, and the stems are shortened (in particular, the prefix ge- is sometimes removed as well), which may not match your expectations. The results below come from running Cistem(case_insensitive=True):
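
The only change compared to the snippet from the question is to construct a Cistem instance instead of the GermanStemmer (a sketch, assuming NLTK >= 3.4):

from nltk.stem.cistem import Cistem

# drop-in replacement for GermanStemmer() in the question's snippet
st = Cistem(case_insensitive=True)

for word in ["Expertinnen", "gehst", "Gebäudes", "schönsten"]:
    print(word, "->", st.stem(word))
# Expertinnen -> expertinn, gehst -> geh, Gebäudes -> baud, schönsten -> schon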

name            [best expected: n/n| best variants: 1/n | overlap: m]: ...
--------------------------------------------------------------------------
experte         [as expected: 0/5| variants: 4/5 | overlap: 0]: ['exper', 'expert', 'expert', 'experti', 'expertinn']
geh             [as expected: 5/6| variants: 2/6 | overlap: 0]: ['geh', 'geh', 'geh', 'geh', 'geh', 'hend']
gebäude         [as expected: 0/3| variants: 1/3 | overlap: 0]: ['baud', 'baud', 'baud']
schön           [as expected: 0/3| variants: 2/3 | overlap: 1]: ['schon', 'schoner', 'schon']

Edit: I should also add that this implementation is based on the work of Leonie Weissweiler and Alexander Fraser, whose paper also includes a good comparison of various stemmers.