What I tried:
# -*- coding: utf-8 -*-
from nltk.stem.snowball import GermanStemmer

st = GermanStemmer()

# Each entry: (forms that should share one stem, forms that must NOT share it)
token_groups = [(["experte", "Experte", "Experten", "Expertin", "Expertinnen"], []),
                (["geh", "gehe", "gehst", "geht", "gehen", "gehend"], []),
                (["gebäude", "Gebäude", "Gebäudes"], []),
                (["schön", "schöner", "schönsten"], ["schon"])]

header = "{:<15} [best expected: n/n| best variants: 1/n | overlap: m]: ...".format("name")
print(header)
print('-' * len(header))
for token_group, different_tokens in token_groups:
    stemmed_tokens = [st.stem(token) for token in token_group]
    different_tokens = [st.stem(token) for token in different_tokens]
    # How many stems equal the first (expected) form of the group
    nb_expected = sum(1 for token in stemmed_tokens if token == token_group[0])
    # Number of distinct stems produced for the group (ideal: 1)
    nb_variants = len(set(stemmed_tokens))
    # Stems shared with words that should stay distinct (ideal: none)
    overlap = set(stemmed_tokens).intersection(set(different_tokens))
    print("{:<15} [as expected: {}/{}| variants: {}/{} | overlap: {}]: {}".format(
        token_group[0], nb_expected, len(token_group), nb_variants,
        len(token_group), len(overlap), stemmed_tokens))
What I got:
experte [as expected: 0/5| variants: 3/5 | overlap: 0]: ['expert', 'expert', 'expert', 'expertin', 'expertinn']
geh [as expected: 3/6| variants: 4/6 | overlap: 0]: ['geh', 'geh', 'gehst', 'geht', 'geh', 'gehend']
gebäude [as expected: 0/3| variants: 1/3 | overlap: 0]: ['gebaud', 'gebaud', 'gebaud']
schön [as expected: 0/3| variants: 1/3 | overlap: 1]: ['schon', 'schon', 'schon']
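The two main problems are:
- Overlap: schön != schon
- Stems that do not work, i.e. related forms are not merged, e.g. [experte, Expertin, Expertinnen], [ich gehe, du gehst, er geht]
A less severe problem is that the stems do not match the forms I expected. If the stemmer could actually reduce words to their base form (not just a stem), the results would be much easier to analyze.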
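To make the difference between a stem and a base form concrete, here is a minimal sketch that builds a lookup table from the hand-labelled groups above. It assumes the first entry of each group is the desired base form; the names `TOKEN_GROUPS`, `build_base_form_table` and `to_base_form` are made up for this illustration and are not part of NLTK.

```python
# Hand-labelled groups from the post. Assumption of this sketch: the first
# entry of each group is the base form all other entries should map to.
TOKEN_GROUPS = [
    ["experte", "Experte", "Experten", "Expertin", "Expertinnen"],
    ["geh", "gehe", "gehst", "geht", "gehen", "gehend"],
    ["gebäude", "Gebäude", "Gebäudes"],
    ["schön", "schöner", "schönsten"],
]


def build_base_form_table(groups):
    """Map every surface form to the first entry of its group."""
    table = {}
    for group in groups:
        base = group[0]
        for form in group:
            table[form] = base
    return table


def to_base_form(token, table, fallback=None):
    """Return the labelled base form, else fallback(token), else the token."""
    if token in table:
        return table[token]
    if fallback is not None:
        return fallback(token)
    return token


BASE_FORMS = build_base_form_table(TOKEN_GROUPS)
mapped = [to_base_form(t, BASE_FORMS) for t in ["Expertinnen", "gehst", "schöner", "schon"]]
print(mapped)  # ['experte', 'geh', 'schön', 'schon']
```

A table like this obviously only covers words that were labelled by hand. For everything else, `fallback` could be set to `st.stem` from the snippet above, so unknown words still get stemmed.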
More examples
Collisions
- Input -> Output != Collision
- mittels -> mittel != "Das Mittel"
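Collisions like these can be listed mechanically. The sketch below groups words by stem and reports every stem that is shared by more than one labelled group. To keep it runnable without NLTK, the stems are copied from the output shown above (with "schon" -> "schon" inferred from the reported overlap of 1); `find_collisions`, `RECORDED_STEMS` and `groups` are hypothetical names for this example only.

```python
from collections import defaultdict

# Stems copied from the output in the post. "schon" -> "schon" is inferred
# from the overlap count of 1 reported for the "schön" group.
RECORDED_STEMS = {
    "schön": "schon", "schöner": "schon", "schönsten": "schon",
    "schon": "schon",
    "gebäude": "gebaud", "Gebäude": "gebaud", "Gebäudes": "gebaud",
}


def find_collisions(groups, stem):
    """Return {stem: sorted group labels} for stems shared by several groups."""
    labels_by_stem = defaultdict(set)
    for label, words in groups.items():
        for word in words:
            labels_by_stem[stem(word)].add(label)
    return {s: sorted(labels)
            for s, labels in labels_by_stem.items()
            if len(labels) > 1}


# Words that belong together share a label; different labels should never
# end up with the same stem.
groups = {
    "schön": ["schön", "schöner", "schönsten"],
    "schon": ["schon"],
    "gebäude": ["gebäude", "Gebäude", "Gebäudes"],
}

collisions = find_collisions(groups, RECORDED_STEMS.__getitem__)
print(collisions)  # {'schon': ['schon', 'schön']}
```

With NLTK installed, passing `st.stem` instead of the recorded lookup should run the same check against the real stemmer.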
Unmet expectations
- Input -> Output / Expected
- Mädchen -> madch / Mädchen
- Behaarung -> behaar / Behaarung
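One rough way to describe these mismatches: in both cases the output looks like a truncated version of the expected word after lowercasing and stripping umlauts. The sketch below checks exactly that for the two examples. The folding rule in `fold` is an assumption made for comparison purposes only, not a description of what the stemmer does internally, and `CASES` and `classify` are hypothetical names.

```python
# (input, stemmer output, expected) triples copied from the list above.
CASES = [
    ("Mädchen", "madch", "Mädchen"),
    ("Behaarung", "behaar", "Behaarung"),
]

# Folding used only for this comparison (an assumption of the sketch).
FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def fold(word):
    """Lowercase a word and replace umlauts and ß with ASCII letters."""
    return word.lower().translate(FOLD)


def classify(output, expected):
    """Label an output as 'match', 'truncated' or 'different'."""
    if output == expected:
        return "match"
    if fold(expected).startswith(output):
        return "truncated"
    return "different"


results = {word: classify(out, exp) for word, out, exp in CASES}
print(results)  # {'Mädchen': 'truncated', 'Behaarung': 'truncated'}
```

So for these two words the stemmer loses case, the umlaut and the suffix, which is why the output is hard to read as a base form.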