Like you, I wasn't thrilled with the quality of the syllable-counting functions I could find online, so here is my take:
import re

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)
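As a quick sanity check, running the function over a few of the words named in the regex comments should give the expected counts:

for w in ["smite", "nicely", "smile", "piano", "evaluate"]:
    print(w, count_syllables(w))
# smite 1, nicely 2, smile 1, piano 3, evaluate 4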
We avoid looping in pure Python; at the same time, the regexes should be reasonably easy to follow.

This performs better than the various snippets I found online (including the fallbacks used by Pyphen and Syllapy). It gets more than 90% of cmudict right (and the mistakes I've found are understandable):
import nltk

cd = nltk.corpus.cmudict.dict()
sum(
    1 for word, pron in cd.items()
    if count_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
) / len(cd)
# 0.9073751569397757
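The inner sum(1 for p in x if p[-1].isdigit()) is how the reference count is derived: each cmudict pronunciation is a list of ARPAbet phonemes, and only the vowel phonemes carry a trailing stress digit (0, 1, or 2), so counting digit-suffixed phonemes counts syllables. For example, assuming the standard cmudict entry:

cd["water"]
# [['W', 'AO1', 'T', 'ER0']]
sum(1 for p in cd["water"][0] if p[-1].isdigit())
# 2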
For comparison, Pyphen gets 53.8% and the syllable function from the other answer gets 83.7%.

Here are some frequently occurring words it gets wrong:
from collections import Counter

for word, _ in Counter(nltk.corpus.brown.words()).most_common(1000):
    word = word.lower()
    if word in cd and count_syllables(word) not in (sum(1 for p in x if p[-1].isdigit()) for x in cd[word]):
        print(word)
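If you are reproducing these numbers from scratch, both NLTK corpora need a one-time download first:

nltk.download("cmudict")  # pronunciation dictionary used as ground truth
nltk.download("brown")    # word-frequency source for the error analysis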