Like you, I wasn't thrilled with the quality of the syllable-counting functions I could find online, so here is my take:
import re

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)
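As a quick sanity check, running the function over a few of the words named in the regex comments should give the expected counts:

for w in ["smite", "nicely", "smile", "piano", "evaluate"]:
    print(w, count_syllables(w))
# smite 1, nicely 2, smile 1, piano 3, evaluate 4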
We avoid looping in pure Python; at the same time, the regexes should be reasonably easy to follow.

This performs better than the various snippets I found online (including the fallbacks used by Pyphen and Syllapy). It gets more than 90% of cmudict right (and the mistakes I've found are understandable):
import nltk

cd = nltk.corpus.cmudict.dict()
sum(
    1 for word, pron in cd.items()
    if count_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
) / len(cd)
# 0.9073751569397757
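The inner sum(1 for p in x if p[-1].isdigit()) is how the reference count is derived: each cmudict pronunciation is a list of ARPAbet phonemes, and only the vowel phonemes carry a trailing stress digit (0, 1, or 2), so counting digit-suffixed phonemes counts syllables. For example, assuming the standard cmudict entry:

cd["water"]
# [['W', 'AO1', 'T', 'ER0']]
sum(1 for p in cd["water"][0] if p[-1].isdigit())
# 2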
For comparison, Pyphen gets 53.8% and the syllable function from the other answer gets 83.7%.

Here are some frequently occurring words it gets wrong:
from collections import Counter

for word, _ in Counter(nltk.corpus.brown.words()).most_common(1000):
    word = word.lower()
    if word in cd and count_syllables(word) not in (sum(1 for p in x if p[-1].isdigit()) for x in cd[word]):
        print(word)
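If you are reproducing these numbers from scratch, both NLTK corpora need a one-time download first:

nltk.download("cmudict")  # pronunciation dictionary used as ground truth
nltk.download("brown")    # word-frequency source for the error analysis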