Understanding the Kneser-Ney formula for implementation

data-mining  nlp  mathematics  language-model  ngram
2022-02-12 20:30:46

I am trying to implement this formula in Python:

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\left(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\right)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

where

$$c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation count}(\cdot) & \text{otherwise.} \end{cases}$$
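For concreteness, I store counts in a dict mapping an order n to a dict of n-gram tuples and their frequencies (the same layout as the `ngram_dict` in my code further down). With that layout, my reading of $c_{KN}$ is roughly the sketch below; the helper names and the `at_highest_order` flag are mine, and I may be misreading the case split:

def continuation_count(ngram, ngram_dict):
    # Number of distinct words w such that (w,) + ngram was seen in the corpus.
    # Assumes ngram_dict[len(ngram) + 1] holds the counts one order up.
    higher_order = ngram_dict[len(ngram) + 1]
    return sum(1 for key in higher_order if key[1:] == ngram)

def c_kn(ngram, ngram_dict, at_highest_order):
    # c_KN: the raw count at the top level of the recursion,
    # the continuation count everywhere below it.
    if at_highest_order:
        return ngram_dict[len(ngram)].get(ngram, 0)
    return continuation_count(ngram, ngram_dict)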

Following this link, I was able to understand how to implement the first half of the equation, i.e.

$$\frac{\max\!\left(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\right)}{c_{KN}(w_{i-n+1}^{i-1})}$$
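Using the `c_kn` helper from the sketch above, that first half would just be the discounted ratio below (again my interpretation; I return 0 when the context was never seen, which may not be the right backoff behaviour):

def first_term(ngram, ngram_dict, at_highest_order, d=0.75):
    # max(c_KN(w_{i-n+1}^i) - d, 0) / c_KN(w_{i-n+1}^{i-1})
    numerator = max(c_kn(ngram, ngram_dict, at_highest_order) - d, 0)
    denominator = c_kn(ngram[:-1], ngram_dict, at_highest_order)
    return numerator / denominator if denominator > 0 else 0.0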

But the second half, in particular the $\lambda(w_{i-n+1}^{i-1})$ term, is what confuses me. The author in the link states that

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{ w : c(w_{i-1}, w) > 0 \}\right|$$

but that when we are at the highest order, the discounting factor is zero. The author goes on to say: "The denominator of the fraction is, in the special case of 2-grams, the frequency of the semi-final (second-to-last) word, but in the recursive scheme we are developing we should consider the whole string before the final word (well, in the 2-gram case the semi-final word is the whole string). The term to the right of the fraction is the number of distinct final-word types (not frequencies) that appear after that string."

He goes on to give an example, but it makes no sense to me. I really cannot find any good material explaining how to compute the $\lambda(w_{i-n+1}^{i-1})$ term.
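For what it is worth, my best guess at that term, reading the formula literally with the same count layout and helpers as above, is the sketch below: d divided by $c_{KN}$ of the context, times the number of distinct word types that follow the context. I am not at all sure this is what the author means:

def lambda_weight(context, ngram_dict, at_highest_order, d=0.75):
    # |{w : c(context, w) > 0}|: how many different word types were
    # seen following this context, taken from the order one level up.
    followers = ngram_dict[len(context) + 1]
    follower_types = {key[-1] for key in followers if key[:-1] == context}
    denominator = c_kn(context, ngram_dict, at_highest_order)
    return (d / denominator) * len(follower_types) if denominator > 0 else 0.0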

Any suggestions or material on this would help. Keep in mind that I am doing this numerically, so I need the equation broken down far enough that I can code it, which means I need to understand what each term is.

Following the link above, this is the preliminary implementation I came up with:

def kneser_ney_smoothing(previous_tokens, ngram_dict, discounting_factor=0.75):
    suggestions = []

    # Start with previous_tokens (the user input)
    previous_ngram = tuple(previous_tokens)
    previous_ngram_minus_last_word = previous_ngram[:-1]  # the context w_{i-n+1}^{i-1}
    len_previous_ngram = len(previous_ngram)

    # Pull the n-gram count maps for the highest and second-highest orders
    highest_order_ngrams_map = ngram_dict[len_previous_ngram]
    second_highest_order_ngrams_map = ngram_dict[len_previous_ngram - 1]

    # Check if the user input is in the highest-order n-gram map
    if previous_ngram in highest_order_ngrams_map:
        # discounting_factor = 0  # From the link: if the user's input is in the highest-order
        #                         # n-grams then d is 0; I found this nowhere in the literature

        # First term: max(c(w_{i-n+1}^i) - d, 0) / c(w_{i-n+1}^{i-1})
        first_num = max(highest_order_ngrams_map[previous_ngram] - discounting_factor, 0)
        first_denom = second_highest_order_ngrams_map[previous_ngram_minus_last_word]

        # Lambda: d over the frequency of the second-to-last word, times the number of
        # distinct words that follow the context in the highest-order n-grams
        lower_order_words = [word for key in second_highest_order_ngrams_map for word in key]
        lamb_denom = lower_order_words.count(previous_ngram[-2])
        words_after_context = {key[-1] for key in highest_order_ngrams_map
                               if key[:-1] == previous_ngram_minus_last_word}
        lamb = (discounting_factor / lamb_denom) * len(words_after_context)

        # Continuation probability of the final word: number of highest-order n-grams that
        # end in this word, over the total number of distinct highest-order n-grams
        pcont_num = sum(1 for key in highest_order_ngrams_map if key[-1] == previous_ngram[-1])
        pcont_denom = len(highest_order_ngrams_map)
        pcont = pcont_num / pcont_denom

        return first_num / first_denom + lamb * pcont

    else:
        pass

I was able to match his corpus and numerical example, so I should be on the right track; I am just not sure about the recursive part.
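For the recursion specifically, this is how I picture it fitting together, reusing the helpers from the sketches above. The base case (falling back to a continuation-unigram distribution when the context is empty) and the full backoff on an unseen context are my own guesses, not something the link spells out:

def p_kn(word, context, ngram_dict, d=0.75, at_highest_order=True):
    # Base case: empty context -> continuation-unigram probability, i.e. the
    # fraction of distinct bigram types that end in this word (my assumption).
    if len(context) == 0:
        bigrams = ngram_dict[2]
        ending_in_word = sum(1 for key in bigrams if key[-1] == word)
        return ending_in_word / len(bigrams) if bigrams else 0.0

    # If the context was never seen at this order, back off entirely (my assumption).
    if c_kn(context, ngram_dict, at_highest_order) == 0:
        return p_kn(word, context[1:], ngram_dict, d, at_highest_order=False)

    ngram = context + (word,)
    first = first_term(ngram, ngram_dict, at_highest_order, d)
    lam = lambda_weight(context, ngram_dict, at_highest_order, d)
    # Everything below the top level uses continuation counts (the "otherwise" case of c_KN).
    return first + lam * p_kn(word, context[1:], ngram_dict, d, at_highest_order=False)

So with a trigram model I would call it as p_kn(word, (w1, w2), ngram_dict) and expect two levels of recursion plus the unigram base case. Is that the right way to read the recursion?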
