Understanding the Kneser-Ney formula for implementation

data-mining  nlp  mathematics  language-model  ngram
2022-02-12 20:30:46

I am trying to implement this formula in Python:

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\left(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\right)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

where

$$c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation count}(\cdot) & \text{otherwise.} \end{cases}$$
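For concreteness, I store counts in a dict mapping an order n to a dict of n-gram tuples and their frequencies (the same layout as the `ngram_dict` in my code further down). With that layout, my reading of $c_{KN}$ is roughly the sketch below; the helper names and the `at_highest_order` flag are mine, and I may be misreading the case split:

def continuation_count(ngram, ngram_dict):
    # Number of distinct words w such that (w,) + ngram was seen in the corpus.
    # Assumes ngram_dict[len(ngram) + 1] holds the counts one order up.
    higher_order = ngram_dict[len(ngram) + 1]
    return sum(1 for key in higher_order if key[1:] == ngram)

def c_kn(ngram, ngram_dict, at_highest_order):
    # c_KN: the raw count at the top level of the recursion,
    # the continuation count everywhere below it.
    if at_highest_order:
        return ngram_dict[len(ngram)].get(ngram, 0)
    return continuation_count(ngram, ngram_dict)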

Following this link, I was able to understand how to implement the first half of the equation, i.e.

$$\frac{\max\!\left(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\right)}{c_{KN}(w_{i-n+1}^{i-1})}$$
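Using the `c_kn` helper from the sketch above, that first half would just be the discounted ratio below (again my interpretation; I return 0 when the context was never seen, which may not be the right backoff behaviour):

def first_term(ngram, ngram_dict, at_highest_order, d=0.75):
    # max(c_KN(w_{i-n+1}^i) - d, 0) / c_KN(w_{i-n+1}^{i-1})
    numerator = max(c_kn(ngram, ngram_dict, at_highest_order) - d, 0)
    denominator = c_kn(ngram[:-1], ngram_dict, at_highest_order)
    return numerator / denominator if denominator > 0 else 0.0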

But the second half, in particular the $\lambda(w_{i-n+1}^{i-1})$ term, is what confuses me. The author in the link states that

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{ w : c(w_{i-1}, w) > 0 \}\right|$$

but that when we are at the highest order, the discounting factor is zero. The author goes on to say: "The denominator of the fraction is, in the special case of 2-grams, the frequency of the semi-final (second-to-last) word, but in the recursive scheme we are developing we should consider the whole string before the final word (well, in the 2-gram case the semi-final word is the whole string). The term to the right of the fraction is the number of distinct final-word types (not frequencies) that appear after that string."

He goes on to give an example, but it makes no sense to me. I really cannot find any good material explaining how to compute the $\lambda(w_{i-n+1}^{i-1})$ term.
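For what it is worth, my best guess at that term, reading the formula literally with the same count layout and helpers as above, is the sketch below: d divided by $c_{KN}$ of the context, times the number of distinct word types that follow the context. I am not at all sure this is what the author means:

def lambda_weight(context, ngram_dict, at_highest_order, d=0.75):
    # |{w : c(context, w) > 0}|: how many different word types were
    # seen following this context, taken from the order one level up.
    followers = ngram_dict[len(context) + 1]
    follower_types = {key[-1] for key in followers if key[:-1] == context}
    denominator = c_kn(context, ngram_dict, at_highest_order)
    return (d / denominator) * len(follower_types) if denominator > 0 else 0.0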

Any suggestions or material on this would help. Keep in mind that I am doing this numerically, so I need the equation broken down far enough that I can code it, which means I need to understand what each term is.

Following the link above, this is the preliminary implementation I came up with:

def kneser_ney_smoothing(previous_tokens, ngram_dict, discounting_factor=0.75):
    suggestions = []

    # Start with previous_tokens (the user input)
    previous_ngram = tuple(previous_tokens)
    previous_ngram_minus_last_word = previous_ngram[:-1]  # the context w_{i-n+1}^{i-1}
    len_previous_ngram = len(previous_ngram)

    # Pull the n-gram count maps for the highest and second-highest orders
    highest_order_ngrams_map = ngram_dict[len_previous_ngram]
    second_highest_order_ngrams_map = ngram_dict[len_previous_ngram - 1]

    # Check if the user input is in the highest-order n-gram map
    if previous_ngram in highest_order_ngrams_map:
        # discounting_factor = 0  # From the link: if the user's input is in the highest-order
        #                         # n-grams then d is 0; I found this nowhere in the literature

        # First term: max(c(w_{i-n+1}^i) - d, 0) / c(w_{i-n+1}^{i-1})
        first_num = max(highest_order_ngrams_map[previous_ngram] - discounting_factor, 0)
        first_denom = second_highest_order_ngrams_map[previous_ngram_minus_last_word]

        # Lambda: d over the frequency of the second-to-last word, times the number of
        # distinct words that follow the context in the highest-order n-grams
        lower_order_words = [word for key in second_highest_order_ngrams_map for word in key]
        lamb_denom = lower_order_words.count(previous_ngram[-2])
        words_after_context = {key[-1] for key in highest_order_ngrams_map
                               if key[:-1] == previous_ngram_minus_last_word}
        lamb = (discounting_factor / lamb_denom) * len(words_after_context)

        # Continuation probability of the final word: number of highest-order n-grams that
        # end in this word, over the total number of distinct highest-order n-grams
        pcont_num = sum(1 for key in highest_order_ngrams_map if key[-1] == previous_ngram[-1])
        pcont_denom = len(highest_order_ngrams_map)
        pcont = pcont_num / pcont_denom

        return first_num / first_denom + lamb * pcont

    else:
        pass

I was able to match his corpus and numerical example, so I should be on the right track; I am just not sure about the recursive part.
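For the recursion specifically, this is how I picture it fitting together, reusing the helpers from the sketches above. The base case (falling back to a continuation-unigram distribution when the context is empty) and the full backoff on an unseen context are my own guesses, not something the link spells out:

def p_kn(word, context, ngram_dict, d=0.75, at_highest_order=True):
    # Base case: empty context -> continuation-unigram probability, i.e. the
    # fraction of distinct bigram types that end in this word (my assumption).
    if len(context) == 0:
        bigrams = ngram_dict[2]
        ending_in_word = sum(1 for key in bigrams if key[-1] == word)
        return ending_in_word / len(bigrams) if bigrams else 0.0

    # If the context was never seen at this order, back off entirely (my assumption).
    if c_kn(context, ngram_dict, at_highest_order) == 0:
        return p_kn(word, context[1:], ngram_dict, d, at_highest_order=False)

    ngram = context + (word,)
    first = first_term(ngram, ngram_dict, at_highest_order, d)
    lam = lambda_weight(context, ngram_dict, at_highest_order, d)
    # Everything below the top level uses continuation counts (the "otherwise" case of c_KN).
    return first + lam * p_kn(word, context[1:], ngram_dict, d, at_highest_order=False)

So with a trigram model I would call it as p_kn(word, (w1, w2), ngram_dict) and expect two levels of recursion plus the unigram base case. Is that the right way to read the recursion?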
