数据挖掘 - 如何在 O(|V|k^2) 中实现 Brown 聚类算法 - 吾爱随笔录

我正在尝试实现布朗聚类算法。

论文详细信息：Brown 等人的“基于类的自然语言 n-gram 模型”

该算法应该在O(|V|k^2)哪里|V|是词汇的大小，k是集群的数量。我无法有效地实现它。事实上，我能做到的最好的O(|V|k^3)就是太慢了。我目前对算法主要部分的实现如下：

for w = number of clusters + 1 to |V|
{
   word = next most frequent word in the corpus

   assign word to a new cluster 

   initialize MaxQuality to 0

   initialize ArgMax vector to (0,0)

   for i = 0 to number of clusters - 1 
   {
      for j = i to number of clusters
      {
         Quality = Mutual Information if we merge cluster i and cluster j

         if Quality > MaxQuality
         {
            MaxQuality = Quality 

            ArgMax = (i,j) 
         }
      }
   }
}

我计算质量如下：

1. Before entering the second loop compute the pre-merge quality i.e. quality before doing any merges.
2. Every time a cluster-pair merge step is considered:
    i. assign quality := pre-merge quality
   ii. quality = quality - any terms in the mutual information equation that contain cluster i or cluster j (pre-merge)
  iii. quality = quality + any terms in the mutual information equation that contain (cluster i U cluster j)  (post-merge)

在我的实现中，第一个循环大约有 |V| 迭代，第二个和第三个循环每个大约 k 次迭代。计算每一步的质量需要大约进一步的 k 次迭代。总的来说，它(|V|k^3)及时运行。

你如何让它运行(|V|k^2)？