我正在尝试实现布朗聚类算法。
论文详细信息:Brown 等人的“基于类的自然语言 n-gram 模型”
该算法应该在O(|V|k^2)
哪里|V|
是词汇的大小,k是集群的数量。我无法有效地实现它。事实上,我能做到的最好的O(|V|k^3)
就是太慢了。我目前对算法主要部分的实现如下:
for w = number of clusters + 1 to |V|
{
word = next most frequent word in the corpus
assign word to a new cluster
initialize MaxQuality to 0
initialize ArgMax vector to (0,0)
for i = 0 to number of clusters - 1
{
for j = i to number of clusters
{
Quality = Mutual Information if we merge cluster i and cluster j
if Quality > MaxQuality
{
MaxQuality = Quality
ArgMax = (i,j)
}
}
}
}
我计算质量如下:
1. Before entering the second loop compute the pre-merge quality i.e. quality before doing any merges.
2. Every time a cluster-pair merge step is considered:
i. assign quality := pre-merge quality
ii. quality = quality - any terms in the mutual information equation that contain cluster i or cluster j (pre-merge)
iii. quality = quality + any terms in the mutual information equation that contain (cluster i U cluster j) (post-merge)
在我的实现中,第一个循环大约有 |V| 迭代,第二个和第三个循环每个大约 k 次迭代。计算每一步的质量需要大约进一步的 k 次迭代。总的来说,它(|V|k^3)
及时运行。
你如何让它运行(|V|k^2)
?