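I am computing a co-occurrence matrix with a fixed window size in Python, using scipy's lil_matrix to store the counts. The counts are computed by sliding a context window over each word and counting the words that fall inside that window.
Even for a relatively small corpus (a 100 MB Wikipedia dump), the code takes far too long. The code is: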
    import numpy as np
    from nltk.tokenize import word_tokenize
    from scipy.sparse import lil_matrix

    def gen_coocur(self, window_size=5):
        '''
        Generates co-occurrence matrix
        '''
        # self.vocab is precomputed.
        coocur_matrix = lil_matrix((len(self.vocab)+1, len(self.vocab)+1), dtype=np.float64)
        for page in self.wiki_extract.get_page():
            # word_tokenize is the tokenizer from nltk
            doc_tokens = word_tokenize(page.decode('utf-8'))
            N = len(doc_tokens)
            for token in self.vocab:
                # Positions near the start: the window is cut off at index 0.
                for i in xrange(0, window_size):
                    if (token in doc_tokens[0:i] or token in doc_tokens[i:(i+window_size+1)]) and token != doc_tokens[i]:
                        coocur_matrix[self.vocab[doc_tokens[i]], self.vocab[token]] += 1
                # Interior positions: full window on both sides.
                for i in xrange(window_size, (N-window_size)):
                    if token in doc_tokens[(i-window_size):(i+window_size+1)] and token != doc_tokens[i]:
                        coocur_matrix[self.vocab[doc_tokens[i]], self.vocab[token]] += 1
                # Positions near the end: the window is cut off at index N.
                for i in xrange(N-window_size, N):
                    if (token in doc_tokens[i:N] or token in doc_tokens[i-window_size:N]) and token != doc_tokens[i]:
                        coocur_matrix[self.vocab[doc_tokens[i]], self.vocab[token]] += 1
        return coocur_matrix
vocab is a dictionary mapping word -> wordId. How can I optimize this code so it runs faster?
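
For comparison, the windowed counting described above can also be written as a single pass over the tokens instead of a loop over the whole vocabulary. This is only a minimal sketch under stated assumptions, not a tested replacement: the names `cooccurrence_counts`, `to_sparse` and `docs` are hypothetical, `docs` is assumed to be an iterable of already tokenized pages, tokens missing from `vocab` are assumed to be skipped, and the code is written for Python 3. It keeps the behaviour of the membership test above, where a word is counted at most once per window.

```python
from collections import Counter

import numpy as np
from scipy.sparse import coo_matrix


def cooccurrence_counts(docs, vocab, window_size=5):
    """Count, for every position, each distinct vocabulary word in its window."""
    counts = Counter()
    for tokens in docs:
        # Map tokens to ids once; None marks out-of-vocabulary tokens
        # (assumption: such tokens are simply skipped).
        ids = [vocab.get(tok) for tok in tokens]
        n = len(ids)
        for i, center in enumerate(ids):
            if center is None:
                continue
            lo = max(0, i - window_size)
            hi = min(n, i + window_size + 1)
            # A set keeps the "at most once per window" behaviour of the
            # membership test in the question's code.
            for other in set(ids[lo:hi]):
                if other is not None and other != center:
                    counts[(center, other)] += 1
    return counts


def to_sparse(counts, size):
    """Build the sparse matrix in one step instead of updating it cell by cell."""
    if not counts:
        return coo_matrix((size, size), dtype=np.float64).tocsr()
    rows, cols = zip(*counts.keys())
    data = np.fromiter(counts.values(), dtype=np.float64, count=len(counts))
    return coo_matrix((data, (rows, cols)), shape=(size, size)).tocsr()
```

A call such as `to_sparse(cooccurrence_counts(docs, vocab), len(vocab) + 1)` would give a matrix of the same shape as `coocur_matrix`. The work per page then grows with the number of tokens times the window size rather than with the vocabulary size, and the sparse matrix is assembled once from a plain dictionary of counts.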