I want to build a co-occurrence matrix where each entry counts how many times word1 appears within a given window (say 5) of word2. There are 100 words and a list of 1000 sentences. How can I compute the (100 x 100) co-occurrence matrix in Python?
How to implement a word-to-word co-occurrence matrix in Python
data-mining
Python
word2vec
2022-02-16 14:11:45
2 Answers
from nltk.tokenize import word_tokenize
from itertools import combinations
from collections import Counter

sentences = ['i go to london', 'you do not go to london', 'but london goes to you']

vocab = set(word_tokenize(' '.join(sentences)))
print('Vocabulary:\n', vocab, '\n')

token_sent_list = [word_tokenize(sen) for sen in sentences]
print('Each sentence in token form:\n', token_sent_list, '\n')

# One Counter per vocabulary word, holding co-occurrence counts with every other word
co_occ = {ii: Counter({jj: 0 for jj in vocab if jj != ii}) for ii in vocab}
k = 2  # window size: k words to the left and k words to the right

for sen in token_sent_list:
    for ii in range(len(sen)):
        if ii < k:
            # window clipped at the start of the sentence
            c = Counter(sen[0:ii + k + 1])
            del c[sen[ii]]
            co_occ[sen[ii]] = co_occ[sen[ii]] + c
        elif ii > len(sen) - (k + 1):
            # window clipped at the end of the sentence
            c = Counter(sen[ii - k:])
            del c[sen[ii]]
            co_occ[sen[ii]] = co_occ[sen[ii]] + c
        else:
            # full window of k words on each side
            c = Counter(sen[ii - k:ii + k + 1])
            del c[sen[ii]]
            co_occ[sen[ii]] = co_occ[sen[ii]] + c

# Having the final matrix in dict form lets you convert it to different python data structures
co_occ = {ii: dict(co_occ[ii]) for ii in vocab}
display(co_occ)
Output:
Vocabulary:
{'london', 'but', 'goes', 'i', 'do', 'you', 'go', 'not', 'to'}
Each sentence in token form:
[['i', 'go', 'to', 'london'], ['you', 'do', 'not', 'go', 'to', 'london'], ['but', 'london', 'goes', 'to', 'you']]
{'london': {'go': 2, 'to': 3, 'but': 1, 'goes': 1},
'but': {'london': 1, 'goes': 1},
'goes': {'london': 1, 'but': 1, 'you': 1, 'to': 1},
'i': {'go': 1, 'to': 1},
'do': {'you': 1, 'go': 1, 'not': 1},
'you': {'do': 1, 'not': 1, 'goes': 1, 'to': 1},
'go': {'london': 2, 'i': 1, 'to': 2, 'do': 1, 'not': 1},
'not': {'do': 1, 'you': 1, 'go': 1, 'to': 1},
'to': {'london': 3, 'i': 1, 'go': 2, 'not': 1, 'goes': 1, 'you': 1}}
P.S.
- Do the text preprocessing yourself (remove punctuation, lemmatization, stemming, and so on).
- Continue with whatever conversion you want. You have the dictionary, so you can turn it into a sparse matrix or a pandas DataFrame (see the sketch below).
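As a minimal sketch of that last step (assuming the co_occ dict and vocab set from the answer above; the names vocab_sorted and df are mine), the nested dict can be expanded into a dense DataFrame with zeros for pairs that never co-occur:

import pandas as pd

# Outer keys of co_occ become columns in pd.DataFrame(co_occ), so transpose
# with .T so that df.loc[word1, word2] == co_occ[word1][word2]
vocab_sorted = sorted(vocab)
df = pd.DataFrame(co_occ).T.reindex(index=vocab_sorted, columns=vocab_sorted).fillna(0).astype(int)
print(df)
# For a sparse representation you could then use, e.g., scipy.sparse.csr_matrix(df.values)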
You can try this:
import numpy as np
import pandas as pd

ctxs = [
    'krayyem like candy crush more then coffe',
    'krayyem plays candy crush all days',
    'krayyem do not invite his friends to play candy crush',
    'krayyem is smart',
]

l_unique = list(set((' '.join(ctxs)).split(' ')))
mat = np.zeros((len(l_unique), len(l_unique)))
nei_size = 3  # number of neighbours kept on each side of the current word

for ctx in ctxs:
    words = ctx.split(' ')
    nei = []  # sliding window of recent words, reset for each sentence
    for i, _ in enumerate(words):
        nei.append(words[i])
        if len(nei) > (nei_size * 2) + 1:
            nei.pop(0)
        # count the current word against every word still in the window
        for j, _ in enumerate(nei):
            if nei[j] in l_unique and words[i] in l_unique:
                mat[l_unique.index(nei[j]), l_unique.index(words[i])] += 1

mat = pd.DataFrame(mat)
mat.index = l_unique
mat.columns = l_unique
display(mat)
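A quick usage example with the mat DataFrame built above: a single pair count can be read off with .loc. Note that with this sliding-window approach the window also contains the current word itself, so the diagonal counts each word against its own occurrences.

# Hypothetical lookup: how often does 'candy' appear near 'crush'?
print(mat.loc['candy', 'crush'])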