How to implement a word-to-word co-occurrence matrix in Python

data-mining Python word2vec
2022-02-16 14:11:45

I want to build a co-occurrence matrix in which each entry counts how many times word1 appears within a given window around word2, say a window of 5. I have a vocabulary of 100 words and a list of 1000 sentences. How can I compute the (100 × 100) co-occurrence matrix in Python?

2 Answers
from nltk.tokenize import word_tokenize
from collections import Counter

sentences = ['i go to london', 'you do not go to london', 'but london goes to you']
vocab = set(word_tokenize(' '.join(sentences)))
print('Vocabulary:\n', vocab, '\n')
token_sent_list = [word_tokenize(sen) for sen in sentences]
print('Each sentence in token form:\n', token_sent_list, '\n')

# One Counter per word, holding its co-occurrence counts with every other word
co_occ = {ii: Counter({jj: 0 for jj in vocab if jj != ii}) for ii in vocab}
k = 2  # window size: k words on each side of the centre word

for sen in token_sent_list:
    for ii in range(len(sen)):
        if ii < k:
            # window clipped at the start of the sentence
            c = Counter(sen[0:ii + k + 1])
        elif ii > len(sen) - (k + 1):
            # window clipped at the end of the sentence
            c = Counter(sen[ii - k:])
        else:
            # full window of k words on both sides
            c = Counter(sen[ii - k:ii + k + 1])
        del c[sen[ii]]  # do not count the centre word itself
        co_occ[sen[ii]] = co_occ[sen[ii]] + c

# Having the final matrix in dict form lets you convert it to different Python data structures
co_occ = {ii: dict(co_occ[ii]) for ii in vocab}
display(co_occ)  # display() is a Jupyter/IPython helper; use print(co_occ) in a plain script

Output:

Vocabulary:
 {'london', 'but', 'goes', 'i', 'do', 'you', 'go', 'not', 'to'} 

Each sentence in token form:
 [['i', 'go', 'to', 'london'], ['you', 'do', 'not', 'go', 'to', 'london'], ['but', 'london', 'goes', 'to', 'you']] 

{'london': {'go': 2, 'to': 3, 'but': 1, 'goes': 1},
 'but': {'london': 1, 'goes': 1},
 'goes': {'london': 1, 'but': 1, 'you': 1, 'to': 1},
 'i': {'go': 1, 'to': 1},
 'do': {'you': 1, 'go': 1, 'not': 1},
 'you': {'do': 1, 'not': 1, 'goes': 1, 'to': 1},
 'go': {'london': 2, 'i': 1, 'to': 2, 'do': 1, 'not': 1},
 'not': {'do': 1, 'you': 1, 'go': 1, 'to': 1},
 'to': {'london': 3, 'i': 1, 'go': 2, 'not': 1, 'goes': 1, 'you': 1}}

P.S.

  1. Do the text preprocessing yourself (punctuation removal, lemmatization, stemming, and so on).
  2. Add the code for whatever conversion you want afterwards. Since you have a dictionary, you can convert it to a sparse matrix or a pandas DataFrame (see the sketch below).
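
For the second point, here is a minimal sketch of the conversion, assuming the co_occ dict and the vocab set built above; the names vocab_list, co_occ_df, and co_occ_sparse are just illustrative:

import pandas as pd
from scipy.sparse import csr_matrix

# Fix an ordering of the vocabulary so that rows and columns line up
vocab_list = sorted(vocab)

# dict of dicts -> dense DataFrame; pairs that never co-occur are filled with 0
co_occ_df = (pd.DataFrame.from_dict(co_occ, orient='index')
               .reindex(index=vocab_list, columns=vocab_list)
               .fillna(0)
               .astype(int))

# Optional: a scipy sparse matrix, useful for large vocabularies
co_occ_sparse = csr_matrix(co_occ_df.values)

print(co_occ_df)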

You could try this:

import numpy as np
import pandas as pd

ctxs = [
    'krayyem like candy crush more then coffe',
    'krayyem plays candy crush all days',
    'krayyem do not invite his friends to play candy crush',
    'krayyem is smart',
]

# Vocabulary: one row/column per unique word
l_unique = list(set((' '.join(ctxs)).split(' ')))
mat = np.zeros((len(l_unique), len(l_unique)))

nei_size = 3  # number of context words kept on each side

for ctx in ctxs:
    words = ctx.split(' ')
    nei = []  # rolling buffer of the most recent words, reset for every sentence

    for i, _ in enumerate(words):
        nei.append(words[i])

        # keep at most nei_size * 2 + 1 words in the buffer
        if len(nei) > (nei_size * 2) + 1:
            nei.pop(0)

        # count the current word together with every word still in the buffer
        # (the buffer also contains the current word, so the diagonal is incremented too)
        for j, _ in enumerate(nei):
            mat[l_unique.index(nei[j]), l_unique.index(words[i])] += 1

mat = pd.DataFrame(mat, index=l_unique, columns=l_unique)
display(mat)  # display() works in Jupyter/IPython; use print(mat) in a plain script
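
Once the DataFrame is built, counts can be looked up by word labels directly. A couple of usage examples, assuming the mat and l_unique names from the snippet above:

# how often 'crush' appeared in the context of 'candy' (row word, column word)
print(mat.loc['candy', 'crush'])

# all context counts for one word, most frequent first
print(mat.loc['krayyem'].sort_values(ascending=False))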