Data Mining - Help reusing a pre-trained GloVe word embedding model - 吾爱随笔录

Help reusing a pre-trained GloVe word embedding model

data mining, Python, nlp, text mining, feature extraction, word embeddings

2022-03-01 13:00:00

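When generating embeddings with the pre-trained GloVe.6B vectors, how can I load only the 100000 most frequent words in the file instead of all 4M words it contains?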

2 Answers

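I ran into a similar problem when using GloVe. Assuming you have a text dataset and want the top 100000 words from it, you first have to build a list of those words. In the GloVe file, each embedding sits on its own line, and each line starts with the word itself, followed by the embedding values. You have to write code that compares your word list against the words in the GloVe file and extracts the lines that match. Take a look at the sample code here.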
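The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the code the answer linked to: the helper names `top_words` and `load_glove_subset` are hypothetical, tokenization is a plain lowercase whitespace split, and a tiny in-memory string stands in for the real GloVe text file.

```python
import io
from collections import Counter

import numpy as np


def top_words(texts, n):
    """Return the n most frequent whitespace-separated tokens in texts."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [word for word, _ in counts.most_common(n)]


def load_glove_subset(lines, wanted):
    """Keep only the GloVe lines whose first field is one of the wanted words."""
    wanted = set(wanted)  # set lookup keeps the per-line check cheap
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word = parts[0]
        if word in wanted:
            vectors[word] = np.asarray(parts[1:], dtype="float32")
    return vectors


# Stand-in for a GloVe file: one word per line, followed by its vector values
demo_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\nzebra 0.7 0.8 0.9\n")
corpus = ["The cat sat", "the cat ran"]
subset = load_glove_subset(demo_file, top_words(corpus, 2))
```

With a real file, the same function could be fed an open file handle (for example `open(path, encoding="utf-8")`), so the full set of vectors never has to be held in memory at once.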

You can try this approach:

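from keras.preprocessing.text import Tokenizer
from gensim.models import KeyedVectors
import numpy as np

# X is the corpus (a list of strings)
# GLOVE_DIR is the path to the GloVe vectors file (plain text)
# EMBEDDING_DIM is the embedding dimension of the GloVe model

VOCAB_SIZE = 10000
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# word_index maps each word to its frequency rank; 1 is the most frequent word in X
word_index = tokenizer.word_index

# GloVe files are plain text without a header line, so load them with
# binary=False and no_header=True (no_header needs gensim 4.0 or later)
glove_model = KeyedVectors.load_word2vec_format(GLOVE_DIR, binary=False, no_header=True)

# Row 0 stays all zeros because Tokenizer indices start at 1
num_words = min(VOCAB_SIZE, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))

for word, i in word_index.items():
    if i < num_words:
        if word in glove_model:
            embedding_matrix[i] = glove_model[word]
        else:
            # Word has no GloVe vector: fall back to a random one
            embedding_matrix[i] = np.random.rand(EMBEDDING_DIM)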

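embedding_matrix now holds the vectors for the most frequent words in your corpus: row i belongs to the word with Tokenizer index i, up to VOCAB_SIZE (here 10000) rows in total, with row 0 left unused.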
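To make the row layout concrete, here is a small self-contained sketch of the same idea that needs neither Keras nor gensim. It is an illustration under assumptions, not the answer's own code: `build_embedding_matrix` is a hypothetical helper, a `Counter` stands in for the Tokenizer's frequency ranking, and a two-word dictionary stands in for the loaded GloVe vectors.

```python
from collections import Counter

import numpy as np


def build_embedding_matrix(texts, vectors, vocab_size, dim, seed=0):
    """Build a matrix whose row i holds the vector of the i-th most frequent word."""
    rng = np.random.default_rng(seed)
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Rank words by frequency, starting at 1 so that row 0 stays free for padding
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    rows = min(vocab_size, len(word_index) + 1)
    matrix = np.zeros((rows, dim), dtype="float32")
    for word, i in word_index.items():
        if i >= rows:
            continue  # word falls outside the requested vocabulary
        if word in vectors:
            matrix[i] = vectors[word]
        else:
            matrix[i] = rng.random(dim)  # no pre-trained vector: random fallback
    return matrix, word_index


# Toy stand-in for pre-trained vectors (hypothetical values)
toy_vectors = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
matrix, word_index = build_embedding_matrix(
    ["the cat sat", "the cat", "the"], toy_vectors, vocab_size=3, dim=2
)
```

Because the row count is capped at `vocab_size` and row 0 is reserved, a cap of N leaves room for N - 1 words, which is worth keeping in mind when choosing the value.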
