How do I initialize a new word2vec model with pre-trained model weights?

data-mining python nlp word-embeddings word2vec gensim

2021-09-21 23:58:02

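I am using the Gensim library in Python to train and use word2vec models. Recently I have been looking at initializing my model weights from a pre-trained word2vec model, such as the model pre-trained on the Google News dataset. I have been struggling with this for a couple of weeks. I have now found a function in Gensim that might let me initialize my model's weights from a pre-trained model's weights.

It is documented as follows: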

reset_from(other_model)

    Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

I don't know whether this function does what I need. Please help!

4 Answers

Thanks, Abhishek. I have figured it out! Here is my experiment.

1) We train on a simple example and plot it:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]
# train model (Gensim 3.x API; Gensim 4.x renamed size= to vector_size=
# and wv.vocab to wv.key_to_index)
model_1 = Word2Vec(sentences, size=300, min_count=1)

# fit a 2d PCA model to the vectors
X = model_1.wv[model_1.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

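From the plot above, we can see that a model trained on only these few simple sentences cannot separate words by meaning: the distances between points say little about how the words relate.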
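Judging distances from a 2D PCA plot is only a rough check, so it can help to compare vectors numerically as well. The following is a minimal sketch of cosine similarity, the measure usually used to compare word vectors. The three-dimensional `toy_vectors` and the helper `most_similar` are made up for illustration and are not part of the experiment above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, invented for this sketch only.
toy_vectors = {
    "first": [0.9, 0.1, 0.0],
    "second": [0.8, 0.2, 0.1],
    "sentence": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("first", toy_vectors)
```

With a trained Gensim model, `model_1.wv.similarity(word_a, word_b)` reports the same quantity for two words in the vocabulary.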

2) Load pre-trained word embeddings:

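from gensim.models import KeyedVectors

# Gensim 3.x API, as in step 1.
model_2 = Word2Vec(size=300, min_count=1)
model_2.build_vocab(sentences)
total_examples = model_2.corpus_count
# The file must be in word2vec text format. Raw GloVe files have no
# "<vocab_size> <dimensions>" header line, so convert them first, for example
# with gensim.scripts.glove2word2vec.
model = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False)
# add the pre-trained vocabulary to the model, then copy in the pre-trained weights
model_2.build_vocab([list(model.vocab.keys())], update=True)
# lockf=1.0 leaves the imported vectors trainable (0.0 would freeze them)
model_2.intersect_word2vec_format("glove.6B.300d.txt", binary=False, lockf=1.0)
model_2.train(sentences, total_examples=total_examples, epochs=model_2.epochs)

# fit a 2d PCA model to the vectors
X = model_2.wv[model_1.wv.vocab]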
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

From the plot above, we can see that the word embeddings are now more meaningful.
Hope this answer helps.
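To make the idea behind step 2 concrete, here is a small library-free sketch of what "intersecting" a vocabulary with a pre-trained file amounts to: words that appear in both adopt the file's vector, and the remaining words keep a small random initialization. This is an illustration under stated assumptions, not Gensim's implementation. The function names, the inline `PRETRAINED_TEXT` sample, and the random-initialization range are all hypothetical.

```python
import io
import random


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... v_dim' rows."""
    _count, dim = (int(x) for x in handle.readline().split())
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        table[parts[0]] = [float(x) for x in parts[1:]]
    return dim, table


def init_vectors(vocab, dim, pretrained, seed=0):
    """Copy pre-trained vectors for known words; draw small random vectors otherwise."""
    rng = random.Random(seed)
    vectors, copied = {}, []
    for word in vocab:
        if word in pretrained:
            vectors[word] = list(pretrained[word])
            copied.append(word)
        else:
            # Assumed initialization range, chosen only for this sketch.
            vectors[word] = [(rng.random() - 0.5) / dim for _ in range(dim)]
    return vectors, copied


# Tiny made-up "pre-trained" file: 3 words, 4 dimensions.
PRETRAINED_TEXT = """3 4
this 0.1 0.2 0.3 0.4
sentence 0.5 0.6 0.7 0.8
unrelated 0.9 0.9 0.9 0.9
"""

dim, pretrained = read_word2vec_text(io.StringIO(PRETRAINED_TEXT))
vocab = ["this", "sentence", "word2vec"]
vectors, copied = init_vectors(vocab, dim, pretrained)
```

Note that in this sketch a word that exists only in the pre-trained file ("unrelated") is not added to the model. The experiment above adds those words explicitly with `build_vocab(..., update=True)` before intersecting.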

Let's look at some sample code:

>>> from gensim.models import word2vec

>>> # train a sample model like yours
>>> sentences = [['first', 'sentence'], ['second', 'sentence']]
>>> model1 = word2vec.Word2Vec(sentences, min_count=1)

>>> # let this be the model you want to reset from
>>> sentences = [['third', 'sentence'], ['fourth', 'sentence']]
>>> model2 = word2vec.Word2Vec(sentences, min_count=1)
>>> model1.reset_from(model2)
>>> model1.wv.similarity('third', 'sentence')
-0.064622000988260417

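So we observe that model1 has been reset from model2: the words 'third' and 'sentence' are now in its vocabulary, which is why a similarity can be computed for them. This is the basic usage. As the docstring says, reset_from() borrows shareable structures such as the vocabulary; it is not described as copying the other model's trained vectors. You can also look at reset_weights(), which resets the weights to their untrained, initial state.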
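As a mental model of the behavior shown above, the toy class below shares another model's vocabulary and then re-draws its own weights. It is a hypothetical stand-in written for this answer, assuming that reading of the `reset_from` docstring, and it is not the Gensim class.

```python
import random


class ToyModel:
    """Hypothetical stand-in for an embedding model; not the Gensim class."""

    def __init__(self, sentences, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vocab = sorted({word for sentence in sentences for word in sentence})
        self.reset_weights()

    def reset_weights(self):
        """Re-draw every vector at random, discarding anything learned so far."""
        self.vectors = {
            word: [self.rng.uniform(-0.5, 0.5) / self.dim for _ in range(self.dim)]
            for word in self.vocab
        }

    def reset_from(self, other):
        """Share the other model's vocabulary, then start again from fresh weights."""
        self.vocab = other.vocab  # shared object, not a copy
        self.reset_weights()


model1 = ToyModel([["first", "sentence"], ["second", "sentence"]], seed=1)
model2 = ToyModel([["third", "sentence"], ["fourth", "sentence"]], seed=2)
model1.reset_from(model2)
```

After the call, `model1` knows the words 'third' and 'fourth' but its vectors for them are freshly initialized rather than taken from `model2`. If that matches the library's behavior, `reset_from` alone would not transfer pre-trained weights.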

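If you are looking for pre-trained word embeddings, I would suggest GloVe. Pre-trained GloVe vectors are available in sizes from 50 to 300 dimensions, built on Wikipedia, Common Crawl, or Twitter data. You can download them here. The Keras blog explains in detail how to use them in a model, and it also links to the pre-trained GloVe embeddings.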
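The usual pattern for using GloVe vectors in a neural network is to parse the text file into a word-to-vector table and then build an embedding matrix whose rows follow your tokenizer's word indices. The sketch below shows that pattern with plain Python lists. The inline `GLOVE_SAMPLE`, the `word_index` mapping, and both function names are assumptions for illustration; a real file such as `glove.6B.300d.txt` would be opened from disk instead.

```python
import io


def load_glove(handle):
    """Parse GloVe text rows ('word v1 ... v_dim'); the format has no header line."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank rows
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def build_embedding_matrix(word_index, embeddings, dim):
    """Row i holds the vector of the word with index i.

    Row 0 (assumed to be reserved for padding) and words missing from
    `embeddings` stay all-zero.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vec = embeddings.get(word)
        if vec is not None:
            matrix[i] = list(vec)
    return matrix


# Made-up two-word sample in GloVe's layout, 3 dimensions.
GLOVE_SAMPLE = "the 0.1 0.2 0.3\nsentence 0.4 0.5 0.6\n"

embeddings = load_glove(io.StringIO(GLOVE_SAMPLE))
word_index = {"the": 1, "sentence": 2, "word2vec": 3}  # hypothetical tokenizer output
matrix = build_embedding_matrix(word_index, embeddings, dim=3)
```

The resulting matrix can then be passed as the initial weights of an embedding layer, either frozen or left trainable depending on how much task-specific data you have.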

I have done this in my GitHub repository.

See whether it is what you need.
