How do I load a FastText pretrained model with Gensim?

data-mining nlp gensim

2021-09-25 21:45:52

I am trying to load a pretrained fastText model from the fastText models page. I am using wiki.simple.en.

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

However, it shows the following error:

Traceback (most recent call last):
  File "nltk_check.py", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\site-packages\gensim\models\keyedvectors.py", line 206, in load_word2vec_format
     header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load a fastText model with Gensim?

Question 2: Also, after loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')
 # Something like this
 Output : 0.99

How do I do this?

4 Answers

Here is the link to the methods available in gensim's fastText implementation: fasttext.py

from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754
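The score returned by similarity() is the cosine similarity of the two word vectors. As a minimal sketch of that computation, the snippet below uses plain Python and two toy 3-dimensional vectors that are invented for illustration; real fastText vectors have hundreds of dimensions and come from the loaded model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for this example (not taken from any model).
teacher = [1.0, 2.0, 0.0]
teaches = [2.0, 1.0, 0.0]

score = cosine_similarity(teacher, teaches)
print(round(score, 2))  # 0.8
```

A value close to 1 means the vectors point in nearly the same direction; a value near 0 means they are unrelated.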

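For .bin use: load_fasttext_format() (this typically contains the full model with parameters, ngrams, etc.).

For .vec use: load_word2vec_format (this contains ONLY word vectors -> no ngrams + you can't update the model).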
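The difference between the two formats also explains the UnicodeDecodeError in the question. The sketch below, using only the standard library, illustrates it: a fastText .bin file starts with a binary magic number rather than a text header, while a .vec file is plain text with a "count dim" header line. The magic value is the one used in the fastText source as far as I know, so treat it as an assumption; the helper names and the sample data are hypothetical.

```python
import struct

# Magic number written at the start of fastText .bin files
# (assumed from the fastText source code).
FASTTEXT_MAGIC = 793712314


def looks_like_fasttext_bin(first_bytes):
    """Return True if the first four bytes match the fastText magic number."""
    if len(first_bytes) < 4:
        return False
    (value,) = struct.unpack("<i", first_bytes[:4])
    return value == FASTTEXT_MAGIC


def parse_vec(text):
    """Parse .vec-style text: a 'count dim' header, then 'word v1 v2 ...' lines."""
    lines = text.strip().splitlines()
    count, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# The first byte of the little-endian magic number is 0xba,
# which is not a valid first byte of a UTF-8 sequence.
bin_header = struct.pack("<i", FASTTEXT_MAGIC)
print(hex(bin_header[0]))  # 0xba
try:
    bin_header.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)

# A made-up two-word .vec sample with 3-dimensional vectors.
sample_vec = "2 3\nteacher 0.1 0.2 0.3 \nteaches 0.2 0.1 0.3 \n"
count, dim, vectors = parse_vec(sample_vec)
print(count, dim, sorted(vectors))
```

So load_word2vec_format(..., binary=True) tries to read a text header from bytes that were never text, which is why the .bin file needs the fastText-specific loader.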

Note: If you run into memory issues or cannot load the .bin model, check whether pyfasttext can load the same model.

Credits: Ivan Menshikh (Gensim maintainer)

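Update 04/2020

load_fasttext_format() is now deprecated. The updated way is to load the model with gensim.models.fasttext.load_facebook_model() or gensim.models.fasttext.load_facebook_vectors(). Both read Facebook's native .bin format: the former returns the full model, which can continue training, while the latter returns only the word vectors. Plain-text .vec files are still loaded with KeyedVectors.load_word2vec_format().

For example: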

from gensim.models.fasttext import load_facebook_model

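model = load_facebook_model('<path_to.bin.gz>')  # returns the full FastText model
wv = model.wv  # the word vectors, e.g. wv.similarity('teacher', 'teaches')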
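If only lookups and similarity queries are needed, load_facebook_vectors() may be enough. The following is a rough sketch under that assumption; compare_words is a hypothetical helper name, the file path in the usage comment is only an example, and gensim is imported inside the function so that the definition itself does not require it.

```python
def compare_words(bin_path, word_a, word_b):
    """Load vectors from a fastText .bin file and return a similarity score."""
    # Lazy import: gensim is only needed when the function is called.
    from gensim.models.fasttext import load_facebook_vectors

    wv = load_facebook_vectors(bin_path)  # word vectors only, no training state
    return wv.similarity(word_a, word_b)


# Hypothetical usage; the .bin file must already exist locally:
# print(compare_words('wiki.simple.bin', 'teacher', 'teaches'))
```

Loading happens on every call here, so for repeated queries it would be better to load the vectors once and reuse them.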

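I really wanted to use gensim, but ultimately found that the native fasttext library worked better for me. You can copy/paste the following code into Google Colab and it will work out of the box: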

pip install fasttext

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

It also works for out-of-vocabulary words:

ft.get_word_vector("another")
ft.get_word_vector("dkjeri37id20hnd")
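Out-of-vocabulary lookups work because fastText builds a word's vector from its character n-grams, so even a nonsense string shares subword pieces with the training data. Below is a simplified sketch of the n-gram extraction step only; the function name and the default lengths are illustrative assumptions, and the real library additionally hashes the n-grams into buckets before combining their vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

The boundary markers let the model tell a prefix such as '<wh' apart from the same letters in the middle of a word.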
