Gensim word2vec training error on tweets

data-mining word2vec gensim

2022-02-15 16:16:34

I am trying to train word2vec embeddings on tweets. I defined the sentence class as follows:

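import codecs
from nltk.corpus import stopwords

# tknz (a tweet tokenizer) and tokenize() (a text-cleaning helper) are defined elsewhere.
STOPWORDS = set(stopwords.words('english'))  # build the set once, not once per token

def tokenize_tweets():
    with codecs.open('../data/sample_tweets.txt', encoding='utf-8') as tweets:
        for line in tweets:
            tweet_text = ' '.join(token for token in tknz.tokenize(line) if token not in STOPWORDS)
            try:
                mod_text = tokenize(tweet_text)
                tokens = tknz.tokenize(mod_text)
                # Yield a placeholder so that every tweet produces a non-empty sentence.
                yield tokens if tokens else ['<NULL>']
            except UnicodeEncodeError:
                yield ['<NULL>']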

Building the vocabulary from this class works fine. But when I try to run the train method, I get the following error:

ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count.

I'm not sure what is wrong with it.

2 Answers

In newer versions of gensim's Word2Vec, writing only the following is no longer enough:

model_name.train(sentences)

You also have to pass either total_examples or total_words.
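The error message itself names the usual value, total_examples=model.corpus_count. A minimal sketch of that pattern follows. The helper names corpus_counts and train_word2vec and the toy corpus are made up for illustration, and min_count=1 is an assumption chosen only so that the tiny corpus is not filtered out.

```python
def corpus_counts(sentences):
    """Return (number of sentences, number of tokens) for a list of token lists."""
    return len(sentences), sum(len(sentence) for sentence in sentences)


def train_word2vec(sentences, epochs=5):
    """Build the vocabulary, then train with an explicit example count."""
    # Imported here so corpus_counts() can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(min_count=1)   # assumption: keep every word of the toy corpus
    model.build_vocab(sentences)    # scans the corpus and sets model.corpus_count
    model.train(sentences, total_examples=model.corpus_count, epochs=epochs)
    return model


# Hypothetical toy corpus: each sentence is a list of tokens.
toy_sentences = [["gensim", "word2vec"], ["training", "on", "tweets"]]
n_examples, n_words = corpus_counts(toy_sentences)
```

Here n_examples is what total_examples expects (one example per sentence) and n_words is what total_words expects, so only one of the two needs to be passed.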

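For example, I write:

model_name.train(sentences, total_words=token_count, epochs=model_name.iter)

where token_count = sum(len(sentence) for sentence in sentences). Because token_count counts words rather than sentences, it belongs in total_words; the matching value for total_examples would be len(sentences). (In gensim 4.0 and later, model_name.iter is spelled model_name.epochs.) This is how I get the sentences: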

sentences = []
for raw_sentence in raw_sentences:
    # Skip empty lines; sentence_to_wordlist() turns a raw string into a list of tokens.
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))
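A list like the one above can be iterated as often as needed, which matters because the vocabulary scan and every training epoch read the corpus again, whereas a generator object is exhausted after a single pass. For a corpus too large to hold in memory, one possible alternative is an iterable that reopens its source on each pass. The sketch below is only an illustration: the class name TweetCorpus, the open_source callable and the use of str.split as the tokenizer are all assumptions, not part of the original code.

```python
import io


class TweetCorpus:
    """Iterable that re-reads its source on every pass (hypothetical helper)."""

    def __init__(self, open_source, tokenizer=str.split):
        self.open_source = open_source  # zero-argument callable returning an iterator of lines
        self.tokenizer = tokenizer      # stand-in for the real tweet tokenizer

    def __iter__(self):
        # A fresh line iterator is created each time, so the corpus is restartable.
        for line in self.open_source():
            tokens = self.tokenizer(line)
            if tokens:                  # skip empty tweets
                yield tokens


# In-memory stand-in for a tweet file; a real source might open a file instead.
corpus = TweetCorpus(lambda: io.StringIO("first tweet here\n\nsecond tweet\n"))
first_pass = list(corpus)
second_pass = list(corpus)   # same content again, unlike a spent generator
```

Such an object has no len(), so the example count would come from the vocabulary scan (model.corpus_count) rather than from len(sentences).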

More documentation is available here.

See this GitHub issue for the solution: https://github.com/RaRe-Technologies/gensim/issues/1284
