Gensim word2vec training error on tweets

data-mining word2vec gensim
2022-02-15 16:16:34

I am trying to train word2vec embeddings on tweets. I defined the sentence class as follows:

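import codecs
from nltk.corpus import stopwords

# tknz (a tokenizer instance) and tokenize (a text-cleaning helper) are defined elsewhere.
def tokenize_tweets():
    stop_words = set(stopwords.words('english'))  # build the stopword set once, not per token
    for line in codecs.open('../data/sample_tweets.txt', encoding='utf-8'):
        tweet_text = ' '.join([token for token in tknz.tokenize(line) if token not in stop_words])
        try:
            mod_text = tokenize(tweet_text)
            tokens = tknz.tokenize(mod_text)
            if len(tokens) > 0:
                yield tokens  # reuse the tokens instead of tokenizing a second time
            else:
                yield ['NULL']
        except UnicodeEncodeError as e:
            yield ['<NULL>']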

Building the vocabulary from this class works fine. But when I try to run the train method, I get the following error:

ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count.

I am not sure what is wrong with it.

2 Answers

In newer versions of gensim's Word2Vec, writing only the following is not enough:

model_name.train(sentences)

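You must also pass either total_examples (the number of sentences) or total_words (the number of tokens).

For example, I write:

model_name.train(sentences, total_words=token_count, epochs=model_name.iter)

where token_count = sum([len(sentence) for sentence in sentences]). Because this counts tokens rather than sentences, it belongs in total_words; if you pass total_examples instead, use the sentence count, model_name.corpus_count, as the error message suggests. In gensim 4.0 and later, model_name.iter is named model_name.epochs. This is how I get the sentences: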

sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

More documentation is available here.
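To make the difference between the two counts concrete, here is a minimal sketch, assuming the corpus is an in-memory list of token lists. The names corpus_stats, train_word2vec, and toy_sentences are hypothetical and only for illustration. The gensim import is done inside the function so the counting part runs on its own; the training part assumes gensim is installed and uses only build_vocab, train, and corpus_count.

```python
def corpus_stats(sentences):
    """Return (number of sentences, number of tokens) for a list of token lists."""
    total_examples = len(sentences)                # what total_examples expects
    total_words = sum(len(s) for s in sentences)   # what total_words expects
    return total_examples, total_words


def train_word2vec(sentences, epochs=5):
    """Build the vocabulary, then train with an explicit example count."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    model = Word2Vec(min_count=1)       # min_count=1 only because the toy corpus is tiny
    model.build_vocab(sentences)        # sets model.corpus_count to the sentence count
    model.train(sentences, total_examples=model.corpus_count, epochs=epochs)
    return model


# Hypothetical toy corpus standing in for the tokenized tweets.
toy_sentences = [
    ["good", "morning", "twitter"],
    ["word2vec", "on", "tweets"],
    ["hello"],
]
n_sentences, n_tokens = corpus_stats(toy_sentences)
print(n_sentences, n_tokens)
```

Passing the sentence count as total_examples (or the token count as total_words) keeps the learning-rate decay and progress reporting consistent with the data actually seen.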

See this GitHub issue for the solution - https://github.com/RaRe-Technologies/gensim/issues/1284
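One related point for the generator function in the question: gensim makes one pass over the corpus in build_vocab and further passes in train, and a plain generator is exhausted after its first pass. A minimal sketch of a restartable iterable follows, assuming one tweet per line in a UTF-8 text file. TweetCorpus is a hypothetical name, and str.split stands in for the asker's own tokenizer.

```python
import os
import tempfile


class TweetCorpus:
    """Restartable iterable: every call to __iter__ re-reads the file from the start."""

    def __init__(self, path, tokenizer=str.split):
        self.path = path
        self.tokenizer = tokenizer  # placeholder for a real tweet tokenizer

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = self.tokenizer(line)
                if tokens:          # skip lines that produce no tokens
                    yield tokens


# Demonstration on a temporary file with one blank line in the middle.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first tweet here\n\nsecond tweet\n")

corpus = TweetCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)   # a second pass yields the same sentences again
os.remove(tmp.name)
```

An object like this can be passed to both build_vocab and train, since each of them starts a fresh iteration over the file.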