I am trying to train word2vec embeddings on tweets. I defined the sentence class as follows:
def tokenize_tweets():
    for line in codecs.open('../data/sample_tweets.txt', encoding='utf-8'):
        tweet_text = ' '.join([token for token in tknz.tokenize(line) if token not in stopwords.words('english')])
        try:
            mod_text = tokenize(tweet_text)
            tokens = tknz.tokenize(mod_text)
            if len(tokens) > 0:
                yield tknz.tokenize(mod_text)
            else:
                yield ['NULL']
        except UnicodeEncodeError as e:
            yield ['<NULL>']
Building the vocabulary from this class works fine. But when I try to run the train method, I get the following error:
ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count.
I am not sure what is wrong with it.
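For reference, the error message itself names the missing argument, so a call along the following lines may be what gensim expects. This is only a minimal sketch, assuming a gensim version whose `train` accepts `total_examples` and `epochs`. The names `RestartableCorpus`, `train_word2vec`, and `demo_tweets` are hypothetical and not part of the code above. The wrapper is there because gensim reads the corpus more than once (once to build the vocabulary, then once per epoch), and a plain generator would be exhausted after the first pass.

```python
class RestartableCorpus:
    """Hypothetical wrapper: re-runs a generator function on every pass."""

    def __init__(self, generator_fn):
        self.generator_fn = generator_fn

    def __iter__(self):
        # A fresh generator per iteration, so the corpus can be read repeatedly.
        return iter(self.generator_fn())


def train_word2vec(corpus, epochs=5):
    """Sketch of a train call that passes the arguments the error asks for."""
    # Imported lazily so the wrapper above can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(min_count=1)   # min_count=1 is an assumption for tiny samples
    model.build_vocab(corpus)       # first pass: counts sentences and words
    model.train(
        corpus,
        total_examples=model.corpus_count,  # value suggested by the error message
        epochs=epochs,                      # assumed epoch count, adjust as needed
    )
    return model


def demo_tweets():
    # Hypothetical stand-in for tokenize_tweets(); yields token lists.
    yield ['hello', 'world']
    yield ['NULL']
```

With the function from the question, usage would presumably look like `train_word2vec(RestartableCorpus(tokenize_tweets))`.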