Unable to learn the weights of a Word2Vec model

data-mining machine-learning python tensorflow word2vec implementation
2022-02-28 04:43:00

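I am trying to implement a word embedding model, namely Word2Vec, by following this TensorFlow tutorial and slightly adapting its code. Unfortunately, my model does not learn anything. I used TensorBoard to track the value of the loss function and to watch how the network's weights evolve over time. Here is what I found: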

  1. The value of the loss function keeps fluctuating up and down
  2. The weights of the network stay the same throughout training

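Honestly, I cannot understand why this happens. I tried explicitly setting "trainable=True" when creating the variables, but that did not help either. Here is the code I am currently using: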

import tensorflow as tf
import numpy as np

vocabulary_size = 13046
embedding_size = 256
num_noise = 1
learning_rate = 1e-3
batch_size = 1024
epochs = 10

def make_hparam_string(embedding_size, num_noise, learning_rate, batch_size, epochs):
    return f'es={embedding_size}_nn={num_noise}_lr={learning_rate}_bs={batch_size}_e={epochs}'

# These are the hidden layer weights
embeddings = tf.get_variable(name='embeddings', initializer=tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), trainable=True)

# 'nce' stands for 'Noise-contrastive estimation' and represents a particular loss function.
# Check https://www.tensorflow.org/tutorials/representation/word2vec for more details.
# 'nce_weights' and 'nce_biases' are simply the output weights and biases.
# NOTE: for some reason, even though output weights will have shape (embedding_size, vocabulary_size),
#       we have to initialize them with the shape (vocabulary_size, embedding_size)
nce_weights = tf.get_variable(name='output_weights',
                              initializer=tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / np.sqrt(embedding_size)), 
                              trainable=True)
nce_biases = tf.get_variable(name='output_biases', initializer=tf.constant_initializer(0.1), shape=[vocabulary_size], trainable=True)

# Placeholders for inputs
train_inputs = tf.placeholder(tf.int32, shape=[None])    # [batch_size]
train_labels = tf.placeholder(tf.int32, shape=[None, 1]) # [batch_size, 1]

# This allows us to quickly retrieve the corresponding word embeddings for each word in 'train_inputs'
matched_embeddings = tf.nn.embedding_lookup(embeddings, train_inputs)

# Compute the NCE loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=train_labels,
                                     inputs=matched_embeddings,
                                     num_sampled=num_noise,
                                     num_classes=vocabulary_size))

# Use the SGD optimizer to minimize the loss function
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

# Add some summaries for TensorBoard
loss_summary = tf.summary.scalar('nce_loss', loss)
input_embeddings_summary = tf.summary.histogram('input_embeddings', embeddings)
output_embeddings_summary = tf.summary.histogram('output_embeddings', nce_weights)

################################################################################

# Load data
target_words = np.genfromtxt('target_words.txt', dtype=int, delimiter='\n').reshape((-1, 1))
context_words = np.genfromtxt('context_words.txt', dtype=int, delimiter='\n').reshape((-1, 1))

# Convert to tensors
target_words_tensor = tf.convert_to_tensor(target_words)
context_words_tensor = tf.convert_to_tensor(context_words)

# Create a tf.data.Dataset object representing our dataset
dataset = tf.data.Dataset.from_tensor_slices((target_words_tensor, context_words_tensor))
dataset = dataset.shuffle(buffer_size=target_words.shape[0])
dataset = dataset.batch(batch_size)

# Create an iterator to iterate over the dataset
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

# Train the model
with tf.Session() as session:

    # Initialize variables
    session.run( tf.global_variables_initializer() )

    merged_summary = tf.summary.merge_all()

    # File writer for TensorBoard
    hparam_string = make_hparam_string(embedding_size, num_noise, learning_rate, batch_size, epochs)
    loss_writer = tf.summary.FileWriter(f'./tensorboard/{hparam_string}')

    global_step = 0
    for epoch in range(epochs):

        session.run(iterator.initializer)
        while True:
            try:
                inputs, labels = session.run(next_batch)

                feed_dict = {train_inputs: inputs[:, 0], train_labels: labels}
                _, cur_loss, all_summaries = session.run([optimizer, loss, merged_summary], feed_dict=feed_dict)

                # Write summaries to disk
                loss_writer.add_summary(all_summaries, global_step=global_step)
                global_step += 1

                print(f'Current loss: {cur_loss}')

            except tf.errors.OutOfRangeError:
                print(f'Finished epoch {epoch}.')
                break

1 Answer

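From your code, it looks like you are training for 10 epochs. Your model is unlikely to make significant progress in so few epochs. You might start to see learning after 1,000 epochs, but large-scale implementations of word2vec typically require millions of epochs and months of training to reach an acceptable level.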
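One way to tell "no learning at all" apart from "learning too slowly to show up in a histogram" is to compare two snapshots of the weights numerically. The following is only a minimal sketch, not part of the code in the question: the helper name weight_change_report and the toy matrices are hypothetical, and the simulated update (three rows shifted by 1e-3) is an assumption that merely stands in for a few SGD steps. It relies on the fact that an embedding lookup produces gradients only for the rows that appear in a batch, so a single step touches a small fraction of the matrix.

```python
import numpy as np


def weight_change_report(before, after, atol=0.0):
    """Summarize how far a 2-D weight matrix moved between two snapshots.

    Hypothetical helper for illustration; `before` and `after` are NumPy
    arrays of identical shape, e.g. [vocabulary_size, embedding_size].
    """
    diff = np.abs(after - before)
    # A row counts as changed if any of its entries moved by more than atol.
    rows_changed = np.any(diff > atol, axis=1)
    return {
        "max_abs_change": float(diff.max()),
        "mean_abs_change": float(diff.mean()),
        "rows_changed": int(rows_changed.sum()),
        "rows_total": int(before.shape[0]),
    }


# Toy stand-in for an embedding matrix (assumed sizes, not the real model).
rng = np.random.default_rng(0)
before = rng.uniform(-1.0, 1.0, size=(1000, 16)).astype(np.float32)
after = before.copy()

# Simulate a sparse update: only the rows of the words seen in the batch move,
# each by a step of 1e-3 (the learning rate used in the question).
touched = np.array([3, 42, 7])
after[touched] -= np.float32(1e-3)

report = weight_change_report(before, after)
print(report)
```

With the graph from the question, the two snapshots could be taken with session.run(embeddings) before and after a number of training steps. If rows_changed is greater than zero but max_abs_change is tiny, the weights are being updated and the histogram is simply too coarse to show it.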