Pretrained neural network example from Aurélien Géron's book

data-mining machine-learning python neural-network tensorflow autoencoder
2022-02-27 07:18:18

I am testing the pretraining example from Chapter 15 of Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn and TensorFlow". The code is on his GitHub page: here (see the example in the "Unsupervised pretraining" section).

Pretraining a network with weights taken from a previously trained encoder should help with training it. To check this, I modified Aurélien's code slightly so that it prints the error after every mini-batch, and I reduced the batch size. I did this so I could see the error at the very start of training, where the effect of the pretrained weights should be most visible. I expected the pretrained network to start with a lower error than the network without pretraining, since it starts from the pretrained weights. However, pretraining actually seems to slow training down.

Does anyone know why this happens?

The first few lines of output (with pretraining) are:

0 Train accuracy after each mini-batch: 0.08
0 Train accuracy after each mini-batch: 0.24
0 Train accuracy after each mini-batch: 0.32
0 Train accuracy after each mini-batch: 0.2
0 Train accuracy after each mini-batch: 0.32
0 Train accuracy after each mini-batch: 0.26
0 Train accuracy after each mini-batch: 0.32
0 Train accuracy after each mini-batch: 0.5
0 Train accuracy after each mini-batch: 0.58
0 Train accuracy after each mini-batch: 0.48
0 Train accuracy after each mini-batch: 0.54
0 Train accuracy after each mini-batch: 0.48
0 Train accuracy after each mini-batch: 0.5
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.64
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.68
0 Train accuracy after each mini-batch: 0.62
0 Train accuracy after each mini-batch: 0.74
0 Train accuracy after each mini-batch: 0.78

As you can see, the initial accuracy is low. By contrast, when the weights are initialized with He initialization (i.e., without pretraining), the initial accuracy is actually higher:

0 Train accuracy after each mini-batch: 0.62
0 Train accuracy after each mini-batch: 0.5
0 Train accuracy after each mini-batch: 0.52
0 Train accuracy after each mini-batch: 0.38
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.6
0 Train accuracy after each mini-batch: 0.7
0 Train accuracy after each mini-batch: 0.72
0 Train accuracy after each mini-batch: 0.86
0 Train accuracy after each mini-batch: 0.86
0 Train accuracy after each mini-batch: 0.8
0 Train accuracy after each mini-batch: 0.82
0 Train accuracy after each mini-batch: 0.84
0 Train accuracy after each mini-batch: 0.88
0 Train accuracy after each mini-batch: 0.9
0 Train accuracy after each mini-batch: 0.82
0 Train accuracy after each mini-batch: 0.9
0 Train accuracy after each mini-batch: 0.84
0 Train accuracy after each mini-batch: 0.98
0 Train accuracy after each mini-batch: 0.96

In other words, pretraining seems to slow training down, which is the opposite of what it is supposed to do!

My modified code is:

import numpy as np
import sys
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


def train_stacked_autoencoder():
    reset_graph()

    # Load the dataset to use
    mnist = input_data.read_data_sets("/tmp/data/")

    n_inputs = 28 * 28
    n_hidden1 = 300
    n_hidden2 = 150  # codings
    n_hidden3 = n_hidden1
    n_outputs = n_inputs

    learning_rate = 0.01
    l2_reg = 0.0001

    activation = tf.nn.elu
    regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
    initializer = tf.contrib.layers.variance_scaling_initializer()

    X = tf.placeholder(tf.float32, shape=[None, n_inputs])

    weights1_init = initializer([n_inputs, n_hidden1])
    weights2_init = initializer([n_hidden1, n_hidden2])
    weights3_init = initializer([n_hidden2, n_hidden3])
    weights4_init = initializer([n_hidden3, n_outputs])

    weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
    weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
    weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")
    weights4 = tf.Variable(weights4_init, dtype=tf.float32, name="weights4")

    biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
    biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
    biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
    biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

    hidden1 = activation(tf.matmul(X, weights1) + biases1)
    hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
    hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
    outputs = tf.matmul(hidden3, weights4) + biases4

    reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))

    optimizer = tf.train.AdamOptimizer(learning_rate)

    with tf.name_scope("phase1"):
        phase1_outputs = tf.matmul(hidden1, weights4) + biases4  # bypass hidden2 and hidden3
        phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
        phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
        phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
        phase1_training_op = optimizer.minimize(phase1_loss)

    with tf.name_scope("phase2"):
        phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
        phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
        phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
        train_vars = [weights2, biases2, weights3, biases3]
        phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars) # freeze hidden1

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

    training_ops = [phase1_training_op, phase2_training_op]
    reconstruction_losses = [phase1_reconstruction_loss, phase2_reconstruction_loss]
    n_epochs = [4, 4]
    batch_sizes = [150, 150]

    use_cached_results = True

    # Train both phases
    if not use_cached_results:
        with tf.Session() as sess:
            init.run()
            for phase in range(2):
                print("Training phase #{}".format(phase + 1))
                for epoch in range(n_epochs[phase]):
                    n_batches = mnist.train.num_examples // batch_sizes[phase]
                    for iteration in range(n_batches):
                        print("\r{}%".format(100 * iteration // n_batches), end="")
                        sys.stdout.flush()
                        X_batch, y_batch = mnist.train.next_batch(batch_sizes[phase])
                        sess.run(training_ops[phase], feed_dict={X: X_batch})
                    loss_train = reconstruction_losses[phase].eval(feed_dict={X: X_batch})
                    print("\r{}".format(epoch), "Train MSE:", loss_train)
                    saver.save(sess, "./my_model_one_at_a_time.ckpt")
            loss_test = reconstruction_loss.eval(feed_dict={X: mnist.test.images})
            print("Test MSE (uncached method):", loss_test)

    # Train both phases, but in this case we cache the frozen layer outputs
    if use_cached_results:
        with tf.Session() as sess:
            init.run()
            for phase in range(2):
                print("Training phase #{}".format(phase + 1))
                if phase == 1:
                    hidden1_cache = hidden1.eval(feed_dict={X: mnist.train.images})
                for epoch in range(n_epochs[phase]):
                    n_batches = mnist.train.num_examples // batch_sizes[phase]
                    for iteration in range(n_batches):
                        print("\r{}%".format(100 * iteration // n_batches), end="")
                        sys.stdout.flush()
                        if phase == 1:
                            # Phase 2: use the cached output from hidden layer 1
                            indices = np.random.permutation(mnist.train.num_examples)
                            hidden1_batch = hidden1_cache[indices[:batch_sizes[phase]]]
                            feed_dict = {hidden1: hidden1_batch}
                            sess.run(training_ops[phase], feed_dict=feed_dict)
                        else:
                            # Phase 1
                            X_batch, y_batch = mnist.train.next_batch(batch_sizes[phase])
                            feed_dict = {X: X_batch}
                            sess.run(training_ops[phase], feed_dict=feed_dict)
                    loss_train = reconstruction_losses[phase].eval(feed_dict=feed_dict)
                    print("\r{}".format(epoch), "Train MSE:", loss_train)
                    saver.save(sess, "./my_model_cache_frozen.ckpt")
            loss_test = reconstruction_loss.eval(feed_dict={X: mnist.test.images})
            print("Test MSE (cached method):", loss_test)


def unsupervised_pretraining():
    reset_graph()

    # Load the dataset to use
    mnist = input_data.read_data_sets("/tmp/data/")

    n_inputs = 28 * 28
    n_hidden1 = 300
    n_hidden2 = 150
    n_outputs = 10

    learning_rate = 0.01
    l2_reg = 0.0005

    activation = tf.nn.elu
    regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
    initializer = tf.contrib.layers.variance_scaling_initializer()

    X = tf.placeholder(tf.float32, shape=[None, n_inputs])
    y = tf.placeholder(tf.int32, shape=[None])

    weights1_init = initializer([n_inputs, n_hidden1])
    weights2_init = initializer([n_hidden1, n_hidden2])
    weights3_init = initializer([n_hidden2, n_outputs])

    weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
    weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
    weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")

    biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
    biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
    biases3 = tf.Variable(tf.zeros(n_outputs), name="biases3")

    hidden1 = activation(tf.matmul(X, weights1) + biases1)
    hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
    logits = tf.matmul(hidden2, weights3) + biases3

    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    reg_loss = regularizer(weights1) + regularizer(weights2) + regularizer(weights3)
    loss = cross_entropy + reg_loss
    optimizer = tf.train.AdamOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    init = tf.global_variables_initializer()
    pretrain_saver = tf.train.Saver([weights1, weights2, biases1, biases2])
    saver = tf.train.Saver()

    n_epochs = 4
    batch_size = 50
    n_labeled_instances = 2000

    pretraining = True

    # Regular training (without pretraining):
    if not pretraining:
        with tf.Session() as sess:
            init.run()
            for epoch in range(n_epochs):
                n_batches = n_labeled_instances // batch_size
                for iteration in range(n_batches):
                    #print("\r{}%".format(100 * iteration // n_batches), end="")
                    #sys.stdout.flush()
                    indices = np.random.permutation(n_labeled_instances)[:batch_size]
                    X_batch, y_batch = mnist.train.images[indices], mnist.train.labels[indices]
                    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                    accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                    print("\r{}".format(epoch), "Train accuracy after each mini-batch:", accuracy_val)
                    sys.stdout.flush()
                accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                print("\r{}".format(epoch), "Train accuracy after all batched:", accuracy_val, end=" ")
                saver.save(sess, "./my_model_supervised.ckpt")
                accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
                print("Test accuracy (without pretraining):", accuracy_val)

    # Now reuse the first two layers of the autoencoder we pretrained:
    if pretraining:
        training_op = optimizer.minimize(loss, var_list=[weights3, biases3])  # Freeze layers 1 and 2 (optional)
        with tf.Session() as sess:
            init.run()
            pretrain_saver.restore(sess, "./my_model_cache_frozen.ckpt")
            for epoch in range(n_epochs):
                n_batches = n_labeled_instances // batch_size
                for iteration in range(n_batches):
                    #print("\r{}%".format(100 * iteration // n_batches), end="")
            #sys.stdout.flush()
                    indices = np.random.permutation(n_labeled_instances)[:batch_size]
                    X_batch, y_batch = mnist.train.images[indices], mnist.train.labels[indices]
                    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                    accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                    print("\r{}".format(epoch), "Train accuracy after each mini-batch:", accuracy_val)
                    sys.stdout.flush()
                accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                print("\r{}".format(epoch), "Train accuracy after all batched:", accuracy_val, end=" ")
                saver.save(sess, "./my_model_supervised_pretrained.ckpt")
                accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
                print("Test accuracy (with pretraining):", accuracy_val)



if __name__ == "__main__":
    # Seed the random number generator
    np.random.seed(42)
    tf.set_random_seed(42)

    # Fit a multi-layer autoencoder and save the weights
    # - this part is from Aurelien Geron's Ch 15, "Training one Autoencoder at a time in a single graph" example
    train_stacked_autoencoder()

    # Fit a network, using the weights previously saved for pretraining
    # - this part is from Aurelien Geron's Ch 15, "Unsupervised pretraining" example
    unsupervised_pretraining()
1 Answer

[Note: I haven't worked through Aurélien Géron's tutorial myself, but I have read the book.]


On an intuitive level, I can convince myself that training would actually be slower for a pretrained model. In other words, it makes sense that the rate at which the error decreases (or the accuracy increases) might be lower. The fact that the training accuracy itself is lower is, at least to me, somewhat more subtle, and perhaps case-specific.

Learning rate

However, pretraining actually seems to slow training down.

With a pretrained model, we are essentially taking a set of weights that have already been (at least partially) optimized for one problem. They are committed to solving that problem for the dataset they were given, which means they expect inputs drawn from a certain distribution. You have frozen the first two layers with this line:

if pretraining:
    training_op = optimizer.minimize(loss, var_list=[weights3, biases3])

Freezing two layers (out of three, in your case) intuitively restricts the model.
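
As a quick sanity check, here is a minimal sketch (reusing the variable names from the code above, and assuming the graph from your unsupervised_pretraining() function has already been built) that runs one frozen training step and confirms that only the unfrozen variables actually move:

# Minimal sketch: verify that var_list freezing leaves weights1 untouched.
# Assumes X, y, loss, optimizer, weights1, weights3 and mnist exist as in
# the question's code.
frozen_training_op = optimizer.minimize(loss, var_list=[weights3, biases3])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    w1_before, w3_before = sess.run([weights1, weights3])
    X_batch, y_batch = mnist.train.next_batch(50)
    sess.run(frozen_training_op, feed_dict={X: X_batch, y: y_batch})
    w1_after, w3_after = sess.run([weights1, weights3])
    print("weights1 changed:", not np.allclose(w1_before, w1_after))  # expect: False
    print("weights3 changed:", not np.allclose(w3_before, w3_after))  # expect: True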

Here is a somewhat contrived analogy I might use to explain the situation to myself. Imagine we have a clown who can juggle three balls, but we now want them to learn to juggle a fourth. At the same time, we ask a complete amateur to learn to juggle, also with four balls. Before measuring how fast each of them learns, we decide to tie one of the clown's hands behind their back. So the clown already knows some tricks, but is also constrained in a certain way during learning. It seems quite plausible to me that the amateur will learn faster (relatively speaking), because there is more left for them to learn, and also because they have more freedom to explore the parameter space: they can move more freely with both arms.

In an optimization setting, one might imagine that the pretrained model sits at a point on the loss surface where the gradients are already very small along some dimensions (don't forget we are in a high-dimensional search space). This ultimately means it cannot change its outputs quickly while backpropagating the error, because the weight updates scale with these possibly tiny gradients of the already-optimized weights.
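
If you want to probe this hypothesis directly, one rough sketch (again borrowing the names from the question's graph, and assuming the checkpoint from the autoencoder phase exists) is to compare the per-layer gradient norms just after restoring the pretrained weights with those of a freshly initialized network:

# Rough sketch: per-layer gradient norms at the very start of training.
# Small norms for weights1/weights2 after restoring the checkpoint would
# support the "tiny gradients" explanation above.
grad_norms = [tf.norm(g) for g in tf.gradients(loss, [weights1, weights2, weights3])]

with tf.Session() as sess:
    init.run()
    # Comment out the next line to measure the He-initialized network instead.
    pretrain_saver.restore(sess, "./my_model_cache_frozen.ckpt")
    X_batch, y_batch = mnist.train.next_batch(50)
    print(sess.run(grad_norms, feed_dict={X: X_batch, y: y_batch}))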

...OK, that sounds plausible, but it only addresses the slow learning. What about the fact that the initial training accuracy is actually lower than that of the randomly initialized model?

Initial training accuracy

I expected the pretrained network to start with a lower error (compared with the network without pretraining)...

Here I tend to agree with you. In the best case, we can take a pretrained model, use the initial layers as-is, and fine-tune only the final layers. In some cases, however, this simply doesn't work.

Looking through the literature, there is a possible explanation in the abstract of the paper "How transferable are features in deep neural networks?" (Yosinski et al.):

Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected.

I find the second issue particularly interesting and relevant to your setup, because you really only have three layers. You are therefore not leaving much freedom for fine-tuning, and the last layer may depend heavily on its relationship with the preceding layers.

What you might hope to see, as a result of using a pretrained model, is that the final model generalizes better. That may well come at the cost of somewhat lower test accuracy on a held-out set from the specific dataset you trained on.

Here is one more idea, nicely summarized in the excellent (and free) Stanford CS231n course:

It is common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly initialized) weights for the new linear classifier that computes the class scores of the new dataset.

In your code, the learning rate appears to be fixed at 0.01 for all phases of learning. This is something you could experiment with: use a smaller learning rate for the pretrained layers, or simply start from a lower learning rate globally.
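
In TF 1.x, one way to implement this (a sketch, not the book's code; the 0.001/0.01 split is purely illustrative) is to apply two optimizers with different learning rates to the two groups of variables:

# Sketch: smaller learning rate for the pretrained layers, larger one for
# the new output layer. This would replace training_op in the pretraining
# branch of the question's code.
pretrained_vars = [weights1, biases1, weights2, biases2]
new_vars = [weights3, biases3]

slow_optimizer = tf.train.AdamOptimizer(0.001)  # gentle fine-tuning
fast_optimizer = tf.train.AdamOptimizer(0.01)   # new classifier moves freely

grads = tf.gradients(loss, pretrained_vars + new_vars)
n = len(pretrained_vars)
training_op = tf.group(
    slow_optimizer.apply_gradients(list(zip(grads[:n], pretrained_vars))),
    fast_optimizer.apply_gradients(list(zip(grads[n:], new_vars))))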

Here is a comprehensive introduction to transfer learning that may give you some more ideas about why and where you might make different modeling decisions.