Loss suddenly jumps when I decay the learning rate with the Adam optimizer in PyTorch

artificial-intelligence machine-learning deep-learning optimization autoencoder objective-function
2021-11-13 22:42:51

I'm training an auto-encoder network with an Adam optimizer (with amsgrad=True) and MSE loss for a single-channel audio source separation task. Whenever I decay the learning rate by a factor, the network loss jumps abruptly and then decreases until the next decay in learning rate.

I'm using PyTorch for implementing and training the network.

Following are my experimental setups:

 Setup-1: NO learning rate decay, and 
          Using the same Adam optimizer for all epochs

 Setup-2: NO learning rate decay, and 
          Creating a new Adam optimizer with same initial values every epoch

 Setup-3: 0.25 decay in learning rate every 25 epochs, and
          Creating a new Adam optimizer every epoch

 Setup-4: 0.25 decay in learning rate every 25 epochs, and
          NOT creating a new Adam optimizer every time, rather
          using PyTorch's "MultiStepLR" and "ExponentialLR" decay schedulers
          every 25 epochs

For Setups #2, #3 and #4, I'm getting very surprising results that I cannot explain. Below are my results:

Setup-1 Results:

Here I'm NOT decaying the learning rate and 
I'm using the same Adam optimizer. So my results are as expected.
My loss decreases with more epochs.
Below is the loss plot for this setup.

Plot 1: Setup-1 results

# the same Adam optimizer instance is used across all epochs
optimizer = torch.optim.Adam(lr=m_lr, amsgrad=True, ...........)

for epoch in range(num_epochs):
    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

Setup-2 Results:  

Here I'm NOT decaying the learning rate but every epoch I'm creating a new
Adam optimizer with the same initial parameters.
Here also, the results show behavior similar to Setup-1.

Because a new Adam optimizer is created at every epoch, the gradient statistics it
accumulates for each parameter should be lost, but it seems that this does not affect the
network's learning. Can anyone please help with this?

Plot 2: Setup-2 results

for epoch in range(num_epochs):
    # a new Adam optimizer with the same initial settings is created every epoch
    optimizer = torch.optim.Adam(lr=m_lr, amsgrad=True, ...........)

    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

Setup-3 Results: 

As can be seen from the plot below,
my loss jumps every time I decay the learning rate. This is weird behavior.

If it were happening because I'm creating a new Adam
optimizer every epoch, then it should have happened in Setup #2 as well.
And if it were happening because a new Adam optimizer is created with a new
learning rate (alpha) every 25 epochs, then the results of Setup #4 below also
rule out such a correlation.

Plot 3: Setup-3 results

decay_rate = 0.25
lr = m_lr
for epoch in range(num_epochs):
    if epoch % 25 == 0 and epoch != 0:
        lr *= decay_rate   # decay the learning rate

    # a new Adam optimizer is created every epoch, with the (possibly decayed) lr
    optimizer = torch.optim.Adam(lr=lr, amsgrad=True, ...........)

    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

Setup-4 Results:  

In this setup, I'm using PyTorch's learning-rate decay scheduler (MultiStepLR),
which decays the learning rate by a factor of 0.25 every 25 epochs.
Here also, the loss jumps every time the learning rate is decayed.

As suggested by @Dennis in the comments below, I tried with both ReLU and 1e-02 leakyReLU nonlinearities. However, the results seem to behave similarly: the loss first decreases, then increases, and then saturates at a value higher than what I would reach without learning rate decay.
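For reference, a minimal sketch of that activation swap (not my original code; the modules are the standard torch.nn ones, and 1e-02 is the slope mentioned above):

    import torch.nn as nn

    # leaky variant tried above; negative_slope is the 1e-02 value
    activation = nn.LeakyReLU(negative_slope=1e-2)   # instead of nn.ReLU()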

Plot 4 shows the results.

Plot 4: Setup-4 results

optimizer = torch.optim.Adam(lr=m_lr, amsgrad=True, ...........)

# one of the two schedulers below was used per run
# (the optimizer must exist before the scheduler is created):
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer=optimizer, milestones=[25,50,75], gamma=0.25)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=0.95)

for epoch in range(num_epochs):

    scheduler.step()   # note: the LR is decayed at the START of each epoch here

    running_loss = 0.0
    for i in range(num_train):
        train_input_tensor = ..........                    
        train_label_tensor = ..........
        optimizer.zero_grad()
        pred_label_tensor = model(train_input_tensor)
        loss = criterion(pred_label_tensor, train_label_tensor)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    loss_history[m_lr].append(running_loss/num_train)

Edit:

  • As suggested in the comments and in the answer below, I made changes to my code and trained the model. I have added the corresponding code and plots above.
  • As suggested by @Dennis in the comments below, I tried various lr_scheduler options in PyTorch (MultiStepLR, ExponentialLR); the corresponding plots are shown under Setup-4.
  • Tried with leakyReLU as suggested by @Dennis in the comments.

Any help is appreciated. Thanks.

1 Answer

I can't see any reason why decaying the learning rate would create the kinds of jumps in loss that you are observing. It should "slow down" how quickly you "move"; in the case of a loss that otherwise consistently shrinks, the worst it should do is lead to a plateau in your loss (rather than those jumps).

The first thing I observe in your code is that you re-create the optimizer from scratch every epoch. I have not worked with PyTorch enough to tell for sure, but doesn't this destroy the optimizer's internal state/memory every time? I think you should create the optimizer just once, before the loop over the epochs. If this is indeed a bug in your code, it should actually also be a bug in the case where you do not use learning rate decay... but perhaps you simply got lucky there and did not experience the same negative effects of the bug.
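A quick way to see what gets lost is to inspect the optimizer's state before and after re-creating it. Here is a minimal sketch (the tiny Linear model is just a stand-in, not your network):

    import torch

    model = torch.nn.Linear(4, 1)   # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()

    # after one step, Adam holds per-parameter running moments:
    state = optimizer.state[next(model.parameters())]
    print(sorted(state.keys()))   # ['exp_avg', 'exp_avg_sq', 'max_exp_avg_sq', 'step']

    # re-creating the optimizer discards all of that accumulated state:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
    print(len(optimizer.state))   # 0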

For the learning rate decay, I would recommend using the official API rather than a manual solution. In your particular case, you will want to instantiate a StepLR scheduler with:

  • optimizer = the Adam optimizer, which you should probably instantiate only once.
  • step_size = 25
  • gamma = 0.25

Then you can simply call scheduler.step() at the start of every epoch (or maybe at the end? The example in the API documentation calls it at the start of every epoch). A minimal sketch is given below.
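Putting that together (a sketch assuming the model, m_lr, num_epochs, criterion and the inner training loop from your question; note that recent PyTorch versions expect scheduler.step() to be called after the epoch's optimizer.step() calls, not before):

    optimizer = torch.optim.Adam(model.parameters(), lr=m_lr, amsgrad=True)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.25)

    for epoch in range(num_epochs):
        for i in range(num_train):
            # ... same inner loop as in your question ...
            optimizer.zero_grad()
            loss = criterion(model(train_input_tensor), train_label_tensor)
            loss.backward()
            optimizer.step()
        scheduler.step()   # decay once per epoch, after the optimizer steps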


If, after making the changes described above, you still experience the problem, it would also be useful to run every experiment multiple times and plot the averaged results (or plot the lines for all runs). The experiments should theoretically be identical during the first 25 epochs, yet we still see major differences between the two figures even during those first 25 epochs, in which no learning rate decay has occurred yet (e.g., one figure starting at a loss of ~28K, the other starting at ~40K). This may simply be due to different random initializations, so it would be good to average that nondeterminism out of your plots.
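For example, a minimal sketch of that averaging (run_experiment is a hypothetical stand-in for one full training run that returns the per-epoch losses):

    import numpy as np

    # run_experiment(seed) is hypothetical: one full training run -> list of per-epoch losses
    curves = np.stack([run_experiment(seed) for seed in range(5)])

    mean_curve = curves.mean(axis=0)   # averages the random-initialization noise out
    std_curve = curves.std(axis=0)     # optionally plot the spread as well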