Standard momentum would take the following steps:
1. Right away, re-compute the new momentum:
$\mu_{t+1} := \mu_t \cdot \text{decayScalar} + \text{learnRate} \cdot \nabla$
2. Adjust the weights by this new momentum:
$\theta_{t+1} := \theta_t - \mu_{t+1}$
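For comparison with the Nesterov variant below, a minimal sketch of these two steps as a plain C++ loop might look like this (the function name, signature and parameter names are illustrative, not taken from any particular library):

#include <cstddef>

// One step of standard momentum over a flat array of weights.
void standardMomentumStep(float *weights, const float *grad, float *momentum,
                          size_t count, float decayScalar, float learnRate){
    for (size_t i = 0; i < count; ++i){
        // 1) right away, re-compute the new momentum
        momentum[i] = momentum[i] * decayScalar + learnRate * grad[i];
        // 2) adjust the weights by this new momentum
        weights[i] -= momentum[i];
    }
}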
Nesterov momentum goes like this:
1. Big jump: adjust the weights by whatever momentum $\mu$ we have so far (as defined in step 4 below, $\mu$ is a displacement of the weights, so it is added to them):
$\theta_{t+1} := \theta_t + \mu \cdot \text{decayScalar}$
2. Compute the gradient $\nabla$ at the new weights $\theta_{t+1}$.
3. Correct these weights by this gradient (without any momentum, for now):
$\theta_{t+2} := \theta_{t+1} - \text{learnRate} \cdot \nabla$
4. Finally, re-compute the momentum:
$\mu := \theta_{t+2} - \theta_t$
So the momentum is updated last: it becomes the vector pointing from the "weights before the big jump" to the "weights after the fresh gradient correction".
Reference: Geoffrey Hinton, Lecture 6C, Coursera.
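Before re-arranging, a straightforward sketch of these four steps in their original order could look like the following (computeGradient is a hypothetical stand-in for the network's backprop; all other names are illustrative too):

#include <cstddef>
#include <vector>

// One Nesterov step in the original order: big jump, gradient, correction, new momentum.
void nesterovStep(float *theta, float *mu, size_t count,
                  float decayScalar, float learnRate,
                  void (*computeGradient)(const float *weights, float *gradOut, size_t n)){
    std::vector<float> thetaBeforeJump(theta, theta + count);
    std::vector<float> grad(count);
    // 1) big jump by whatever momentum we have so far
    for (size_t i = 0; i < count; ++i) theta[i] += mu[i] * decayScalar;
    // 2) gradient at the jumped-to weights
    computeGradient(theta, grad.data(), count);
    // 3) correct by that fresh gradient alone (no momentum yet)
    for (size_t i = 0; i < count; ++i) theta[i] -= learnRate * grad[i];
    // 4) momentum := (weights after the correction) - (weights before the jump)
    for (size_t i = 0; i < count; ++i) mu[i] = theta[i] - thetaBeforeJump[i];
}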
Re-arranging:
To avoid having the gradient computation stuck in the middle of the optimizer function (steps 2 and 3), we can re-arrange the steps as follows:
1. Compute the gradient at the weights we currently have.
2. Correct these weights by that gradient (without any momentum, for now):
$\theta_{t+1} := \theta_t - \text{learnRate} \cdot \nabla$
3. Update the momentum:
$\mu := \theta_{t+1} - \theta_{\text{cached}}$
$\theta_{\text{cached}} := \theta_{t+1}$
4. Big jump:
$\theta_{t+2} := \theta_{t+1} + \mu \cdot \text{decayScalar}$
Notice that this way steps 2, 3 and 4 all sit inside our optimizer, while the gradient can be computed outside of it (step 1), which makes the code more readable :)
size_t _numApplyCalled = 0;

// Nesterov Accelerated Gradient.
// Placed at the end of a backprop pass; should be followed by a usual forward propagation.
// https://datascience.stackexchange.com/a/26395/43077
void apply( float *toChange, float *newGrad, size_t count ){
    const float learnRate     = get(OptimizerVar::LEARN_RATE);  // scalar
    const float momentumCoeff = get(OptimizerVar::MOMENTUM_1);  // scalar
    const bool isFirstEver_apply = (_numApplyCalled == 0);

    for (size_t i = 0; i < count; ++i){
        // step 2: correction by the fresh gradient alone:
        toChange[i] -= newGrad[i] * learnRate;

        // step 3: determine the momentum:
        if (isFirstEver_apply){ // nothing was cached yet
            _momentumVals[i] = 0.0f;
        }
        else {
            _momentumVals[i] = toChange[i] - _weightsCached[i];
        }
        // caching, AFTER the momentum calc, but BEFORE the jump:
        _weightsCached[i] = toChange[i];

        // step 4: big jump along the momentum:
        toChange[i] += _momentumVals[i] * momentumCoeff;
    }//end for

    ++_numApplyCalled;
}
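The apply() above assumes an enclosing optimizer class with members such as _momentumVals, _weightsCached and a get() accessor for the hyper-parameters. A minimal, purely hypothetical sketch of that context and of the call site could look like this:

#include <cstddef>
#include <vector>

// Hypothetical minimal context for apply(); member names mirror the snippet above,
// everything else is an illustrative guess, not taken from any particular code base.
enum class OptimizerVar { LEARN_RATE, MOMENTUM_1 };

struct NesterovOptimizer {
    std::vector<float> _momentumVals;   // per-weight momentum, starts at zero
    std::vector<float> _weightsCached;  // weights right after the previous gradient correction
    float _learnRate = 0.01f;
    float _momentumCoeff = 0.9f;

    explicit NesterovOptimizer(size_t paramCount)
        : _momentumVals(paramCount, 0.0f), _weightsCached(paramCount, 0.0f) {}

    float get(OptimizerVar v) const {
        return (v == OptimizerVar::LEARN_RATE) ? _learnRate : _momentumCoeff;
    }

    // ...plus the _numApplyCalled counter and the apply() shown above.
};

// Per training iteration: forward-propagate, backprop into `grads`,
// then hand the weights and the fresh gradients to the optimizer:
//   optimizer.apply(weights.data(), grads.data(), weights.size());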