Gradient checking: MeanSquareError. Why does a huge epsilon improve the discrepancy?

Tags: data mining, gradient descent, backpropagation, gradient

2022-02-08 22:11:10

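I'm using custom C++ code and have written a simple "mean squared error" layer. For now I'm using it for a classification task rather than plain regression... maybe that is what causes the problem?

There is nothing else before this layer, not even a simple dense layer. It is just the MSE on its own. Its input is a collection of rows of input features. For example, here are 8 rows of input features that are all passed to the MSE at once: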

{ a0, a1, a2, a3, a4, a5, a6, a7 }
{ b0, b1, b2, b3, b4, b5, b6, b7 }
{ c0, c1, c2, c3, c4, c5, c6, c7 }
{ d0, d1, d2, d3, d4, d5, d6, d7 }
{ e0, e1, e2, e3, e4, e5, e6, e7 }
{ f0, f1, f2, f3, f4, f5, f6, f7 }
{ g0, g1, g2, g3, g4, g5, g6, g7 }
{ h0, h1, h2, h3, h4, h5, h6, h7 }    // 8x8 matrix (contains 64 different values)

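Each row of this matrix is passed to my "mean squared error" layer, which returns one scalar for that row: its "cost".

I then compute a "final error" scalar, which is the average of these costs.

During gradient checking I look at how this "final error" changes when I perturb each of the 64 input values shown above. The idea is that the change in finalError with respect to each of my 64 input values must correspond to the gradient computed by the formula. If they match, I have coded backprop correctly.

Here is the forward prop: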

$finalError = \frac{1}{r}\sum^r{ \left( \frac{1}{2n}\sum^n{(input_i-target)^2} \right) }$

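where $n$ is the number of features per row and $r$ is the number of rows.

Here is the gradient with respect to one of the input values, as used by my backprop: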

$\frac{\partial finalError}{\partial input_i} = \frac{1}{rn}(input_i - target)$

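The problem:

I nudge each input value "up" and then "down", running forward prop 64*2 = 128 times. This gives me a numerical estimate of the gradient for my 64 input values.

However, when I use a smaller epsilon, this numerical estimate and the actual analytic gradient become less similar. That is counterintuitive to me. Conversely, the two vectors match almost perfectly when I use a ridiculously large value for epsilon, for example $1$.

Is this expected, or do I have a bug in my C++ code?

Here is the pseudocode: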

for every input value i:
   saved = i
   i = saved - EPSILON
   finalCost_down = fwdprop( inputMatrix )  // very simple: just computes the final cost via the MSE layer; finalCost_down is a scalar
   i = saved + EPSILON
   finalCost_up   = fwdprop( inputMatrix )
   i = saved  // restore the original value before perturbing the next input
   gradientEstimate[i] = (finalCost_up - finalCost_down) / (2*EPSILON)

//after the loop, some time later, just one invocation of backprop:
trueGradientVec = backprop( vec )

//some time later:

discrepancyScalar = (gradientEstimate - trueGradientVec).magnitude / (gradientEstimate.magnitude + trueGradientVec.magnitude)

// somehow discrepancyScalar decreases as EPSILON gets larger:
// discrepancy is 0.00275, if EPSILON is 0.0001
// discrepancy is 0.00025, if EPSILON is 0.001
// discrepancy is 2.198e-05, if EPSILON is 0.01
// discrepancy is 3.149e-06, if EPSILON is 0.1
// discrepancy is 2.751e-07, if EPSILON is 1

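I expected the discrepancy to shrink as epsilon decreases, because a finer perturbation should give a more precise estimate of the slope...

Andrew Ng's explanation of gradient checking

1 Answer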

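This was caused by the numerical precision of floats. It becomes very noticeable when a single input value is changed by a small $\epsilon$, because (in my example above) that changes the cost function only very slightly. The two costs being subtracted are then almost equal, so most of their significant digits cancel, and the rounding error that remains is amplified by the division by $2\epsilon$.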
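The effect can be reproduced with a small sketch that runs the same central-difference check in a chosen precision. Everything here is an assumption made for illustration: the names `forwardCost` and `checkDiscrepancy`, the made-up deterministic data, and the guess that the original code stored values as single-precision `float`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Forward prop in precision T: mean over rows of 1/(2n) * sum (input - target)^2.
template <typename T>
T forwardCost(const std::vector<T>& input, const std::vector<T>& target,
              std::size_t r, std::size_t n) {
    T total = T(0);
    for (std::size_t row = 0; row < r; ++row) {
        T rowCost = T(0);
        for (std::size_t col = 0; col < n; ++col) {
            const T diff = input[row * n + col] - target[row * n + col];
            rowCost += diff * diff;
        }
        total += rowCost / (T(2) * static_cast<T>(n));
    }
    return total / static_cast<T>(r);
}

// Runs the central-difference check in precision T and returns
// |estimate - analytic| / (|estimate| + |analytic|), using Euclidean norms.
template <typename T>
double checkDiscrepancy(std::size_t r, std::size_t n, T eps) {
    // Made-up deterministic data: inputs in [0, 1), targets alternating 0 and 1.
    std::vector<T> input(r * n), target(r * n);
    for (std::size_t i = 0; i < r * n; ++i) {
        input[i] = static_cast<T>((i * 37 % 101) / 101.0);
        target[i] = static_cast<T>(i % 2);
    }

    double diffSq = 0.0, estSq = 0.0, trueSq = 0.0;
    for (std::size_t i = 0; i < r * n; ++i) {
        const T saved = input[i];
        input[i] = saved - eps;
        const T costDown = forwardCost(input, target, r, n);
        input[i] = saved + eps;
        const T costUp = forwardCost(input, target, r, n);
        input[i] = saved;  // restore before perturbing the next value

        const double estimate =
            static_cast<double>((costUp - costDown) / (T(2) * eps));
        const double analytic =
            static_cast<double>((saved - target[i]) / static_cast<T>(r * n));
        diffSq += (estimate - analytic) * (estimate - analytic);
        estSq += estimate * estimate;
        trueSq += analytic * analytic;
    }
    return std::sqrt(diffSq) / (std::sqrt(estSq) + std::sqrt(trueSq));
}
```

Under these assumptions, `checkDiscrepancy<float>(8, 8, 1e-4f)` should come out noticeably larger than `checkDiscrepancy<float>(8, 8, 1.0f)`, while `checkDiscrepancy<double>(8, 8, 1e-4)` should already be tiny, which points at rounding rather than at a backprop bug.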

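The tricky part is this: because the cost function is the mean squared error (MSE) and there are no other layers in my network, we can in fact use any epsilon to estimate the slope. Even an absurdly large epsilon works and is numerically more stable, which explains why the discrepancy appears to get better. That is simply how $y=x^2$ behaves.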
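To see why any epsilon works for a quadratic, consider a single input $x$ with target $t$ and leave out the constant factor $\frac{1}{2rn}$ (a short worked derivation, added for illustration). In the central difference the $\epsilon^2$ terms cancel exactly:

$\frac{(x+\epsilon-t)^2 - (x-\epsilon-t)^2}{2\epsilon} = \frac{4\epsilon(x-t)}{2\epsilon} = 2(x-t)$

The result is the exact derivative of $(x-t)^2$ for every $\epsilon$, so the estimate has no truncation error and only floating-point rounding is left, which is relatively larger when $\epsilon$ is small.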

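In practice, for gradient checking I switched to a different cost function, one that is linear (rather than MSE). This improved numerical stability and allowed me to use an epsilon 10 times smaller: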

$C=\sum^r\sum^n(obtained-expected)$

The gradient is simple:

$\frac{\partial C}{\partial (obtained)_{rn}} = 1$

I wouldn't use this cost in production, only for gradient checking.

So yes, this explains why a large epsilon was improving the discrepancy.

Edit:

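Note that you must not use this linear cost function if a softmax comes right before it. That is because the outputs of a softmax always sum to 1.0, so the linear cost stays constant and its gradient with respect to the softmax inputs is zero, which makes the check meaningless.

In that case you have to use MSE.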
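A minimal sketch of why the linear cost breaks down after a softmax: the outputs always sum to 1, so perturbing any softmax input leaves $C$ unchanged. The function names `softmax`, `linearCost` and `slopeThroughSoftmax` are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one non-empty row of logits.
std::vector<double> softmax(const std::vector<double>& logits) {
    const double maxLogit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> out(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - maxLogit);
        sum += out[i];
    }
    for (double& v : out) v /= sum;
    return out;
}

// Linear check cost from the answer: sum of (obtained - expected).
double linearCost(const std::vector<double>& obtained,
                  const std::vector<double>& expected) {
    double c = 0.0;
    for (std::size_t i = 0; i < obtained.size(); ++i) {
        c += obtained[i] - expected[i];
    }
    return c;
}

// Central-difference slope of linearCost(softmax(logits)) w.r.t. logits[index].
double slopeThroughSoftmax(std::vector<double> logits,
                           const std::vector<double>& expected,
                           std::size_t index, double eps) {
    const double saved = logits[index];
    logits[index] = saved + eps;
    const double up = linearCost(softmax(logits), expected);
    logits[index] = saved - eps;
    const double down = linearCost(softmax(logits), expected);
    return (up - down) / (2.0 * eps);
}
```

With a one-hot `expected` vector the cost is zero up to rounding, and `slopeThroughSoftmax` returns roughly zero for every index, so a wrong softmax backprop could pass the check unnoticed.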
