Let me give an example where Andrew's recommendation works better than yours:
Suppose the true gradient is $(0, 0, 0)$ and the gradient you have computed is $(10^{-4}, 10^{-4}, 10^{-4})$. Then your average would return $10^{-8}$, while Andrew's recommendation would return $1$. Your metric could fool you into thinking that your gradient is computed properly and that the error is just a numerical issue. Andrew's cannot fool you that way, because it takes into account that the gradient itself can be very small.
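To sum up: if the norm of your gradient is not close to zero, your metric is fine. When the gradient is close to zero, however, it could fool you into thinking your gradient is correct when it is not.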
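
The numbers in the example can be checked with a short sketch. It rests on two assumptions that are not spelled out in this answer: that "your average" is the mean of the squared componentwise differences, and that Andrew's recommendation is the relative difference $\|g - \hat{g}\|_2 / (\|g\|_2 + \|\hat{g}\|_2)$. The function names are made up for illustration.

```python
import math


def mean_squared_difference(true_grad, computed_grad):
    # Assumed form of "your average": mean of the squared componentwise differences.
    squared = [(t - c) ** 2 for t, c in zip(true_grad, computed_grad)]
    return sum(squared) / len(squared)


def relative_difference(true_grad, computed_grad):
    # Assumed form of Andrew's recommendation:
    # ||true - computed|| / (||true|| + ||computed||), using Euclidean norms.
    diff_norm = math.sqrt(sum((t - c) ** 2 for t, c in zip(true_grad, computed_grad)))
    denom = math.sqrt(sum(t ** 2 for t in true_grad)) + math.sqrt(
        sum(c ** 2 for c in computed_grad)
    )
    if denom == 0.0:
        # Both gradients are exactly zero, so they agree.
        return 0.0
    return diff_norm / denom


true_grad = (0.0, 0.0, 0.0)
computed_grad = (1e-4, 1e-4, 1e-4)

msd = mean_squared_difference(true_grad, computed_grad)
rel = relative_difference(true_grad, computed_grad)

print(msd)  # about 1e-08: looks like harmless numerical noise
print(rel)  # 1.0: the computed gradient is entirely wrong relative to its size
```

Under these assumptions the unnormalized average shrinks together with the gradient, while the relative difference stays at its maximum of $1$ no matter how small the gradients are.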