Why isn't the step function used as an activation function in machine learning?

machine-learning, neural-networks
2022-01-23 05:44:39

The activation functions I see in practice are either sigmoid or tanh. Why isn't the step function used? What is bad about using a step function as the activation function in a neural network? What are the effects of using a step function? In what ways are sigmoid/tanh superior to step?

4 Answers

There are two main reasons why we cannot use the Heaviside step function in (deep) neural networks:

  1. Currently, one of the most effective ways to train a multi-layer neural network is gradient descent combined with backpropagation. The backpropagation algorithm requires a differentiable activation function. However, the Heaviside step function is not differentiable at x = 0, and its derivative is 0 everywhere else. This means gradient descent cannot make any progress in updating the weights (see the sketch after this list).
  2. Recall that the main goal of a neural network is to learn values for the weights and biases so that the model produces predictions as close to the true values as possible. To do this, as in many optimization problems, we want a small change in a weight or bias to cause only a small corresponding change in the network's output. That way we can keep nudging the weights and biases toward the best approximation. A function that can only output 0 or 1 (yes or no) does not help us achieve this.
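
A minimal sketch of point 1 (the single-neuron model, toy data, and learning rate below are invented purely for illustration): with a Heaviside activation the chain-rule gradient is zero almost everywhere, so gradient descent leaves the weight untouched, while a sigmoid lets it move.

```python
import numpy as np

# Toy example (made-up data): a single "neuron" y_hat = act(w * x),
# trained with squared loss by gradient descent. With the Heaviside step
# as act, d(act)/dz is 0 almost everywhere, so the chain rule gives a
# zero gradient and the weight never moves.

x = np.array([0.5, -1.2, 2.0, 0.3])
y = np.array([1.0, 0.0, 1.0, 0.0])

def step(z):             # Heaviside step
    return (z >= 0).astype(float)

def step_grad(z):        # derivative is 0 everywhere except the jump at z = 0
    return np.zeros_like(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def train(act, act_grad, w=0.1, lr=0.5, epochs=20):
    for _ in range(epochs):
        z = w * x
        err = act(z) - y                         # d(loss)/d(y_hat) for squared loss (up to a constant)
        grad_w = np.mean(err * act_grad(z) * x)  # chain rule
        w -= lr * grad_w
    return w

print("step   :", train(step, step_grad))        # weight stays at 0.1
print("sigmoid:", train(sigmoid, sigmoid_grad))  # weight actually moves
```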

As others have answered, the main reason is that it does not work well during backpropagation. However, beyond what others have written, it is worth noting that differentiability everywhere is not a requirement for backpropagation in neural networks, since subderivatives can also be used. See, for example, the ReLU activation function, which is also not differentiable at 0 (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)).
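
As a rough illustration of the subderivative point (the choice of 0 as the value at x = 0 below is just one common convention, not mandated by anything):

```python
import numpy as np

# ReLU is not differentiable at 0, but any value in [0, 1] is a valid
# subderivative there. Picking one (here: 0) is enough for backpropagation.

def relu(x):
    return np.maximum(x, 0.0)

def relu_subgrad(x, at_zero=0.0):
    # 1 for x > 0, 0 for x < 0, and an arbitrary choice in [0, 1] at x == 0
    g = (x > 0).astype(float)
    g[x == 0] = at_zero
    return g

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_subgrad(x))   # [0. 0. 3.] [0. 0. 1.]
```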

Why not use the step function? What is bad about using a step function as the activation function in a neural network?

I assume you mean the Heaviside step function

H(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0. \end{cases}
The key feature of H is not that the gradients are sometimes zero, it's that the gradients are almost always zero.

H has gradient 0 everywhere except at x=0. This means that optimizing a model (e.g. a neural network) using gradient-based methods fails, because the gradient is almost always zero. (Indeed, its derivative is the Dirac delta function.) As a result, the weights will almost never move, because the gradient step has zero length.
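
A quick numerical check of this (the grid and step size below are arbitrary): the finite-difference derivative of H is exactly zero everywhere except at the single point straddling the jump, where it blows up as the step size shrinks.

```python
import numpy as np

# Finite-difference "derivative" of the Heaviside step H on a grid:
# 0 everywhere except at the point straddling the jump, where it grows
# without bound as h shrinks (the Dirac-delta behaviour mentioned above).

def H(x):
    return (x >= 0).astype(float)

x = np.linspace(-2, 2, 9)
h = 1e-3
dH = (H(x + h) - H(x - h)) / (2 * h)
print(dH)   # zeros everywhere except a huge spike at x = 0
```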

Contrasting H with other functions such as the ReLU should make it clear why H is unsuitable, even though other functions with zero-gradient regions can nonetheless succeed.

  • Models using the ReLU have shown marked success, even though the gradient is zero whenever x<0. This is because usually not all inputs land in the zero-gradient region, so weights and biases feeding units with x>0 still update as usual. (However, having 0 gradient "on the left" does give rise to a problem similar to the Heaviside step function's: for some weight configurations the gradient is always zero, so those weights are "stuck" and never updated. This is called the dying ReLU phenomenon; a small made-up demonstration follows this list.)
  • A function having a negligible set where the gradient is not defined is not fatal. The ReLU derivative is not defined at x=0 (though the ReLU function is subdifferentiable), but this is inconsequential, both because (1) it rarely happens that floating point arithmetic gives x=0 exactly and (2) we can just fudge it by using some number in [0,1] as the gradient for that solitary point -- this arbitrary choice does not make an enormous difference to the final model. (Of course, using a smoother function instead of ReLU will avoid this entirely.)
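
As a small illustration of the dying-ReLU point in the first bullet (the weight, bias, and data below are made up so that every pre-activation comes out negative):

```python
import numpy as np

# Made-up illustration of a "dead" ReLU unit: the bias is so negative that
# every pre-activation z = w*x + b is < 0, so relu'(z) = 0 for all inputs
# and gradient descent never updates w or b again.

rng = np.random.default_rng(0)
x = rng.normal(size=1000)          # inputs
w, b = 0.5, -10.0                  # bias pushed far into the negative region

z = w * x + b
relu_grad = (z > 0).astype(float)  # derivative of ReLU at each sample

print("fraction of samples with nonzero gradient:", relu_grad.mean())  # 0.0
```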

What are the effects of using a step function?

  • There are steep shifts from 0 to 1, which may not fit the data well.
  • The network is not differentiable, so gradient-based training is impossible.

In what ways are sigmoid/tanh superior to step?

The sigmoid layer which combines the affine transformation and the nonlinear activation can be written as

\sigma(x) = \frac{1}{1 + \exp(-ax - b)}.
For certain a, we can view σ as a smooth approximation to the step function. A special case of σ with a very large a will behave very similarly to H, in the sense that there is a rapid increase from 0 to 1, just as we have with H.

Likewise, if we need a decreasing function or a constant function, these are also special cases of σ for different values of a. Successful model training will find a,b that achieve a low loss, i.e. choose the parameters which are shallow or steep as required to fit the data well.

But when using σ, we still get differentiability, which is the key to training the network.
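
A small sketch of this (the values of a and b below are arbitrary): the same σ can look like a step, a shallow ramp, or a decreasing curve depending on a, and unlike H it has a usable derivative a·σ(x)·(1−σ(x)), so gradients can flow.

```python
import numpy as np

# sigma(x) = 1 / (1 + exp(-a*x - b)): large a looks like the step function,
# small a is shallow, negative a is decreasing; the derivative
# a * sigma * (1 - sigma) is well defined everywhere.

def sigma(x, a, b=0.0):
    return 1.0 / (1.0 + np.exp(-a * x - b))

def dsigma(x, a, b=0.0):
    s = sigma(x, a, b)
    return a * s * (1.0 - s)

x = np.linspace(-1, 1, 5)
print(np.round(sigma(x, a=50.0), 3))   # ~[0, 0, 0.5, 1, 1]: nearly a step
print(np.round(sigma(x, a=0.5), 3))    # shallow, almost linear around 0.5
print(np.round(sigma(x, a=-5.0), 3))   # decreasing
print(np.round(dsigma(x, a=5.0), 3))   # strictly positive, peaked at the transition
```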


Since we differentiate the activation function in the backpropagation process to find optimal weight values, we need an activation function that is suitable for differentiation.

There are mainly two types of activation functions:

  • Linear functions
  • Non-linear functions

Linear Functions:

1. Identity function: f(x) = x, f'(x) = 1

It is too simple: a network built only from identity activations collapses into a single linear model.

2. Step function: f(x) = 1 if x >= 0, f(x) = 0 if x < 0

It is discontinuous, so it cannot be differentiated at the jump (and its derivative is zero everywhere else).

Also, linear functions only work on linearly separable inputs.

So we use non-linear functions like sigmoid, tanh, and ReLU, which are continuous and have well-behaved derivatives.
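
For concreteness, here is a small sketch of the closed-form derivatives that make these activations convenient for backpropagation (the sample points are arbitrary, and the value used for the ReLU derivative at x = 0 is just a convention):

```python
import numpy as np

# Closed-form derivatives: sigmoid'(x) = s*(1-s), tanh'(x) = 1 - tanh(x)**2,
# relu'(x) = 1 for x > 0 and 0 for x < 0 (with a chosen convention at x = 0).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.round(d_sigmoid(x), 3))  # smooth, never exactly zero
print(np.round(d_tanh(x), 3))     # smooth, never exactly zero
print(d_relu(x))                  # zero for x <= 0, one for x > 0
```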