[Deep Learning] Bi-RNN | GRU | LSTM
https://www.cnblogs.com/zhaopAC/p/10240968.html
Vanishing gradients in gradient-based neural networks (e.g. those trained with back propagation)
This is not a fundamental problem with neural networks; it is a problem with gradient-based learning methods, caused by certain activation functions. Let's try to understand the problem and its cause intuitively.
Cause: many common activation functions (e.g. sigmoid or tanh) 'squash' their input into a very small output range
-> so even a large change in the input produces only a small change in the output; hence the gradient is small.
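A minimal numerical sketch of this squashing effect (my own illustration, not from the original post): for sigmoid, moving the input from 5 to 10 barely changes the output, and the derivative there is tiny.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    # Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0
    return s * (1.0 - s)

# A large change in the input (5 -> 10) barely moves the output...
print(sigmoid(5.0), sigmoid(10.0))        # both very close to 1
# ...so the gradient in that region is tiny
print(sigmoid_grad(5.0), sigmoid_grad(10.0))
```

The same saturation happens with tanh: outside a narrow window around 0, the derivative is nearly zero.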
This becomes much worse when there are multiple layers: each layer's input is mapped into an ever smaller output region.
As a result, even a large change in the parameters of the first layer doesn't change the final output much.
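To see how this compounds with depth, here is a toy sketch (my own assumption: a chain of sigmoid layers with weight 1 and no bias). Backpropagating through the chain multiplies one sigmoid derivative per layer, and since each factor is at most 0.25, the gradient reaching the first layer shrinks roughly geometrically with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

DEPTH = 10

# Forward pass through a deep chain of sigmoid "layers" (weight 1, no bias)
x = 0.5
activations = []
for _ in range(DEPTH):
    x = sigmoid(x)
    activations.append(x)

# Backward pass: the gradient w.r.t. the first layer's input is the
# product of per-layer sigmoid derivatives, each at most 0.25
grad = 1.0
for a in reversed(activations):
    grad *= a * (1.0 - a)

print(grad)  # shrinks toward zero as DEPTH grows
```

With 10 layers the product is already below 0.25^10 ≈ 1e-6, which is the vanishing-gradient effect the text describes.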
How does the GRU address vanishing and exploding gradients? https://www.cnblogs.com/bonelee/p/10475453.html