How Does An LSTM Network Solve The Vanishing Gradient Problem?



Table Of Contents:

  1. What Is The Vanishing Gradient Problem?
  2. Why Does RNN Suffer From The Vanishing Gradient Problem?
  3. How Does LSTM Solve The Vanishing Gradient Problem?

(1) What Is The Vanishing Gradient Problem?

  • The vanishing gradient problem is a challenge that occurs during the training of deep neural networks, particularly in networks with many layers.
  • It arises when the gradients of the loss function, which are used to update the network’s weights, become extremely small as they are backpropagated through the layers of the network.
  • Consider the gradient calculation for a network with only one hidden layer.
  • Even with one hidden layer, the chain rule already multiplies three terms together: dL/dw = (dL/dy) * (dy/dz) * (dz/dw).
  • If you have 10 hidden layers, you will have around 30 different terms multiplied together.
  • When we multiply terms smaller than 1 many times, the result becomes even smaller.
  • As a result, the gradient of the error term becomes negligible.
  • Hence the weight updates either do not happen or happen too slowly.
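The shrinking described above can be seen numerically. The sketch below (my own illustration, not from the article) uses the sigmoid derivative, whose maximum value is 0.25, and raises one typical local derivative to the power of the layer count, assuming for simplicity that every layer contributes the same factor:

```python
import numpy as np

# Sketch: why multiplying many small derivative terms shrinks a gradient.
# The sigmoid derivative never exceeds 0.25, so each layer contributes a
# factor well below 1 during backpropagation.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Local derivative at an illustrative pre-activation value.
local_grad = sigmoid_deriv(1.0)   # roughly 0.197

# Gradient reaching the first layer after backpropagating through n layers,
# assuming (for illustration) the same local derivative at every layer.
for n in (1, 5, 10, 30):
    print(n, local_grad ** n)
```

After 10 layers the surviving gradient is already below 1e-6, which is why the early layers barely learn.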

(2) Why Does RNN Suffer From The Vanishing Gradient Problem?

  • Suppose you are passing a sequence of 10 words.
  • In an RNN we need to learn two different weight matrices and one bias.
  • Wh and Wx are the weights.
  • b is the bias term.
  • The hidden state at each time step is computed with the standard recurrence h_t = tanh(Wh * h_(t-1) + Wx * x_t + b).
  • The point to note is that the weights are not updated after every word; the new weights are calculated once you have passed all the words in the sequence.
  • If you have 10 words in a sequence, the new weights are calculated only after the gradients for all 10 time steps have been accumulated.
  • The total gradient is the sum of the individual per-time-step gradients: dL/dWh = dL1/dWh + dL2/dWh + … + dL10/dWh.
  • You pass a single word, compute the loss, and calculate the gradient for that time step.
  • If you have two words, you sum up two gradients, and so it continues.
  • Now look at the gradient of L10 individually.
  • By the chain rule, dL10/dWh contains long products of terms such as dL10/dh10 * dh10/dh9 * … * dh_k/dWh.
  • Once any part of this product contains small terms, the entire product becomes small when multiplied together.
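The accumulation over 10 time steps can be sketched as follows. This is my own illustration, not the article's code: each Jacobian dh_j/dh_(j-1) is replaced by a single scalar factor below 1 so the effect of path length is easy to see:

```python
# Sketch (illustrative, not from the article): contributions to the BPTT
# gradient of the step-10 loss from each earlier time step k. The path from
# step k back to step 10 multiplies (10 - k) Jacobian-like factors; here each
# factor is an assumed scalar 0.5 instead of a real Jacobian.

T = 10
factor = 0.5   # stand-in for |dh_j/dh_(j-1)|, assumed < 1

# One term per earlier step k: the longer the path back, the smaller the term.
contributions = [factor ** (T - k) for k in range(1, T + 1)]
total_grad = sum(contributions)

print(contributions[0])    # path from step 1: 0.5**9, nearly zero
print(contributions[-1])   # path from step 10: 0.5**0 = 1.0
```

The first word's contribution is roughly 500 times smaller than the last word's, which is exactly the "earlier words get forgotten" effect described in the note below.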

Note:

  • In RNN, the long-term memory issue happens because earlier words are forgotten by the network due to the vanishing gradient.
  • For the word 10 steps back, the gradient we calculate is too small, close to zero, because of the repeated multiplication.
  • Hence its contribution toward the weight update is effectively forgotten.

(3) How Does LSTM Solve The Vanishing Gradient Problem?

  • LSTM implements three different gates.
  • LSTM maintains two different states: one is the Cell State and the other is the Hidden State.
  • The cell state is used to store long-term memory.
  • The hidden state is used to hold short-term memory.
  • It also has a Forget Gate, an Input Gate, and an Output Gate.
  • Unless the Forget Gate tells the network to forget some words, the long-term cell state will keep those words till the end.
  • Hence, in this way, LSTM remembers long-term context easily.

Does LSTM Also Use The Chain Rule? Will The Gradients Become Small Again For Longer Sequences?

  • Yes, LSTMs use the chain rule for backpropagation through time (BPTT), which means that the gradients are computed by propagating errors backward through all time steps in the sequence.
  • While LSTMs mitigate the vanishing gradient problem better than vanilla RNNs, the issue can still occur for very long sequences.
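One way to see the residual risk: along the cell-state path, the gradient through T steps scales roughly like the product of the forget-gate activations f_1 * f_2 * … * f_T. The gate values below are illustrative, not measured:

```python
# Sketch: gradient scaling along the LSTM cell-state path over T steps,
# approximated as f ** T for a constant forget-gate activation f (assumed
# values; real gates vary per step).

for f, T in [(0.99, 100), (0.99, 1000), (0.60, 100)]:
    print(f, T, f ** T)
```

With gates saturated near 1 the gradient survives hundreds of steps, but at f = 0.6 it is gone within a hundred steps, and even f = 0.99 eventually decays over thousands of steps.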
