Transformers – Layer Normalization



Table Of Contents:

  1. What Is Normalization?
  2. What Is Batch Normalization?
  3. Why Batch Normalization Does Not Work On Sequential Data?
  4. How Layer Normalization Solves The Problem Of Batch Normalization?

(1) What Is Normalization?

What Are We Normalizing?
  • Generally you normalize the input values that you pass to the neural network, and you can also normalize the output of a hidden layer.
  • We normalize the hidden-layer output because the hidden layer may produce numbers over a large range, so we need to bring them into a common range (a small sketch follows this list).
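
Here is a minimal sketch of the idea with assumed toy numbers (NumPy is used purely for illustration; it is not part of the original tutorial):

    import numpy as np

    # Assumed toy values: raw numbers with a wide range, e.g. network inputs
    # or the outputs of a hidden layer.
    x = np.array([2.0, 3.0, 40.0, 500.0, 7.0])

    # Standardize: subtract the mean and divide by the standard deviation.
    x_norm = (x - x.mean()) / (x.std() + 1e-5)

    print(x_norm)  # the values now lie in a small, comparable range
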
Benefits Of Normalization:
  • Normalization keeps the values in a stable, comparable range, which generally makes training faster and less sensitive to the weight initialization and learning rate.

(2) What Is Batch Normalization?

Batch Normalization is covered in detail in a separate tutorial: https://www.praudyog.com/deep-learning-tutorials/transformers-batch-normalization/

(3) Why Batch Normalization Does Not Work On Sequential Data?

  • Suppose we have f1 and f2 as input features, and we will train our neural network on this input.
  • Let's apply Batch Normalization to the output of the first layer and see what the problem is.
  • We will send data in batches to this neural network.
  • Let's take Batch Size = 5.
  • First we will send the first row's record, which is (2, 3).
  • Let's focus on the first node and its two incoming weights (w1, w2).
  • We will calculate the activation value of the 3 nodes for every input row.
  • We will have the output for the first node.
  • Similarly, we will calculate the output for the other 2 nodes.
  • We will have (Z1, Z2, Z3) as the output for every input row.
  • Now we need to normalize the Z1, Z2 and Z3 values.
  • We will calculate the mean and standard deviation of each output across the batch and normalize them.
  • We will apply batch normalization to every value of (Z1, Z2 and Z3).
  • We will apply the learnable parameters gamma and beta (a rough sketch of this computation follows the list).
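
Below is a rough sketch of this computation. The actual numbers from the tutorial's images are not available here, so the activation matrix Z (batch size 5, 3 nodes) is filled with assumed toy values:

    import numpy as np

    # Assumed toy activations: 5 rows in the batch, 3 hidden nodes,
    # so the columns are Z1, Z2 and Z3.
    Z = np.array([[7.0, 2.0, 5.0],
                  [5.0, 3.0, 6.0],
                  [4.0, 4.0, 2.0],
                  [9.0, 1.0, 3.0],
                  [6.0, 5.0, 4.0]])

    eps = 1e-5
    mu = Z.mean(axis=0)    # mean of each node's output across the batch (column direction)
    var = Z.var(axis=0)    # variance of each node's output across the batch
    Z_hat = (Z - mu) / np.sqrt(var + eps)

    # Learnable parameters, one pair per node; initialised to the identity here.
    gamma = np.ones(3)
    beta = np.zeros(3)
    out = gamma * Z_hat + beta
    print(out)
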
Why Does Batch Normalization Not Work On Sequential Data?
  • In the Transformer architecture, normalization is applied to the output of the Self Attention mechanism.
  • Let us see what the output of Self Attention looks like if we pass sentences as input.
  • At a time, I will send 2 sentences to my Self Attention model.
  • Sentence-1: Hi Nitish
  • Sentence-2: How Are You Today.
  • Hence my Batch Size = 2.
  • First I will calculate the embedding vector for each word, as shown in the image below.
  • The problem here is that the 1st sentence has only 2 words while the 2nd sentence has 4 words.
  • You have to make sure the length of each sentence is the same.
  • You will use the concept of padding to make the sentences the same length.
  • After padding, your embedding vectors will look like this.
  • Now the input to the Self Attention will look like this.
  • We need to calculate the contextual embedding of each word in a sentence using Self Attention.
  • Now we will simplify it and convert it into a matrix format.
  • It will be passed at the same time to my Self Attention model.
  • Each individual matrix has shape 4×3, and we have 2 matrices.
  • We will combine them into a tensor of shape (2, 4, 3) and pass it as input.
  • In the output we will also receive a (2, 4, 3) tensor.
  • In the output you have the contextual embedding of each word, e.g. for Hi = [0.2, 0.45, 0.71].
  • Since the output numbers come from different distributions, we need to apply a normalization technique to them.
  • Let's apply Batch Normalization to the output of the Self Attention model.
  • We will stack the two matrices one after another and apply batch normalization to them.
  • We will have 3 different dimensions d1, d2 and d3.
  • We will normalize in the column direction.
  • We will calculate the mean and variance of each dimension of the output vector.
  • We will also have the shifting and scaling learnable parameters for each individual dimension (see the sketch after this list).
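
A small sketch of what this looks like in code. The real embedding values from the images are not reproduced here, so the (2, 4, 3) Self Attention output is filled with assumed random values:

    import numpy as np

    # Assumed output of Self Attention: batch = 2 sentences, 4 positions, 3 dimensions.
    np.random.seed(0)
    out = np.random.rand(2, 4, 3)

    eps = 1e-5
    # Stack the two 4x3 matrices and normalize each of the 3 dimensions (d1, d2, d3)
    # in the column direction, i.e. across the batch and the sequence positions.
    mu = out.mean(axis=(0, 1))     # shape (3,)
    var = out.var(axis=(0, 1))     # shape (3,)
    out_hat = (out - mu) / np.sqrt(var + eps)

    # One learnable scale/shift pair per dimension.
    gamma = np.ones(3)
    beta = np.zeros(3)
    bn_out = gamma * out_hat + beta
    print(bn_out.shape)            # (2, 4, 3)
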
Problem With Batch Normalization:
  • We have taken 2 sentences, and the longer one has only 4 words in it.
  • Now suppose my Batch Size = 32, the longest sentence has 100 words in it, and the other sentences have around 30 words on average.
  • The consequence is that there will be a lot of padding zeros in the sentences having around 30 words.
  • If you calculate the mean and standard deviation of a particular column, you will see a lot of zeros in that column, so the mean and standard deviation will not be a true representation of that column.
  • This is because we have artificially added zeros to that column to match the lengths of the input sentences.
  • Hence we can't apply Batch Normalization to this matrix to normalize the values (a small numeric illustration follows this list).
  • You can apply Batch Normalization easily to normal tabular data, but you can't apply it to sequential data.
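
A tiny numeric illustration of the skew, using assumed embedding values (the [0.2, 0.45, 0.71] vector is taken from the example above) plus artificial padding rows:

    import numpy as np

    # Assumed embeddings of the real tokens in a short sentence.
    real = np.array([[0.20, 0.45, 0.71],
                     [0.35, 0.60, 0.55]])

    # Pad the sentence with 6 all-zero positions so it matches a longer sentence.
    padded = np.vstack([real, np.zeros((6, 3))])

    print(real.mean(axis=0))    # statistics of the actual words
    print(padded.mean(axis=0))  # dragged towards zero by the artificial padding
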

(4) How Layer Normalization Solves The Problem Of Batch Normalization?

  • We will see how we can apply Layer Normalization to the same input data, with (Z1, Z2, Z3) as the activation outputs from the first layer.
  • When you do Batch Normalization you normalize across the column direction, but when you do Layer Normalization you normalize across the row.
  • You will calculate the mean and standard deviation across the row.
  • In the first row we calculate them for (7, 5, 4), in the second row for (2, 3, 4), and so on.
  • We will normalize each value across the row and apply the learnable parameters gamma and beta to each value.
  • Let's explore why it is more logical to apply Layer Normalization in the Transformer.
  • Let's apply Layer Normalization to the Self Attention output embedding vectors.
  • Here we will calculate the mean and standard deviation across the row.
  • The learnable parameters gamma and beta will have a value for each embedding dimension.
  • Finally, we will normalize the individual values using the mean and variance of that row and then apply the gamma and beta values (a sketch follows this list).
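
A minimal sketch of Layer Normalization on the same assumed (2, 4, 3) tensor. Note that the only change compared to the Batch Normalization sketch is the axis over which the mean and variance are computed:

    import numpy as np

    # Assumed output of Self Attention: batch = 2 sentences, 4 positions, 3 dimensions.
    np.random.seed(0)
    out = np.random.rand(2, 4, 3)

    eps = 1e-5
    # Normalize across the row, i.e. across the embedding dimensions of each word.
    mu = out.mean(axis=-1, keepdims=True)    # shape (2, 4, 1)
    var = out.var(axis=-1, keepdims=True)
    out_hat = (out - mu) / np.sqrt(var + eps)

    # One learnable scale/shift value per embedding dimension.
    gamma = np.ones(3)
    beta = np.zeros(3)
    ln_out = gamma * out_hat + beta
    print(ln_out.shape)                      # (2, 4, 3)
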
What's The Benefit We Get With Layer Normalization?
  • When we do Layer Normalization across the rows, the zeros present in the other (padded) rows have no effect, because each row is normalized independently.
  • Hence the normalization will be effective and accurate compared to Batch Normalization (a quick check follows below).
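
A quick check of this claim, with the contextual embedding of "Hi" taken from the example above (the padding rows are assumed): the Layer-Normalized result of a real token's row is the same whether or not padded rows are present in the batch.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalize each row (last axis) independently.
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    hi = np.array([[0.20, 0.45, 0.71]])                            # embedding of "Hi"

    alone = layer_norm(hi)                                         # no padding at all
    with_padding = layer_norm(np.vstack([hi, np.zeros((3, 3))]))[:1]

    print(np.allclose(alone, with_padding))                        # True
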
