Transformer – Transformer Prediction
Table Of Contents:
- Prediction Setup For Transformer.
- Step By Step Flow Of Input Sentence, “We Are Friends!”.
- Decoder Processing For Other Timesteps
(1) Prediction Setup For Transformer.
Input Dataset:
- For simplicity we will take these 3 rows as input, but in reality we would have thousands of rows.
- We will use this dataset to train our Transformer model.
Query Sentence:
- We will pass this sentence for translation:
- Sentence = “We Are Friends !”
(2) Step By Step Flow Of Input Sentence, “We Are Friends!”.
- The Transformer is mainly divided into an Encoder and a Decoder.
- The Encoder behaves the same way during training and during prediction.
- The main difference is in the Decoder: it behaves differently at training time and at inference time.
- At training time we have the entire output sequence in the dataset, so we can send it to the Decoder all at once.
- But during prediction we have no prior information about what the future sequence will be, so we have to process the input sequentially.
- During training our Decoder behaves in a non-autoregressive way, but during prediction it has to behave in an autoregressive way.
- At prediction time we first send the <SOS> (start-of-sequence) token, so the Decoder knows it is time to start predicting.
Step-1: Encoder Query Processing
- First the query sentence “We are friends” goes to the Encoder block.
- We stack 6 such encoder blocks; the output of one encoder goes as input to the next.
- The encoder blocks do the processing and produce contextual word embeddings for the individual words.
- The Decoder now has a summary of the query sentence (a minimal sketch of this stacking follows).
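To make the stacking concrete, here is a minimal NumPy sketch of 6 chained encoder blocks. It assumes single-head attention, omits layer norm and biases, and uses random weights in place of trained ones; the point is only the shape of the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512   # embedding size; the article uses 512 throughout

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(x, params):
    """One (simplified) encoder block: self-attention + feed-forward.
    Layer norm and multiple heads are omitted to keep the sketch short."""
    Wq, Wk, Wv, W1, W2 = params
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V   # contextual mixing
    h = x + attn                                     # residual connection
    return h + np.maximum(0, h @ W1) @ W2            # feed-forward + residual

# Toy input: 3 word embeddings for "We", "are", "friends"
x = rng.normal(size=(3, d_model))

# Chain 6 blocks: the output of one encoder feeds the next.
for _ in range(6):
    params = [rng.normal(size=(d_model, d_model)) * 0.01 for _ in range(3)]
    params += [rng.normal(size=(d_model, 4 * d_model)) * 0.01,
               rng.normal(size=(4 * d_model, d_model)) * 0.01]
    x = encoder_block(x, params)

print(x.shape)   # (3, 512): one contextual embedding per input word
```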
Step-2: Decoder Query Processing
- Here you can notice that a single encoder pass handles all 3 words “We are friends” at a time. This shows the Encoder is non-autoregressive in nature.
- But we can see 4 Decoder timesteps, one per output token, working to produce “Hum Dost He”. This shows the Decoder is autoregressive in nature.
End To End Flow Of The Decoder Architecture:
Understanding Stage -1:
- Step-1: To start the decoding process we send <SOS> as the input token to the Decoder.
- Step-2: The machine cannot work with this token directly, so we convert it into an embedding vector by passing it to the Embedding layer, which produces a 512-dimensional embedding vector.
- Step-3: Generate the positional embedding vector for the input token <SOS>.
- Step-4: Add this positional embedding to the word embedding to form a new vector called X1.
- Step-5: Finally X1 goes to the Decoder as input (see the sketch below).
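A small NumPy sketch of Stage-1, assuming a toy vocabulary and a random (untrained) embedding table; the positional encoding is the sinusoidal one from the original Transformer paper.

```python
import numpy as np

d_model = 512
vocab = {"<SOS>": 0, "Hum": 1, "Dost": 2, "He": 3, "<End>": 4}  # toy vocab (assumed)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

def positional_encoding(pos, d_model):
    # Sinusoidal positional encoding from "Attention Is All You Need".
    pe = np.zeros(d_model)
    i = np.arange(0, d_model, 2)
    angle = pos / (10000 ** (i / d_model))
    pe[0::2] = np.sin(angle)
    pe[1::2] = np.cos(angle)
    return pe

# Steps 2-4: embed <SOS>, add its positional encoding, get X1.
word_emb = embedding_table[vocab["<SOS>"]]
X1 = word_emb + positional_encoding(pos=0, d_model=d_model)
print(X1.shape)   # (512,): the vector that enters the Decoder
```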
Understanding Stage -2:
Step-1: Masked Multi Head Attention
- Step-1: Feed the output from Stage-1, which is X1, to the Decoder as input.
- Step-2: The first operation is Masked Multi-Head Attention.
- Step-3: Next we need the Query (Wq), Key (Wk) and Value (Wv) weight matrices, which are learned during training.
- Step-4: We take the dot product of X1 with Wq, Wk and Wv to get the Query, Key and Value vectors for the token <SOS>.
- Step-5: We take the dot product of the Query and Key vectors to get the attention score as a scalar quantity (say 0.9). This 0.9 represents the similarity of the <SOS> token with itself. (In the full mechanism this score is also scaled by √dk and passed through a softmax before it is used.)
- Step-6: We multiply this scalar value with the Value vector to get the contextual embedding vector of the token <SOS>; call it Z1 (a minimal sketch follows).
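A minimal NumPy sketch of masked (causal) self-attention, assuming single-head attention and random weights. With only <SOS> present the mask hides nothing, and the softmax turns the single self-similarity score into a weight of 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.01 for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_model)              # scaled dot products
    T = X.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                         # mask out future tokens
    return softmax(scores) @ V                       # weighted sum of Values

X1 = rng.normal(size=(1, d_model))                   # just <SOS> at timestep 1
Z1 = masked_self_attention(X1)
print(Z1.shape)   # (1, 512): contextual embedding of <SOS>
```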
Step-2: Addition & Normalization
- Step-1: The next step here is Add & Norm.
- Step-2: We add our original input vector (X1) with the attention output vector (Z1) to retain some of the original information in the output (a residual connection). The resulting vector is Z1′.
- Step-3: Now we normalize the output vector Z1′ (layer normalization) to stabilize the prediction process.
- Step-4: Finally we get the Znorm vector as output (a tiny sketch follows).
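A tiny sketch of Add & Norm, using stand-in random vectors for X1 and Z1; the learned scale and shift of layer normalization are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(1, 512))   # input to the attention sub-layer
Z1 = rng.normal(size=(1, 512))   # attention output (stand-in values)

def layer_norm(z, eps=1e-5):
    # Normalize across the feature dimension; learned scale/shift omitted.
    mu, var = z.mean(-1, keepdims=True), z.var(-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

Z1_prime = X1 + Z1                # residual add: keep some of the original X1
Z_norm = layer_norm(Z1_prime)     # stabilize values before the next sub-layer
print(Z_norm.mean(), Z_norm.std())   # ~0 mean, ~1 std per vector
```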
Understanding Stage -3:
- Now we will understand the Cross Attention and the Add & Norm layer.
- Here we find the similarity of <SOS> with the individual words of the encoder input sentence: “We”, “are”, “friends”.
- Step-1: We take the output from Stage-2, which is Znorm, and pass it to the Cross Attention block.
- Step-2: In the Cross Attention block we have Wq, Wk and Wv matrices. These matrices are learned during the training process.
- Step-3: Normally we generate the Query, Key and Value vectors from a single input, which in our case would be <SOS>, but in the cross attention mechanism we use two different inputs: one from the decoder output and one from the encoder output.
- Step-4: We take the dot product of the decoder output Znorm with Wq (the Query matrix), and as a result we get the Query vector.
- Step-5: We take the dot product of the encoder outputs for “We”, “are”, “friends” with Wk (the Key matrix) and Wv (the Value matrix).
- Step-6: Each word of the encoder output “We”, “are”, “friends” gets multiplied with both the Wk and Wv matrices. “We” gets multiplied with Wk and Wv, resulting in the Key(We) and Value(We) vectors.
- Step-7: Similarly we multiply the “are” vector with the Wk and Wv matrices, resulting in the Key(are) and Value(are) vectors.
- Step-8: Now the decoder token has its Query vector and each encoder word has its Key and Value vectors. We take the dot product of the Query with each Key vector to get the attention scores.
- Step-9: Perform the dot product of the Query vector Q(<SOS>) with the encoder’s Key vector of each word: Q(<SOS>) · K(We) = W1, Q(<SOS>) · K(are) = W2, Q(<SOS>) · K(friends) = W3. We get the weights (W1, W2, W3) as SCALAR values. (In the full mechanism these are scaled and passed through a softmax so they sum to 1.)
- Step-10: Now we multiply the weights (W1, W2, W3) with the Value vectors (V(We), V(are), V(friends)) of the encoder output words and sum the results.
- Step-11: As an output we get the cross attention contextual vector (see the sketch after this list).
- Step-12: Finally we do the Add & Norm operation: add the input to this Stage-3, which is Znorm, with the cross attention output, and normalize.
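Here is a minimal NumPy sketch of the cross attention step, assuming single-head attention and random weights: the Query comes from Znorm, while the Keys and Values come from the 3 encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.01 for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(Z_norm, enc_out):
    # Query from the decoder; Key/Value from the encoder output.
    Q = Z_norm @ Wq                      # (1, 512): query for <SOS>
    K = enc_out @ Wk                     # (3, 512): keys for "We","are","friends"
    V = enc_out @ Wv                     # (3, 512): values for the same words
    scores = Q @ K.T / np.sqrt(d_model)  # the W1, W2, W3 of the walkthrough
    weights = softmax(scores)            # normalized attention weights
    return weights @ V                   # weighted sum of encoder Values

Z_norm  = rng.normal(size=(1, d_model))   # decoder output from Stage-2
enc_out = rng.normal(size=(3, d_model))   # contextual embeddings of the 3 words
ctx = cross_attention(Z_norm, enc_out)
print(ctx.shape)   # (1, 512): cross attention contextual vector
```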
Understanding Stage -4:
- Now we will understand the Feed Forward layer and the same Add & Norm layer.
- The feed forward neural network adds some non-linearity to the embedding vector (a short sketch follows).
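A short sketch of the position-wise feed-forward network, assuming the usual 512 → 2048 → 512 sizes from the original paper and omitting biases; the ReLU in the middle is what adds the non-linearity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048    # the original paper expands 512 -> 2048 -> 512
W1 = rng.normal(size=(d_model, d_ff)) * 0.01
W2 = rng.normal(size=(d_ff, d_model)) * 0.01

def feed_forward(x):
    # Position-wise FFN: ReLU between two linear maps adds non-linearity.
    return np.maximum(0, x @ W1) @ W2

x = rng.normal(size=(1, d_model))
print(feed_forward(x).shape)   # (1, 512)
```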
Understanding Stage -5:
- Pass the result of the 1st decoder to the remaining 5 decoders in sequential fashion, just as we chained the encoder blocks.
Understanding Stage -6:
- Now the final step is to make the prediction. For this we use a Linear layer and a Softmax layer.
- The number of neurons in the Linear layer equals the number of unique words (the vocabulary size) of the Hindi sentences in your dataset.
- If unique words = 1000, then V = 1000 neurons.
- The input to this Linear layer has dimension (1 × 512).
- The weight matrix of the Linear layer has dimension (512 × 1000).
- We take the dot product: (1 × 512) · (512 × 1000) = (1 × 1000) as the output.
- This (1 × 1000) vector can contain values in any range (these are raw logits).
- Each individual value corresponds to one unique word in the vocabulary.
- We normalize this (1 × 1000) vector into probabilities by passing it through the Softmax layer.
- The node with the maximum probability is taken as our output token (see the sketch after this list).
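The whole Stage-6 computation is two lines of linear algebra. A sketch with the (1 × 512) · (512 × 1000) shapes from above, using random weights in place of trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, V = 512, 1000                       # V = Hindi vocabulary size
W = rng.normal(size=(d_model, V)) * 0.01     # Linear layer weights (learned)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x = rng.normal(size=(1, d_model))    # final decoder output for the last token
logits = x @ W                       # (1, 512) . (512, 1000) -> (1, 1000)
probs = softmax(logits)              # one probability per vocabulary word
predicted_id = int(probs.argmax())   # node with maximum probability
print(logits.shape, predicted_id)
```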
(3) Decoder Processing For Other Timesteps
- The output from the 1st timestep, “Hum”, together with <SOS>, is sent as the input to the 2nd timestep.
- Similar processing happens, just as we have seen above.
- One difference is that at the end we pass only the final position’s embedding, the one for “Hum”, to the Linear layer, not the embeddings of both <SOS> and “Hum”.
- At prediction time we also apply the masking operation, to stay consistent with training, even though the future tokens do not exist yet at inference. A sketch of this decoding loop follows.
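Putting it together, here is a minimal sketch of the greedy autoregressive decoding loop. The decoder_step function is a hypothetical stand-in (it only embeds and projects, with random weights); a real step would also run the masked self-attention and cross attention from the stages above. The key point matches the note above: only the last position’s probabilities pick the next token.

```python
import numpy as np

rng = np.random.default_rng(0)
id2word = ["<SOS>", "Hum", "Dost", "He", "<End>"]   # toy Hindi vocab (assumed)
V, d_model = len(id2word), 512
emb = rng.normal(size=(V, d_model))
W_out = rng.normal(size=(d_model, V)) * 0.01

def decoder_step(ids):
    """Stand-in for the full decoder stack: embeds all tokens generated so
    far and returns one probability row per position. A real decoder would
    also run masked self-attention and cross attention over the encoder
    output before the final Linear + Softmax."""
    x = emb[ids]                          # (t, 512)
    logits = x @ W_out                    # (t, V)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

ids = [0]                                 # start with <SOS>
for _ in range(10):
    probs = decoder_step(ids)
    next_id = int(probs[-1].argmax())     # use ONLY the last position's output
    ids.append(next_id)
    if id2word[next_id] == "<End>":       # stop once the end token appears
        break
print([id2word[i] for i in ids])
```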

