-

Transformers – Syllabus
-

Transformers
Transformers – Syllabus
Encoder Decoder Architecture
What Is Attention Mechanism?
Bahdanau Attention Vs Luong Attention!
-

Transformer – Prediction Process
Table Of Contents: Prediction Setup Of The Transformer. Step-By-Step Flow Of The Input Sentence "We Are Friends!". Decoder Processing For The Other Timesteps.
(1) Prediction Setup For The Transformer. Input Dataset: For simplicity we take these 3 rows as input, but in reality we would have thousands of rows. We use this dataset to train our Transformer model. Query Sentence: We pass this sentence for translation, Sentence = "We Are Friends!"
(2) Step-By-Step Flow Of The Input Sentence "We Are Friends!". The Transformer is mainly divided into an Encoder and a Decoder. The Encoder will …
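Below is a minimal NumPy sketch of this prediction loop, assuming a toy target vocabulary and random stand-in functions in place of the real encoder and decoder stacks (encode and decode_step here are placeholders, not the article's code): the encoder runs once over "We Are Friends!", and the decoder then generates one token per timestep, feeding each predicted token back in until <eos> is produced.

import numpy as np

# Hypothetical toy target vocabulary (an assumption for the sketch, not from the article).
vocab = ["<sos>", "<eos>", "hum", "dost", "hain", "!"]
token_to_id = {t: i for i, t in enumerate(vocab)}

def encode(source_tokens):
    """Stand-in for the full encoder stack: returns one random 'context' vector per source token."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(source_tokens), 8))

def decode_step(encoder_output, generated_ids):
    """Stand-in for the decoder stack + softmax: returns toy scores over the vocabulary."""
    rng = np.random.default_rng(len(generated_ids))  # deterministic toy scores per timestep
    return rng.normal(size=len(vocab))

source = ["We", "Are", "Friends", "!"]
memory = encode(source)                      # the encoder runs ONCE for the whole sentence

generated = [token_to_id["<sos>"]]           # the decoder starts from the <sos> token
for _ in range(10):                          # generate one token per timestep
    scores = decode_step(memory, generated)
    next_id = int(np.argmax(scores))         # greedy choice of the next token
    generated.append(next_id)
    if next_id == token_to_id["<eos>"]:      # stop once <eos> is produced
        break

print([vocab[i] for i in generated])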
-

Transformer – Decoder Architecture
Table Of Contents: What Is The Work Of The Decoder In A Transformer? Overall Decoder Architecture. Understanding The Decoder Workflow With An Example. Understanding The Decoder's 2nd Part.
(1) What Is The Work Of The Decoder In A Transformer? In a Transformer model, the Decoder plays a crucial role in generating the output sequence from the encoded input. It is mainly used in sequence-to-sequence (Seq2Seq) tasks such as machine translation, text generation, and summarization.
(2) Overall Decoder Architecture. In the original Transformer paper there are 6 decoder modules connected in series. The output from one decoder module is …
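As a rough illustration of how the 6 decoder modules are chained, here is a small NumPy sketch; decoder_block is a random stand-in for a real decoder module (masked self-attention, cross-attention and the feed-forward network are collapsed into a toy transformation), so only the series connection is meaningful.

import numpy as np

def decoder_block(x, memory):
    """Stand-in for one decoder module: a fixed random linear map plus a toy use of the
    encoder output, so the stacking pattern is visible without real learned weights."""
    rng = np.random.default_rng(42)
    w = rng.normal(size=(x.shape[-1], x.shape[-1])) / np.sqrt(x.shape[-1])
    return x @ w + memory.mean(axis=0)

d_model = 8
memory = np.ones((4, d_model))                # encoder output for a 4-token source sentence
x = np.zeros((3, d_model))                    # embeddings of the 3 target tokens so far

# The original paper stacks 6 identical decoder modules in series:
for _ in range(6):
    x = decoder_block(x, memory)              # the output of one module is the input of the next

print(x.shape)   # (3, 8): one contextual vector per target token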
-

Transformers – Cross Attention
Table Of Contents: Where Is The Cross Attention Block Applied In Transformers? What Is Cross Attention? How Does Cross Attention Work? Where Do We Use The Cross Attention Mechanism?
(1) Where Is The Cross Attention Block Applied In Transformers? In the diagram above you can see that this Multi-Head Attention block is known as "Cross Attention". The difference from the other "Multi-Head Attention" blocks is that in those blocks the Query, Key and Value vectors are generated from a single source, whereas in the Cross Attention block the Query vector comes from the Decoder block and …
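A minimal NumPy sketch of cross attention, with random projection matrices standing in for learned weights: the Query is projected from the decoder states while the Key and Value are projected from the encoder output, so each target token can pull in information from every source token.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 8
rng = np.random.default_rng(0)

decoder_states = rng.normal(size=(3, d_model))   # 3 target tokens (from the Decoder)
encoder_states = rng.normal(size=(5, d_model))   # 5 source tokens (from the Encoder)

# Hypothetical projection matrices (randomly initialised for the sketch).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = decoder_states @ W_q        # Query comes from the Decoder block
K = encoder_states @ W_k        # Key   comes from the Encoder output
V = encoder_states @ W_v        # Value comes from the Encoder output

scores = Q @ K.T / np.sqrt(d_model)      # (3, 5): each target token attends to every source token
weights = softmax(scores, axis=-1)
context = weights @ V                    # (3, 8): source information pulled into the decoder

print(weights.shape, context.shape)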
-

Transformer – Masked Self Attention
Table Of Contents: Transformer Decoder Definition. What Is An Autoregressive Model? Let's Prove The Transformer Decoder Definition. How To Implement The Parallel Processing Logic While Training The Transformer Decoder? Implementing Masked Self Attention.
(1) Transformer Decoder Definition. From the above definition we can understand that the Transformer behaves autoregressively during prediction and non-autoregressively during training. This is displayed in the diagram below.
(2) What Is An Autoregressive Model? Suppose you are building a Machine Learning model whose job is to predict the stock price: on Monday it predicted 29, on Tuesday 25, …
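Here is a small NumPy sketch of masked self attention with random weights and toy sizes: a causal (upper-triangular) mask pushes the scores for future positions to a very large negative value before the softmax, so during training every position can be processed in parallel while still attending only to earlier tokens.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, seq_len = 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))          # embeddings of 4 target tokens

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # (4, 4): similarity of every token with every token

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -1e9, scores)            # future positions get a huge negative score

weights = softmax(scores, axis=-1)               # the upper triangle becomes ~0 after softmax
print(np.round(weights, 2))                      # rows sum to 1; no attention to future tokens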
-

Transformers – Encoder Architecture
Table Of Contents: What Is The Encoder In A Transformer? Internal Workings Of The Encoder Module. How The Encoder Module Works With An Example. Why Do We Add The Original Input Back Again In The Encoder Module?
(1) What Is The Encoder In A Transformer? In a Transformer model, the Encoder is responsible for processing the input data (like a sentence) and transforming it into a meaningful contextual representation that can be used by the Decoder (in tasks like translation) or directly for classification. Encoding is necessary because it: transforms words into a numerical format (embeddings); allows self-attention to analyze relationships between words; adds …
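A compact NumPy sketch of one encoder layer under toy dimensions and random weights: self-attention followed by adding the original input back (the residual connection) and layer normalisation, then a position-wise feed-forward network with its own residual connection.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x, d_model, rng):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

d_model = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))            # embeddings (+ positional encoding) of 5 input words

# Sub-layer 1: self-attention, then ADD the original input back, then normalise.
attn_out = self_attention(x, d_model, rng)
x = layer_norm(x + attn_out)                 # the residual addition keeps the original word information

# Sub-layer 2: position-wise feed-forward network with its own residual connection.
W1, W2 = rng.normal(size=(d_model, 16)), rng.normal(size=(16, d_model))
ffn_out = np.maximum(0, x @ W1) @ W2         # ReLU feed-forward
x = layer_norm(x + ffn_out)

print(x.shape)   # (5, 8): one contextual representation per input word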
-

Transformers – Layer Normalization
Table Of Contents: What Is Normalization? What Is Batch Normalization? Why Does Batch Normalization Not Work On Sequential Data?
(1) What Is Normalization? What Are We Normalizing? Generally you normalize the input values that you pass to the neural network, and you can also normalize the output of a hidden layer. We normalize the hidden-layer output because the hidden layer may again produce a large range of numbers, so we need to normalize them to bring them into a common range. Benefits Of Normalization.
(2) What Is Batch Normalization? https://www.praudyog.com/deep-learning-tutorials/transformers-batch-normalization/
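The difference between the two schemes is easiest to see on a small tensor. In the NumPy sketch below (toy shapes, random data), Batch Normalization computes its statistics across the batch and token positions for each feature, while Layer Normalization computes them within the feature vector of each individual token, which is why it stays well defined for variable-length sequences.

import numpy as np

rng = np.random.default_rng(0)
# A batch of 2 sentences, each with 4 tokens and 8 features per token.
x = rng.normal(loc=5.0, scale=3.0, size=(2, 4, 8))

# Batch Normalization: statistics are computed ACROSS the batch for each feature.
# With variable-length, padded sequences these statistics get distorted, which is
# why it is a poor fit for sequential data.
bn = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-5)

# Layer Normalization: statistics are computed WITHIN each token's feature vector,
# independently of the other sentences in the batch.
ln = (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-5)

print(np.round(bn[:, :, 0].mean(), 6), np.round(bn[:, :, 0].std(), 6))  # ~0 and ~1 per feature, across the batch
print(np.round(ln[0, 0].mean(), 6), np.round(ln[0, 0].std(), 6))        # ~0 and ~1 per token, within its features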
-

Transformers – Positional Encoding
Table Of Contents: What Is Positional Encoding In Transformers? Why Do We Need Positional Encoding? How Does Positional Encoding Work? Positional Encoding In The "Attention Is All You Need" Paper. Interesting Observations In The Sine & Cosine Curves. How Positional Encoding Captures The Relative Position Of Words.
(1) What Is Positional Encoding In A Transformer? Positional Encoding is a technique used in Transformers to add order (position) information to input sequences. Since Transformers do not have built-in sequence awareness (unlike RNNs), they use positional encodings to help the model understand the order of words in a sentence.
(2) …
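A short NumPy sketch of the sinusoidal scheme from the "Attention Is All You Need" paper, using toy sizes: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine, producing one position vector per word that is added to its embedding.

import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1) word positions
    i = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2) even dimension indices
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                          # even indices -> sine
    pe[:, 1::2] = np.cos(angle)                          # odd indices  -> cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
print(pe.shape)            # (10, 8): one position vector per word
print(np.round(pe[1], 3))  # encoding of the word at position 1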
-

Transformers – Multi-Head Attention
Table Of Contents: Disadvantages Of The Self Attention Mechanism. What Is Multi-Head Attention? How Does Multi-Head Attention Work?
(1) Disadvantages Of Self Attention. The task is to read the sentence and tell its meaning. Meaning-1: An astronomer was standing and another man saw him with a telescope. Meaning-2: An astronomer was standing with a telescope and another man just saw him. From this sentence we get two different meanings of a single sentence. How Will Self Attention Work On This Sentence? Self attention will find the similarity of each word …
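A minimal NumPy sketch of multi-head attention with toy sizes and random weights: each head projects the same sentence with its own Query/Key/Value matrices, so different heads can capture different relationships (for example, one head linking "saw" with "man" and another linking "saw" with "telescope"), and the head outputs are concatenated and mixed by an output projection.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

d_model, n_heads = 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(0)

x = rng.normal(size=(6, d_model))   # embeddings of a 6-word sentence

heads = []
for h in range(n_heads):
    # Each head has its OWN projections, so it can focus on a different relationship.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))   # (6, d_head) per head

W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_o               # heads concatenated, then mixed

print(out.shape)   # (6, 8): one vector per word, combining all heads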
