-

Deep Learning – What Is Early Stopping?
Table Of Contents: What Is Early Stopping? Why Is Early Stopping Needed? How Does Early Stopping Work? Benefits Of Early Stopping. Visual Representation. Hyperparameter: Patience. Implementation In Keras (TensorFlow).
(7) Implementation In Keras (TensorFlow)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

# 1. Build the model (input_dim = number of input features)
model = Sequential([
    Dense(128, activation='relu', input_shape=(input_dim,)),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 2. Stop training when the validation loss stops improving
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stopping])
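The patience hyperparameter is easiest to see outside of any framework. Below is a minimal sketch of the logic that early stopping applies, using made-up validation-loss numbers (the values and variable names are illustrative, not from the article):

# Minimal sketch of the "patience" logic behind early stopping (made-up loss values).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54]  # pretend validation loss per epoch

patience = 3                 # how many epochs without improvement we tolerate
best_loss = float("inf")
best_epoch = -1
wait = 0                     # epochs elapsed since the last improvement

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:     # improvement: remember it and reset the counter
        best_loss, best_epoch = loss, epoch
        wait = 0
    else:                    # no improvement: count it
        wait += 1
        if wait >= patience: # patience exhausted, stop training
            print(f"Stopping at epoch {epoch}; best epoch was {best_epoch} with loss {best_loss}")
            break

With patience = 3 the loop stops at epoch 6, three epochs after the best validation loss at epoch 3, which is the state that restore_best_weights=True would roll back to in Keras.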
-

Transformer – Prediction Process
Table Of Contents: Prediction Setup Of The Transformer. Step-By-Step Flow Of The Input Sentence “We Are Friends!”. Decoder Processing For The Other Timesteps.
(1) Prediction Setup Of The Transformer.
Input Dataset: For simplicity we will take these 3 rows as input, but in reality we would have thousands of rows. We will use this dataset to train our Transformer model.
Query Sentence: We will pass this sentence for translation, Sentence = “We Are Friends!”
(2) Step-By-Step Flow Of The Input Sentence “We Are Friends!”.
The Transformer is mainly divided into an Encoder and a Decoder. The Encoder will process the source sentence first, and its output is then passed to the Decoder.
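The prediction flow described above boils down to a loop: the Encoder runs once over the source sentence, and the Decoder is called repeatedly, feeding its own previous outputs back in until it emits an end-of-sentence token. A minimal sketch, assuming hypothetical encode() and decode_step() callables (stand-ins for a trained model, not real API calls):

# Sketch of autoregressive translation with a Transformer (encode / decode_step are hypothetical).
def translate(source_tokens, encode, decode_step, start_token="<sos>", end_token="<eos>", max_len=20):
    # 1. The Encoder runs exactly once over the whole source sentence.
    encoder_output = encode(source_tokens)                    # e.g. ["We", "Are", "Friends", "!"]

    # 2. The Decoder produces one target token per timestep.
    generated = [start_token]
    for _ in range(max_len):
        next_token = decode_step(generated, encoder_output)   # looks at everything produced so far
        generated.append(next_token)
        if next_token == end_token:                           # stop once the model predicts <eos>
            break
    return generated[1:]                                      # drop the <sos> marker

Inside decode_step the generated tokens would go through masked self-attention and then cross-attention over encoder_output, which is what the later timestep sections walk through.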
-

Transformer – Decoder Architecture
Table Of Contents: What Is The Work Of The Decoder In A Transformer? Overall Decoder Architecture. Understanding The Decoder Workflow With An Example. Understanding The Decoder's 2nd Part.
(1) What Is The Work Of The Decoder In A Transformer?
In a Transformer model, the Decoder plays a crucial role in generating output sequences from the encoded input. It is mainly used in sequence-to-sequence (Seq2Seq) tasks such as machine translation, text generation, and summarization.
(2) Overall Decoder Architecture.
In the original Transformer paper we have 6 decoder modules connected in series. The output from one decoder module is passed as input to the next.
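The "connected in series" wiring can be shown with a small sketch: the target-side input enters decoder module 1, its output becomes the input of module 2, and so on for all 6 modules. Here make_layer is a hypothetical factory for one decoder module, not a real library call:

# Sketch of 6 decoder modules wired in series (make_layer builds one hypothetical decoder module).
class DecoderStack:
    def __init__(self, make_layer, num_layers=6):
        self.layers = [make_layer() for _ in range(num_layers)]   # 6 modules, as in the original paper

    def __call__(self, target_embeddings, encoder_output):
        x = target_embeddings
        for layer in self.layers:
            x = layer(x, encoder_output)   # output of one decoder module feeds the next
        return x                           # output of the last module goes to the final linear + softmax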
-

Transformers – Cross Attention
Table Of Contents: Where Is The Cross Attention Block Applied In Transformers? What Is Cross Attention? How Does Cross Attention Work? Where Do We Use The Cross Attention Mechanism?
(1) Where Is The Cross Attention Block Applied In Transformers?
In the diagram above you can see that this Multi-Head Attention block is known as “Cross Attention”. The difference from the other Multi-Head Attention blocks is that in those blocks the three inputs, the Query, Key and Value vectors, are generated from a single source, whereas in the Cross Attention block the Query vector comes from the Decoder and the Key and Value vectors come from the Encoder output.
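A single-head NumPy sketch makes the wiring concrete: the Query projection is applied to the decoder states, while the Key and Value projections are applied to the encoder output (the sizes, random data and untrained weight matrices below are purely illustrative):

# Sketch of single-head cross attention: Q from the decoder, K and V from the encoder.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model = 8
decoder_states = np.random.randn(3, d_model)   # 3 target tokens produced so far
encoder_output = np.random.randn(4, d_model)   # 4 source tokens ("We", "Are", "Friends", "!")

W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))

Q = decoder_states @ W_q      # Queries come from the DECODER
K = encoder_output @ W_k      # Keys come from the ENCODER output
V = encoder_output @ W_v      # Values come from the ENCODER output

scores = Q @ K.T / np.sqrt(d_model)        # (3 target tokens) x (4 source tokens)
attention_weights = softmax(scores, axis=-1)
context = attention_weights @ V            # each target token gets a weighted mix of source information
print(context.shape)                       # (3, 8)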
-

Transformer – Masked Self Attention
Table Of Contents: Transformer Decoder Definition. What Is An Autoregressive Model? Let's Prove The Transformer Decoder Definition. How To Implement The Parallel Processing Logic While Training The Transformer Decoder? Implementing Masked Self Attention.
(1) Transformer Decoder Definition
From the above definition we can understand that the Transformer behaves autoregressively during prediction and non-autoregressively during training. This is displayed in the diagram below.
(2) What Is An Autoregressive Model?
Suppose you are building a Machine Learning model whose job is to predict the stock price: on Monday it predicted 29 and on Tuesday 25 …
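Masked self-attention is what makes the parallel, non-autoregressive training possible while keeping the model honest: before the softmax, the scores for "future" positions are set to negative infinity, so position t can only attend to positions up to t. A small NumPy sketch with random scores (the sequence length is illustrative):

# Sketch of the causal mask used in masked self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)           # raw attention scores, i.e. Q @ K.T / sqrt(d_k)

# Causal mask: True above the diagonal marks "future" positions that must be hidden.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 2))   # row t has zeros after column t: no peeking at future tokens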
-

Transformers – Encoder Architecture
Table Of Contents: What Is The Encoder In A Transformer? Internal Workings Of The Encoder Module. How The Encoder Module Works With An Example. Why Do We Use An Addition Operation With The Original Input Again In The Encoder Module?
(1) What Is The Encoder In A Transformer?
In a Transformer model, the Encoder is responsible for processing input data (like a sentence) and transforming it into a meaningful contextual representation that can be used by the Decoder (in tasks like translation) or directly for classification. Encoding is necessary because it transforms words into numerical format (embeddings), allows self-attention to analyze relationships between words, and adds …
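The "addition with the original input" is the residual (skip) connection of the Add & Norm step: each sub-layer's output is added back to that sub-layer's input before layer normalization, so the original signal is never lost as blocks stack up. A minimal sketch of one encoder block using standard Keras layers (the dimensions are illustrative; this is not the article's exact code):

# Sketch of one Transformer encoder block with "Add & Norm" residual connections (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers

d_model, num_heads, d_ff, seq_len = 64, 4, 128, 10

inputs = tf.keras.Input(shape=(seq_len, d_model))      # already-embedded input tokens

# Sub-layer 1: multi-head self-attention, then add the ORIGINAL input back and normalize
attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(inputs, inputs)
x = layers.LayerNormalization()(layers.Add()([inputs, attn]))

# Sub-layer 2: position-wise feed-forward network, then another Add & Norm
ff = layers.Dense(d_ff, activation="relu")(x)
ff = layers.Dense(d_model)(ff)
encoder_output = layers.LayerNormalization()(layers.Add()([x, ff]))

encoder_block = tf.keras.Model(inputs, encoder_output)
encoder_block.summary()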
-

Transformers – Layered Normalization
Table Of Contents: What Is Normalization? What Is Batch Normalization? Why Does Batch Normalization Not Work On Sequential Data?
(1) What Is Normalization? What Are We Normalizing?
Generally you normalize the input values that you pass to the neural network, and you can also normalize the output of a hidden layer. We normalize the hidden layer output because the hidden layer may produce numbers over a large range, so we need to normalize them to bring them into a fixed range.
Benefits Of Normalization.
(2) What Is Batch Normalization?
https://www.praudyog.com/deep-learning-tutorials/transformers-batch-normalization/
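The practical difference between the two normalizations is just the axis over which the mean and variance are taken: batch normalization computes statistics per feature across the batch, while layer normalization computes them per token across that token's own features, so it does not depend on batch size or sequence length. A small NumPy sketch with random data and no learnable scale/shift:

# Sketch of batch normalization vs layer normalization on a batch of sequences.
import numpy as np

x = np.random.randn(2, 5, 8)   # (batch of 2 sentences, 5 tokens each, 8 features per token)
eps = 1e-5

# Batch normalization: statistics per feature, computed ACROSS the batch (and positions)
bn_mean = x.mean(axis=(0, 1), keepdims=True)       # shape (1, 1, 8)
bn_var = x.var(axis=(0, 1), keepdims=True)
x_batchnorm = (x - bn_mean) / np.sqrt(bn_var + eps)

# Layer normalization: statistics per token, computed ACROSS its own features only
ln_mean = x.mean(axis=-1, keepdims=True)           # shape (2, 5, 1)
ln_var = x.var(axis=-1, keepdims=True)
x_layernorm = (x - ln_mean) / np.sqrt(ln_var + eps)

print(x_batchnorm.shape, x_layernorm.shape)        # both (2, 5, 8)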
-

Deep Learning – Batch Normalization.
Table Of Contents: What Is Batch Normalization? Why Is Batch Normalization Needed? Example Of Batch Normalization. Why Is Internal Covariate Shift (ICS) A Problem If Different Distributions Are Natural?
(1) What Is Batch Normalization?
Batch Normalization is a technique used in Deep Learning to speed up training and improve stability by normalizing the inputs of each layer. Batch Normalization keeps activations stable by normalizing each layer’s output. Without Batch Normalization, training can be unstable, converge slowly, or end up overfitting or underfitting.
Special Note: If at every …
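Concretely, for each feature batch normalization takes the values in the current mini-batch, subtracts their mean, divides by their standard deviation, and then applies a learnable scale (gamma) and shift (beta). A minimal NumPy sketch of the training-time computation; the numbers are made up to show two features on very different scales:

# Sketch of batch normalization on one mini-batch (made-up values, gamma/beta left at identity).
import numpy as np

batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])        # 3 samples, 2 features on very different scales

gamma, beta, eps = 1.0, 0.0, 1e-5       # learnable scale and shift

mean = batch.mean(axis=0)               # per-feature mean over the mini-batch
var = batch.var(axis=0)                 # per-feature variance over the mini-batch
normalized = (batch - mean) / np.sqrt(var + eps)
output = gamma * normalized + beta

print(normalized.round(2))              # both features now have mean 0 and unit variance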
-

Transformers – Positional Encoding in Transformers
Table Of Contents: What Is Positional Encoding In Transformers? Why Do We Need Positional Encoding? How Does Positional Encoding Work? Positional Encoding In The “Attention Is All You Need” Paper. Interesting Observations In The Sine & Cosine Curves. How Does Positional Encoding Capture The Relative Position Of Words?
(1) What Is Positional Encoding In Transformers?
Positional Encoding is a technique used in Transformers to add order (position) information to input sequences. Since Transformers do not have built-in sequence awareness (unlike RNNs), they use positional encodings to help the model understand the order of words in a sentence.
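The sinusoidal encoding from the “Attention Is All You Need” paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); each row of the resulting matrix is simply added to the embedding of the word at that position. A small NumPy sketch of the formula (max_len and d_model are illustrative values):

# Sketch of sinusoidal positional encoding as defined in the original paper.
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                      # (max_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                             # (max_len, d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=10, d_model=16)
print(pe.shape)        # (10, 16); row p is added to the embedding of the word at position p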
-

Transformers – Multi-Head Attention in Transformers
Table Of Contents: Disadvantages Of The Self Attention Mechanism. What Is Multi-Head Attention? How Does Multi-Head Attention Work?
(1) Disadvantages Of Self Attention.
The task is to read the sentence and tell its meaning.
Meaning-1: An astronomer was standing and another man saw him with a telescope.
Meaning-2: An astronomer was standing with a telescope and another man just saw him.
From this single sentence we get two different meanings.
How Will Self Attention Work On This Sentence?
Self attention will find out the similarity of each word with every other word in the sentence.
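Multi-head attention runs several self-attention heads in parallel, each with its own Query/Key/Value projections, so different heads can focus on different relationships (for the telescope sentence, one head might link "telescope" to the astronomer and another to the man who saw him); the head outputs are concatenated and projected back to the model dimension. A minimal NumPy sketch with random, untrained weights and illustrative sizes:

# Sketch of multi-head attention: several heads in parallel, outputs concatenated and projected.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, d_k):
    # Each head gets its own (random, untrained) projection matrices.
    d_model = x.shape[-1]
    W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V                                  # (seq_len, d_k)

seq_len, d_model, num_heads = 6, 16, 4
d_k = d_model // num_heads
x = np.random.randn(seq_len, d_model)                   # embeddings of the words in the sentence

heads = [attention_head(x, d_k) for _ in range(num_heads)]
concat = np.concatenate(heads, axis=-1)                 # (seq_len, num_heads * d_k) = (6, 16)
W_o = np.random.randn(num_heads * d_k, d_model)
output = concat @ W_o                                   # final multi-head attention output
print(output.shape)                                     # (6, 16)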
