What Is A Dropout Layer?
Table Of Contents:
- What Is A Dropout Layer?
- What Happens In The Training Stage?
- What Happens In The Testing Stage?
- Why Do We Scale The Weights When Using Dropout?
- What Happens In The Testing Stage To Neurons That Were Dropped During Training?
- How Does Dropout Prevent Overfitting?
(1) What Is A Dropout Layer?
- The Dropout Layer is a regularization technique used in deep learning neural networks to prevent overfitting.
- Overfitting occurs when a model performs exceptionally well on the training data but fails to generalize well to new, unseen data.
- The Dropout Layer works by randomly “dropping out” (i.e., temporarily deactivating) a proportion of the neurons in a neural network during the training process.
- This means that on each training iteration, a random subset of the neurons is ignored: their outputs are set to zero, and their weights are not updated for that step.
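As a minimal sketch of the idea (a NumPy illustration with hypothetical activation values, not a framework implementation), dropping neurons amounts to multiplying a layer's outputs by a random binary mask:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations coming out of a hidden layer.
activations = np.array([0.5, 1.2, -0.3, 0.8, 2.0, -1.1, 0.4, 0.9])

p = 0.25  # dropout rate: each neuron is dropped with probability p

# Sample a binary mask: 1 = keep the neuron, 0 = drop it.
keep_mask = (rng.random(activations.shape) >= p).astype(float)

# Dropped neurons output zero for this training iteration.
dropped = activations * keep_mask
print(dropped)
```

Each training iteration resamples the mask, so a different subset of neurons is silenced every time.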
(2) What Happens In The Training Stage?
Neuron Deactivation: On each training iteration, a randomly selected subset of the neurons in the Dropout Layer is temporarily “dropped out” (deactivated). The outputs of these neurons are set to zero, and their weights are not updated during the backpropagation step.
Scaled Outputs: To compensate for the reduced number of active neurons, the outputs of the remaining active neurons are scaled up by a factor of 1 / (1 – p), where p is the Dropout rate (a scheme known as inverted dropout). This ensures that the expected output of the layer remains roughly the same during both training and inference.
Stochastic Regularization: By randomly dropping out neurons, the Dropout Layer introduces stochasticity into the training process. This has the effect of regularizing the model, which helps to prevent overfitting and improve generalization.
Ensemble Effect: The Dropout Layer can be seen as creating an ensemble of smaller sub-networks within the larger network. During each training iteration, a different set of neurons is dropped out, effectively training a different sub-network. The final model is then an ensemble of these sub-networks, which can lead to improved performance.
No Dropout during Inference: During the inference (or testing) stage, the Dropout Layer is typically disabled, and all the neurons are active. This ensures that the model can make predictions on new, unseen data using the full capacity of the network.
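The training-stage behaviour above can be sketched as follows (a NumPy illustration; the input values and dropout rate are arbitrary). Averaging many stochastic passes shows that the 1 / (1 – p) scaling preserves the activations in expectation:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_train(x, p, rng):
    # One training-time forward pass with inverted dropout:
    # drop each unit with probability p, scale survivors by 1 / (1 - p).
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

x = np.array([1.0, 2.0, 3.0, 4.0])
p = 0.5

# Averaging many stochastic passes recovers the original activations
# in expectation, thanks to the 1 / (1 - p) scaling.
avg = np.mean([dropout_train(x, p, rng) for _ in range(20000)], axis=0)
print(avg)  # each entry is close to the corresponding entry of x
```

Any single pass looks very different from `x` (half the entries are zero, the rest doubled), but the average over many sampled masks converges to the original activations.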
(3) What Happens In The Testing Stage?
No Dropout: Unlike the training stage, the Dropout Layer is typically disabled during the testing/inference stage. This means that all the neurons in the Dropout Layer are active and are not randomly dropped out.
No Scaling Of Weights: Since the 1 / (1 – p) scaling was already applied during training (inverted dropout), no scaling is needed during testing. The weights remain at their learned values without any adjustment.
Deterministic Prediction: Without the stochastic nature of dropout, the network’s output during testing becomes deterministic. This means that for a given input, the network will always produce the same output.
Ensemble Averaging: While the Dropout Layer is disabled during testing, the model can be thought of as an ensemble of the different sub-networks that were trained during the training phase. The final prediction is essentially an average of the predictions made by these sub-networks.
Reduced Overfitting: The Dropout technique applied during training helps to reduce overfitting, as the network is forced to learn more robust features that generalize well to new, unseen data. This typically leads to better performance on the test set compared to a network trained without Dropout.
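A sketch of the test-stage behaviour, assuming the inverted-dropout convention described above (the `dropout` helper and input values are hypothetical):

```python
import numpy as np

def dropout(x, p, training, rng=None):
    # Inverted dropout: stochastic masking plus scaling during training,
    # a plain identity (no-op) at inference time.
    if not training:
        return x  # all neurons active, no scaling needed
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

x = np.array([0.7, -1.3, 2.1])
y1 = dropout(x, p=0.5, training=False)
y2 = dropout(x, p=0.5, training=False)

# Inference is deterministic: identical inputs give identical outputs.
print(np.array_equal(y1, y2))  # True
```

This mirrors how deep learning frameworks switch a dropout layer between training and evaluation modes: the layer simply passes activations through unchanged at inference time.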
(4) Why Do We Scale The Weights When Using Dropout?
Maintaining Output Magnitude: During the training stage, when Dropout is applied, a random subset of the neurons in a layer is temporarily deactivated (their outputs are set to zero). This means that the effective number of active neurons in the layer is reduced.
Compensating For Reduced Neurons: If the outputs of the remaining active neurons were not scaled, the overall output of the layer would shrink due to the smaller number of active neurons. This would shift the distribution of activations throughout the network, which could negatively impact the training process.
Scaling Formula: To compensate for the reduced number of active neurons, the outputs of the remaining active neurons are scaled up by a factor of 1 / (1 – p), where p is the Dropout rate. For example, if the Dropout rate is 0.5, the outputs are scaled up by a factor of 2, since 1 / (1 – 0.5) = 2.
Maintaining Consistency: This scaling ensures that the expected output magnitude of the layer remains the same during both training (with Dropout) and testing (without Dropout). It keeps the network's behaviour consistent and removes the need for additional adjustments or normalization at test time.
Improved Convergence: The scaling allows the network to converge more effectively during training, as the activations and gradients remain within a similar range, regardless of the Dropout pattern applied.
Conclusion:
Without this scaling, the network would have to learn to compensate for the changing output magnitudes caused by the Dropout layer, which could slow down training and make it harder for the network to converge to a good solution.
In summary, the 1 / (1 – p) scaling is a crucial component of the Dropout technique: it keeps the network's behaviour consistent, facilitates efficient training, and leads to better generalization performance on new, unseen data.
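The scaling arithmetic can be checked directly. Each unit survives with probability 1 – p, so the expected value of a masked-and-scaled output equals the original output (example values are hypothetical):

```python
import numpy as np

p = 0.5                   # dropout rate
scale = 1.0 / (1.0 - p)   # scaling factor applied to surviving outputs
print(scale)  # 2.0

# Each unit survives with probability 1 - p, so the expected value of
# the masked-and-scaled output is x * (1 - p) * scale = x.
x = np.array([0.4, 1.5, -0.9])
expected_output = x * (1.0 - p) * scale
print(np.allclose(expected_output, x))  # True
```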
(5) What Happens In The Testing Stage To Neurons That Were Dropped During Training?
No Dropout Applied: During the testing/inference stage, the Dropout layer is typically disabled, meaning that all the neurons in the Dropout layer are active, and none of them are randomly dropped out.
No Weights Were Ever Zeroed: Dropout zeroes the outputs of neurons during training; the stored weights themselves are never set to zero or otherwise modified by the Dropout layer. During testing, the network simply uses the weights at their learned values.
Previously Dropped Neurons Are Active: Neurons whose outputs were zeroed in some training iterations are now always active and contribute to the network’s output during testing.
Ensemble Averaging Effect: The network’s output during testing can be seen as an approximate average over the different sub-networks sampled during Dropout-enabled training. The reactivated neurons contribute to this ensemble effect.
Performance Implications: Reactivating the previously dropped neurons during testing can affect the model’s performance in the following ways:
- If a neuron learned little of value, its reactivation during testing has little impact.
- If a neuron learned useful features, its reactivation during testing helps the network achieve better generalization and improved performance on the test set.
Conclusion:
In summary, during the testing/inference stage, the neurons whose outputs were zeroed by the Dropout layer during training are all active, and the learned weights (which were never actually zeroed) contribute fully to the network’s output.
The deterministic nature of the testing stage, without the stochastic Dropout process, ensures that the network’s behavior is consistent and representative of its true performance on new, unseen data.
My Note:
- If Epochs = 100, the model makes 100 full passes over the training data; the Dropout mask is resampled on every forward pass (typically every mini-batch), not once per epoch.
- If the Dropout rate = 0.2, each neuron in the layer is dropped independently with probability 0.2, so on average 20 % of the neurons are inactive on any given pass.
- Over the course of training we therefore sample many different sub-networks (up to 2^n for n droppable neurons), far more than one per epoch.
- During the training stage, the outputs of the surviving neurons are scaled up by 1 / (1 – p) to compensate for the inactive neurons (inverted dropout).
- In the testing stage, no neurons are dropped and no extra scaling is applied; the weights are used at their learned values.
- There is no explicit voting over many networks at test time: a single forward pass through the full network approximates the average of all the sub-networks sampled during training.
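The ensemble-averaging point can be illustrated numerically on a toy one-layer linear model (weights and input are hypothetical): averaging many sampled sub-networks trained with inverted dropout approximates the single deterministic test-time pass.

```python
import numpy as np

rng = np.random.default_rng(7)
p = 0.5

# A toy one-layer linear "network" y = w . x with hypothetical values.
w = np.array([0.5, -1.0, 2.0, 0.25])
x = np.array([1.0, 3.0, -2.0, 4.0])

def subnetwork_output(rng):
    # One sampled sub-network: drop units, scale survivors (inverted dropout).
    keep = rng.random(x.shape) >= p
    return w @ (x * keep / (1.0 - p))

# Averaging many sampled sub-networks approximates the single
# deterministic test-time pass with all neurons active.
ensemble_avg = np.mean([subnetwork_output(rng) for _ in range(50000)])
test_output = w @ x
print(ensemble_avg, test_output)  # the two values are close
```

For a purely linear layer this agreement is exact in expectation; with nonlinearities it is only an approximation, but one that works well in practice.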
(6) How Does Dropout Prevent Overfitting?

