Dropout¶
The Dropout layer implements a powerful regularization technique widely used in deep neural networks to prevent overfitting. During training, it randomly zeroes some of the elements of the input tensor with probability $p$. This forces the network to learn more robust features and prevents the co-adaptation of neurons.
Mathematical Definition: Inverted Dropout¶
sorix uses a common implementation called Inverted Dropout. During training, for each element $x$ of the input tensor, the output $y$ is computed as:
$$ y = \begin{cases} 0 & \text{with probability } p \\ \frac{x}{1-p} & \text{with probability } 1-p \end{cases} $$
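As a sanity check, the piecewise rule above can be sketched directly in NumPy. This is an illustrative standalone function, not sorix's internal implementation:

```python
import numpy as np

def inverted_dropout(x, p, rng):
    # Zero each element with probability p, then scale the survivors
    # by 1/(1-p) so the expected value of the output equals x.
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = inverted_dropout(x, p=0.3, rng=rng)
print(f"fraction zeroed: {np.mean(y == 0):.3f}")  # close to 0.3
print(f"mean of output:  {y.mean():.3f}")         # close to 1.0
```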
Why do we use the scaling factor $\frac{1}{1-p}$?¶
When we drop neurons during training, the total "signal" passing through the layer is reduced. With $100$ neurons and $p=0.5$, on average only $50$ will be active, so without compensation the next layer would receive only half of the expected activation magnitude compared to the case where all neurons are active.
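This dilution is easy to verify numerically: dropping half the elements of an all-ones vector *without* rescaling roughly halves the mean (a hypothetical NumPy sketch, not sorix code):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.ones(10_000)
mask = rng.random(x.shape) >= 0.5   # keep each element with probability 0.5
y_unscaled = x * mask               # naive dropout: no 1/(1-p) scaling
print(f"mean without scaling: {y_unscaled.mean():.3f}")  # roughly 0.5
```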
To ensure that the model behaves consistently during both training and inference without changing the architecture, we must keep the expected value ($\mathbb{E}$) of the activations constant.
1. Expected Value during Training¶
Let $x$ be the input value. The expected value of the output $y$ during training (using the scaling factor $s$) is:
$$\mathbb{E}[y_{train}] = p \cdot (0) + (1-p) \cdot (x \cdot s)$$ $$\mathbb{E}[y_{train}] = (1-p) \cdot x \cdot s$$
2. Expected Value during Inference¶
During inference (evaluation mode), Dropout is bypassed: every neuron stays active, the mask is effectively all $1$s, and no scaling is applied in sorix:
$$\mathbb{E}[y_{eval}] = 1 \cdot x = x$$
3. Solving for $s$¶
To make the training phase match the inference phase in terms of activation magnitude, we set $\mathbb{E}[y_{train}] = \mathbb{E}[y_{eval}]$:
$$(1-p) \cdot x \cdot s = x \implies s = \frac{1}{1-p}$$
This is why we divide by $1-p$. By doing this during training, we don't need to do anything special during inference, making the deployment phase identical to a standard identity mapping.
Backward Computation (Gradient)¶
During the backward pass, the gradient $\frac{\partial \mathcal{L}}{\partial y}$ is propagated only through the elements that were not zeroed during the forward pass. The scaling factor $1/(1-p)$ is also applied to the gradient to maintain consistency.
Let $M$ be the binary mask used in the forward pass ($M_i = 1$ with probability $1-p$, and $M_i = 0$ otherwise). The output was computed as: $$y = \frac{x \odot M}{1-p}$$
The gradient w.r.t the input $x$ is: $$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \odot \frac{M}{1-p}$$
In simple terms: if a neuron was "killed" (zeroed) in the forward pass, it does not contribute to the gradient in the backward pass. If it survived, its gradient is scaled just like its value was.
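To make this concrete, here is a hypothetical NumPy sketch of the backward step, reusing a cached mask $M$ from the forward pass (the names are illustrative, not sorix internals):

```python
import numpy as np

p = 0.5
M = np.array([0.0, 1.0, 0.0, 1.0, 1.0])   # binary mask saved in the forward pass
grad_output = np.ones_like(M)              # dL/dy for L = sum(y)
grad_input = grad_output * M / (1.0 - p)   # dL/dx: zeroed or scaled by 1/(1-p)
print(grad_input)  # [0. 2. 0. 2. 2.]
```

Note that the gradient of a dropped element is exactly $0$, while a surviving element's gradient is multiplied by $1/(1-p) = 2$, mirroring the forward scaling.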
Training vs Inference¶
Dropout behaves differently depending on the model's mode:
- **Training Mode** (`model.train()`): A random binary mask is generated. Elements are zeroed with probability $p$ and survivors are scaled by $1/(1-p)$.
- **Evaluation Mode** (`model.eval()`): Dropout is bypassed ($y = x$). The full power of the ensemble-like trained network is used.
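Both modes can be summarized in a single hypothetical function. The `training` flag here is illustrative; sorix toggles this behavior via `model.train()` / `model.eval()`:

```python
import numpy as np

def dropout(x, p, training, rng):
    if not training:
        return x  # evaluation mode: identity mapping
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)  # training mode: drop and rescale

rng = np.random.default_rng(1)
x = np.linspace(1.0, 5.0, 5)
print(dropout(x, p=0.5, training=False, rng=rng))  # [1. 2. 3. 4. 5.]
```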
Functional View¶
The Dropout layer is a zero-parameter layer. It has a hyperparameter $p$, but it does not learn weights via backpropagation. However, it does affect the gradient flow.
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
import numpy as np
from sorix import tensor
from sorix.nn import Dropout
import sorix
Demonstrating Signal Conservation¶
Let's see how the average value of a tensor stays roughly the same after passing through Dropout with scaling.
# Create a large tensor of ones
n_elements = 100000
X = tensor(np.ones((1, n_elements)))
print(f"Average of input: {np.mean(X.data):.4f}")
p = 0.3
dropout = Dropout(p=p)
# 1. In Training mode (Expected signal should be ~1.0)
dropout.train()
Y_train = dropout(X)
print(f"Average in TRAINING (p={p}): {np.mean(Y_train.data):.4f}")
# 2. In Eval mode (Signal is exactly 1.0)
dropout.eval()
Y_eval = dropout(X)
print(f"Average in EVALUATION: {np.mean(Y_eval.data):.4f}")
Average of input: 1.0000
Average in TRAINING (p=0.3): 1.0006
Average in EVALUATION: 1.0000
Backward Pass Example¶
We can observe how the gradient is zeroed for dropped elements and scaled for others.
X = tensor([10.0, 20.0, 30.0, 40.0, 50.0], requires_grad=True)
print(f"Input X: {X.data}")
p = 0.5
dropout = Dropout(p=p)
dropout.train()
Y = dropout(X)
print(f"Output Y: {Y.data}")
# We compute the gradient of the sum of outputs
Y.sum().backward()
print(f"Gradients of X: {X.grad}")
print(f"Scaling factor (1/(1-p)): {1/(1-p):.2f}")
Input X: [10. 20. 30. 40. 50.]
Output Y: [  0.  40.   0.  80. 100.]
Gradients of X: tensor([0., 2., 0., 2., 2.], dtype=sorix.float64)
Scaling factor (1/(1-p)): 2.00