CrossEntropyLoss in Deep Learning¶
In this notebook, we dive deep into one of the most fundamental loss functions for classification: Cross-Entropy Loss. We will explore its information-theory roots, the role of the Softmax function, and why combining them is both mathematically elegant and computationally efficient.
1. What is Cross-Entropy?¶
In information theory, Entropy measures the average level of "information", "surprise", or "uncertainty" inherent in a variable's possible outcomes.
Cross-Entropy ($H(P, Q)$) measures the average number of bits needed to identify an event from a set of possibilities if we use a probability distribution $Q$ (the model's prediction) instead of the true distribution $P$.
Mathematically, for a discrete distribution: $$H(P, Q) = - \sum_{c=1}^{C} P(c) \ln(Q(c))$$
In classification:
- $P$ is the true distribution (usually a one-hot vector where the real class has probability 1).
- $Q$ is the predicted distribution (the output of our model after Softmax).
Since $P(c) = 1$ only for the correct class $y$ and $0$ for all others, the formula simplifies to: $$\text{Loss} = - \ln(Q(y))$$ where $Q(y)$ is the predicted probability for the correct class.
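This simplification is easy to check numerically. The sketch below uses NumPy with a made-up predicted distribution $Q$ (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical predicted distribution Q over 3 classes; true class is y = 0
Q = np.array([0.7, 0.2, 0.1])
P = np.array([1.0, 0.0, 0.0])  # one-hot true distribution

# Full cross-entropy sum vs. the simplified form -ln(Q[y])
full = -np.sum(P * np.log(Q))
simplified = -np.log(Q[0])

print(full, simplified)  # both ≈ 0.3567
```

The one-hot vector zeroes out every term except the one for the correct class, so the two values agree exactly.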
2. What is Softmax?¶
Neural networks usually output raw scores called logits ($x$). These scores can be any real number ($-\infty$ to $+\infty$). However, to interpret them as probabilities, they must satisfy two conditions:
- Each probability must be between 0 and 1.
- The sum of all probabilities must be 1.
The Softmax function transforms logits into probabilities: $$\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}}$$
Properties of Softmax:¶
- Exponentiation: Makes every value positive.
- Normalization: Ensures the sum is 1.
- Sensitivity: Differences between logits are amplified by the exponential, so the largest logit receives most of the probability mass, making the model "confident" in its predictions.
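A naive NumPy sketch of the definition above (not the numerically stable form discussed in the next section) shows both conditions holding:

```python
import numpy as np

def softmax(x):
    """Naive softmax: exponentiate every logit, then normalize to sum to 1."""
    e = np.exp(x)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # every entry lies in (0, 1)
print(probs.sum())  # 1.0
```

Note how the largest logit (2.0) ends up with roughly two thirds of the probability mass, even though it is only twice the second logit, illustrating the sensitivity property.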
3. Why combine Softmax and Cross-Entropy?¶
In Sorix (and PyTorch), CrossEntropyLoss takes raw logits as input and computes Softmax internally. This isn't just a matter of convenience; it's driven by two critical factors:
A. Numerical Stability¶
Softmax involves $e^x$. If $x = 1000$, $e^{1000}$ overflows in floating point. To solve this, we subtract the maximum value: $$\sigma(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$ This is mathematically equivalent (the common factor $e^{-\max(x)}$ cancels between numerator and denominator) but numerically stable. By implementing this inside the loss function, we ensure the intermediate probabilities never overflow, or underflow to zero before the logarithm is applied (which would produce $\ln(0) = -\infty$).
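A small NumPy comparison makes the trick concrete. The naive version would overflow on large logits, while the max-shifted version handles them without issue:

```python
import numpy as np

def softmax_naive(x):
    # Overflows for large x: np.exp(1000.0) evaluates to inf
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting the max leaves the result unchanged but keeps exponents <= 0
    e = np.exp(x - np.max(x))
    return e / e.sum()

big = np.array([1000.0, 999.0])
# softmax_naive(big) would produce inf / inf = nan
stable = softmax_stable(big)
print(stable)  # same result as softmax([1.0, 0.0]) ≈ [0.731, 0.269]
```

After the shift, the logits become `[0.0, -1.0]`, which exponentiate safely; the output is identical to what the naive formula would give with infinite precision.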
B. Mathematical Elegance (The Gradient)¶
When we take the derivative of the combined function, the complex terms cancel out perfectly, leading to a simple and fast gradient calculation.
4. Step-by-Step Mathematical Derivation¶
Let $x$ be the logits, $y$ the index of the correct class, and $P_i = \sigma(x)_i$ the predicted probability.
The loss for a single sample is: $$L = - \ln(P_y) = - \ln\left( \frac{e^{x_y}}{\sum_j e^{x_j}} \right)$$
Using log properties: $\ln(a/b) = \ln(a) - \ln(b)$ $$L = - [ \ln(e^{x_y}) - \ln(\sum_j e^{x_j}) ] = \ln(\sum_j e^{x_j}) - x_y$$
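This rearranged form, $\ln(\sum_j e^{x_j}) - x_y$, can be verified directly in NumPy with illustrative logits:

```python
import numpy as np

x = np.array([2.0, 1.0, 0.1])  # illustrative logits
y = 0                          # index of the correct class

# Direct form: -ln(softmax(x)[y])
P = np.exp(x) / np.exp(x).sum()
loss_direct = -np.log(P[y])

# Rearranged form: ln(sum_j e^{x_j}) - x_y
loss_logsumexp = np.log(np.exp(x).sum()) - x[y]

print(loss_direct, loss_logsumexp)  # both ≈ 0.4170
```

The second form is the one libraries actually compute (as a stable "log-sum-exp"), since it never materializes the probabilities before taking the logarithm.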
Now, let's calculate the gradient with respect to a logit $x_i$ ($\frac{\partial L}{\partial x_i}$):
Case 1: $i$ is not the target class ($i \neq y$) $$\frac{\partial L}{\partial x_i} = \frac{\partial}{\partial x_i} \ln(\sum_j e^{x_j}) - 0 = \frac{1}{\sum_j e^{x_j}} \cdot e^{x_i} = P_i$$
Case 2: $i$ is the target class ($i = y$) $$\frac{\partial L}{\partial x_i} = \frac{\partial}{\partial x_i} \ln(\sum_j e^{x_j}) - \frac{\partial}{\partial x_i} x_i = P_i - 1$$
Combined result: $$\frac{\partial L}{\partial x} = P - Y$$ where $Y$ is the one-hot vector of the target. The gradient is simply the difference between our predicted probabilities and the truth ($P - Y$). If we average over a batch of $n$ samples: $$\text{Gradient} = \frac{1}{n}(P - Y)$$
This is beautiful! The gradient points exactly in the direction that decreases the probability of wrong classes and increases the probability of the correct class.
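We can sanity-check the $P - Y$ result against a finite-difference gradient. The NumPy sketch below uses illustrative logits and compares the analytic formula with central differences:

```python
import numpy as np

def ce_loss(x, y):
    """Single-sample cross-entropy in the log-sum-exp form derived above."""
    return np.log(np.exp(x).sum()) - x[y]

x = np.array([2.0, 1.0, 0.1])
y = 0

# Analytic gradient: P - Y
P = np.exp(x) / np.exp(x).sum()
Y = np.eye(3)[y]
grad_analytic = P - Y

# Numerical gradient via central differences
eps = 1e-6
grad_numeric = np.array([
    (ce_loss(x + eps * np.eye(3)[i], y) - ce_loss(x - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

As a side note, the entries of $P - Y$ always sum to zero (both $P$ and $Y$ sum to 1), so the update pushes probability mass toward the correct class exactly as fast as it removes it from the wrong ones.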
import numpy as np
from sorix import tensor
from sorix.nn import CrossEntropyLoss
# Logits for 3 classes
logits = tensor([[2.0, 1.0, 0.1], [0.0, 5.0, 0.2]], requires_grad=True)
targets = tensor([0, 1]) # Correct labels: index 0 and index 1
criterion = CrossEntropyLoss()
loss = criterion(logits, targets)
print(f"Logits sample 1: {logits.numpy()[0]} -> Prob for class 0 should be high")
print(f"Cross Entropy Loss: {loss.item():.4f}")
loss.backward()
print(f"\nGradients w.r.t logits (P - Y) / n:\n{logits.grad}")
Logits sample 1: [2. 1. 0.1] -> Prob for class 0 should be high
Cross Entropy Loss: 0.2159
Gradients w.r.t logits (P - Y) / n:
tensor([[-0.17049944, 0.12121648, 0.04928295],
[ 0.00331929, -0.00737348, 0.00405419]])
5. Handling Class Imbalance: Weighted Cross Entropy¶
In many datasets, classes aren't distributed equally. The model might ignore the minority class because it represents a small portion of the total loss.
To solve this, we assign a weight vector $w$ where $w_c$ is the importance of class $c$. $$L = - \frac{1}{\sum_{i=1}^{n} w_{y_i}} \sum_{i=1}^{n} w_{y_i} \ln(P_{i, y_i})$$
This ensures that errors in minority classes (highly weighted) result in larger gradient updates, forcing the model to learn them.
Mathematical Derivation of the Weighted Gradient¶
When we use weights, the total loss is the weighted sum of individual losses, normalized by the sum of weights of the targets present in the batch.
Let $w_{y_k}$ be the weight assigned to the target class of sample $k$, and $S = \sum_{j=1}^{n} w_{y_j}$ the normalization factor.
The total loss is: $$L = \frac{1}{S} \sum_{j=1}^{n} w_{y_j} \ell_j$$ Where $\ell_j = - \ln(P_{j, y_j})$ is the standard cross-entropy for sample $j$.
To find the gradient with respect to a logit $x_{k, i}$ (for sample $k$ and class $i$), we apply the chain rule: $$\frac{\partial L}{\partial x_{k, i}} = \frac{\partial L}{\partial \ell_k} \cdot \frac{\partial \ell_k}{\partial x_{k, i}}$$
- First term: $\frac{\partial L}{\partial \ell_k} = \frac{w_{y_k}}{S}$
- Second term: From our previous derivation, we know $\frac{\partial \ell_k}{\partial x_{k, i}} = P_{k, i} - Y_{k, i}$
Final result: $$\frac{\partial L}{\partial x_{k, i}} = \frac{w_{y_k}}{S} (P_{k, i} - Y_{k, i})$$
In vector form for sample $k$: $$\nabla_{x_k} L = \frac{w_{y_k}}{S} (P_k - Y_k)$$
Note that when all weights are 1, $w_{y_k} = 1$ and $S = n$, which returns us to the original unweighted formula: $\frac{1}{n}(P_k - Y_k)$.
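The weighted gradient formula can likewise be verified with finite differences. The NumPy sketch below assumes the normalization by $S = \sum_j w_{y_j}$ described above (the PyTorch-style weighted mean); logits and weights are illustrative:

```python
import numpy as np

def weighted_ce(X, y, w):
    """Weighted cross-entropy, normalized by the sum of the targets' weights."""
    P = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)
    wy = w[y]                                  # weight of each sample's target class
    losses = -np.log(P[np.arange(len(y)), y])  # per-sample cross-entropy
    return (wy * losses).sum() / wy.sum()

X = np.array([[2.0, 1.0], [0.5, 1.5]])  # illustrative logits, 2 samples x 2 classes
y = np.array([1, 0])
w = np.array([1.0, 10.0])

# Analytic gradient: (w_{y_k} / S) * (P_k - Y_k)
P = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)
Y = np.eye(2)[y]
S = w[y].sum()
grad_analytic = (w[y][:, None] / S) * (P - Y)

# Finite-difference check over every logit
eps = 1e-6
grad_numeric = np.zeros_like(X)
for k in range(2):
    for i in range(2):
        Xp, Xm = X.copy(), X.copy()
        Xp[k, i] += eps
        Xm[k, i] -= eps
        grad_numeric[k, i] = (weighted_ce(Xp, y, w) - weighted_ce(Xm, y, w)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

Setting `w = np.array([1.0, 1.0])` makes $S = n$ and recovers the unweighted $\frac{1}{n}(P - Y)$ exactly.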
# Demonstrating Class Weights
weights = tensor([1.0, 10.0]) # Class 1 is 10x more important than Class 0
criterion_weighted = CrossEntropyLoss(weight=weights)
logits = tensor([[2.0, 1.0], [2.0, 1.0]], requires_grad=True)
targets = tensor([1, 1]) # Both samples are Class 1, but model predicted Class 0 (bad!)
loss_weighted = criterion_weighted(logits, targets)
print(f"Weighted Loss: {loss_weighted.item():.4f}")
loss_weighted.backward()
print(f"\nGradients are now balanced by weights:\n{logits.grad}")
Weighted Loss: 1.3133
Gradients are now balanced by weights:
tensor([[ 0.3655293, -0.3655293],
[ 0.3655293, -0.3655293]])
6. Practical Training Example¶
Let's see how a single linear layer learns to separate classes using CrossEntropyLoss.
from sorix.optim import SGD
from sorix.nn import Linear
x = tensor([[1.0, 0.0, 0.0]]) # Input vector
target = tensor([2]) # We want the output to be Class 2
model = Linear(3, 3)
optimizer = SGD(model.parameters(), lr=0.1)
criterion = CrossEntropyLoss()
for i in range(51):
    y_pred = model(x)
    loss = criterion(y_pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if i % 10 == 0:
        print(f"Step {i:2d} | Loss: {loss.item():.4f} | Logits: {y_pred.numpy().round(3).flatten()}")
Step  0 | Loss: 2.0499 | Logits: [ 0.221 -1.419 -1.514]
Step 10 | Loss: 0.5548 | Logits: [-0.82 -1.713 -0.178]
Step 20 | Loss: 0.2448 | Logits: [-1.25 -1.91 0.449]
Step 30 | Loss: 0.1491 | Logits: [-1.479 -2.034 0.802]
Step 40 | Loss: 0.1057 | Logits: [-1.629 -2.123 1.041]
Step 50 | Loss: 0.0814 | Logits: [-1.74 -2.191 1.22 ]