SGDMomentum
SGD with Momentum is an enhancement of standard SGD that helps it navigate high-curvature regions by accumulating information from past gradients in a velocity vector. This damps oscillations along steep directions and accelerates progress along flat, consistent ones, speeding up optimization.
Mathematical definition
Let $\theta$ represent the parameters and $\nabla \mathcal{L}(\theta_t)$ the gradient at time $t$. SGDMomentum maintains a velocity vector $v_t$:
$$ v_{t} = \mu \cdot v_{t-1} + \nabla \mathcal{L}(\theta_t) $$ $$ \theta_{t+1} = \theta_t - \eta \cdot v_t $$
where:
- $v_t$: Accumulated velocity at time $t$.
- $\mu$: Momentum coefficient (typically 0.9).
- $\eta$: Learning rate (the `lr` argument).
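The two update equations above can be traced by hand. The following sketch (plain Python, not the Sorix implementation) applies them to a single scalar parameter with $\mathcal{L}(\theta) = \theta^2$, so $\nabla \mathcal{L}(\theta) = 2\theta$:

```python
# Illustrative trace of the momentum update rule above
# (a hand-rolled sketch, not the Sorix optimizer).
mu, eta = 0.9, 0.01    # momentum coefficient and learning rate
theta, v = 5.0, 0.0    # initial parameter and velocity

for t in range(3):
    grad = 2 * theta           # gradient of L(theta) = theta^2
    v = mu * v + grad          # v_t = mu * v_{t-1} + grad
    theta = theta - eta * v    # theta_{t+1} = theta_t - eta * v_t
    print(f"step {t+1}: v = {v:.4f}, theta = {theta:.4f}")
# step 1 gives v = 10.0000, theta = 4.9000; the velocity keeps growing
# while the gradient points the same way, so later steps are larger.
```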
Implementation details
In Sorix, the SGDMomentum optimizer keeps track of the velocity vectors in a list (vts). These vectors are stored on the same device as the parameters, ensuring consistency across CPU and GPU setups.
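To make the bookkeeping concrete, here is a minimal sketch of how an optimizer of this shape could be structured. It uses plain dicts of NumPy arrays in place of Sorix tensors, and the names (`params`, `vts`, `zero_grad`, `step`) mirror the description above but are assumptions, not the actual Sorix source:

```python
import numpy as np

class SGDMomentumSketch:
    """Hypothetical sketch of an SGD-with-momentum optimizer.

    Parameters are dicts with "data" and "grad" NumPy arrays; the real
    Sorix optimizer operates on its own tensor type instead.
    """

    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = params
        self.lr = lr
        self.momentum = momentum
        # One velocity buffer per parameter, allocated with the same
        # shape (and, in a real implementation, on the same device).
        self.vts = [np.zeros_like(p["data"]) for p in params]

    def zero_grad(self):
        # Reset accumulated gradients before the next backward pass.
        for p in self.params:
            p["grad"] = np.zeros_like(p["data"])

    def step(self):
        # v_t = mu * v_{t-1} + grad;  theta_{t+1} = theta_t - lr * v_t
        for p, v in zip(self.params, self.vts):
            v *= self.momentum
            v += p["grad"]
            p["data"] -= self.lr * v
```

Storing one velocity buffer per parameter is what makes the optimizer stateful: unlike plain SGD, each `step()` depends on the history of previous gradients.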
In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
In [2]:
import numpy as np
from sorix import tensor
from sorix.optim import SGDMomentum
import sorix
In [3]:
# Same minimizing problem as SGD example: f(x, y) = x^2 + 10*y^2
# Notice how momentum accelerates the convergence despite the flat landscape in x
x = tensor([5.0], requires_grad=True)
y = tensor([5.0], requires_grad=True)
optimizer = SGDMomentum([x, y], lr=0.01, momentum=0.9)
for epoch in range(10):
    loss = x * x + tensor([10.0]) * y * y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}: x = {x.data[0]:.4f}, y = {y.data[0]:.4f}, loss = {loss.data[0]:.4f}")
Epoch 1: x = 4.9000, y = 4.0000, loss = 275.0000
Epoch 2: x = 4.7120, y = 2.3000, loss = 184.0100
Epoch 3: x = 4.4486, y = 0.3100, loss = 75.1030
Epoch 4: x = 4.1225, y = -1.5430, loss = 20.7507
Epoch 5: x = 3.7466, y = -2.9021, loss = 40.8034
Epoch 6: x = 3.3333, y = -3.5449, loss = 98.2587
Epoch 7: x = 2.8947, y = -3.4144, loss = 136.7721
Epoch 8: x = 2.4421, y = -2.6141, loss = 124.9600
Epoch 9: x = 1.9859, y = -1.3710, loss = 74.2980
Epoch 10: x = 1.5356, y = 0.0220, loss = 22.7398