SGDMomentum
SGD with Momentum is an enhancement of standard SGD that helps it navigate high-curvature regions by accumulating information from past gradients in a velocity vector. This damps oscillations along steep directions and accelerates progress along flat, consistent ones, speeding up optimization.
Mathematical definition
Let $\theta$ represent the parameters and $\nabla \mathcal{L}(\theta_t)$ the gradient at time $t$. SGDMomentum maintains a velocity vector $v_t$:
$$ v_{t} = \mu \cdot v_{t-1} + \nabla \mathcal{L}(\theta_t) $$ $$ \theta_{t+1} = \theta_t - \eta \cdot v_t $$
where:
- $v_t$: Accumulated velocity at time $t$.
- $\mu$: Momentum coefficient (typically 0.9).
- $\eta$: Learning rate (the `lr` argument).
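The two update equations above can be traced by hand. The following sketch (plain Python, not the Sorix implementation) applies them to a single scalar parameter with $\mathcal{L}(\theta) = \theta^2$, so $\nabla \mathcal{L}(\theta) = 2\theta$:

```python
# Illustrative trace of the momentum update rule above
# (a hand-rolled sketch, not the Sorix optimizer).
mu, eta = 0.9, 0.01    # momentum coefficient and learning rate
theta, v = 5.0, 0.0    # initial parameter and velocity

for t in range(3):
    grad = 2 * theta           # gradient of L(theta) = theta^2
    v = mu * v + grad          # v_t = mu * v_{t-1} + grad
    theta = theta - eta * v    # theta_{t+1} = theta_t - eta * v_t
    print(f"step {t+1}: v = {v:.4f}, theta = {theta:.4f}")
# step 1 gives v = 10.0000, theta = 4.9000; the velocity keeps growing
# while the gradient points the same way, so later steps are larger.
```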
Implementation details
In Sorix, the SGDMomentum optimizer keeps track of the velocity vectors in a list (vts). These vectors are stored on the same device as the parameters, ensuring consistency across CPU and GPU setups.
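To make the bookkeeping concrete, here is a minimal sketch of how an optimizer of this shape could be structured. It uses plain dicts of NumPy arrays in place of Sorix tensors, and the names (`params`, `vts`, `zero_grad`, `step`) mirror the description above but are assumptions, not the actual Sorix source:

```python
import numpy as np

class SGDMomentumSketch:
    """Hypothetical sketch of an SGD-with-momentum optimizer.

    Parameters are dicts with "data" and "grad" NumPy arrays; the real
    Sorix optimizer operates on its own tensor type instead.
    """

    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = params
        self.lr = lr
        self.momentum = momentum
        # One velocity buffer per parameter, allocated with the same
        # shape (and, in a real implementation, on the same device).
        self.vts = [np.zeros_like(p["data"]) for p in params]

    def zero_grad(self):
        # Reset accumulated gradients before the next backward pass.
        for p in self.params:
            p["grad"] = np.zeros_like(p["data"])

    def step(self):
        # v_t = mu * v_{t-1} + grad;  theta_{t+1} = theta_t - lr * v_t
        for p, v in zip(self.params, self.vts):
            v *= self.momentum
            v += p["grad"]
            p["data"] -= self.lr * v
```

Storing one velocity buffer per parameter is what makes the optimizer stateful: unlike plain SGD, each `step()` depends on the history of previous gradients.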
In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
In [2]:
import numpy as np
from sorix import tensor
from sorix.optim import SGDMomentum
import sorix
In [3]:
# Same minimizing problem as SGD example: f(x, y) = x^2 + 10*y^2
# Notice how momentum accelerates the convergence despite the flat landscape in x
x = tensor([5.0], requires_grad=True)
y = tensor([5.0], requires_grad=True)
optimizer = SGDMomentum([x, y], lr=0.01, momentum=0.9)
for epoch in range(10):
    loss = x * x + tensor([10.0]) * y * y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}: x = {x.data[0]:.4f}, y = {y.data[0]:.4f}, loss = {loss.data[0]:.4f}")
Epoch 1: x = 4.9000, y = 4.0000, loss = 275.0000
Epoch 2: x = 4.7120, y = 2.3000, loss = 184.0100
Epoch 3: x = 4.4486, y = 0.3100, loss = 75.1030
Epoch 4: x = 4.1225, y = -1.5430, loss = 20.7507
Epoch 5: x = 3.7466, y = -2.9021, loss = 40.8034
Epoch 6: x = 3.3333, y = -3.5449, loss = 98.2587
Epoch 7: x = 2.8947, y = -3.4144, loss = 136.7721
Epoch 8: x = 2.4421, y = -2.6141, loss = 124.9600
Epoch 9: x = 1.9859, y = -1.3710, loss = 74.2980
Epoch 10: x = 1.5356, y = 0.0220, loss = 22.7398