SGD¶
Stochastic Gradient Descent (SGD) is a fundamental optimization algorithm in machine learning. It updates the model parameters by taking a step in the direction of the negative gradient of the loss function; the "stochastic" part refers to estimating that gradient from a randomly sampled mini-batch rather than the full dataset.
Mathematical definition¶
Let $\theta$ represent the parameters of the model and $\mathcal{L}$ the loss function. The update rule for SGD is defined as:
$$ \theta_{t+1} = \theta_t - \eta \cdot \nabla \mathcal{L}(\theta_t) $$
where:
- $\theta_t$: Parameters at time $t$
- $\eta$: Learning rate (`lr`), a positive scalar that determines the step size.
- $\nabla \mathcal{L}(\theta_t)$: Gradient of the loss with respect to the parameters at time $t$.
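As a concrete illustration of the update rule, here is a minimal plain-NumPy sketch (the helper name `sgd_step` is hypothetical, not part of Sorix):

```python
import numpy as np

def sgd_step(theta, lr, grad):
    # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return theta - lr * grad

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
for _ in range(100):
    theta = sgd_step(theta, lr=0.1, grad=2 * theta)
print(theta)  # very close to 0
```

Each step multiplies `theta` by $(1 - 2\eta)$, so with $\eta = 0.1$ the parameter shrinks geometrically toward the minimum at 0.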
Implementation details¶
In Sorix, the SGD optimizer iterates through the parameters and updates their data attribute using the calculated grad. This operation is performed in-place and handles both CPU and GPU tensors automatically.
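The description above can be sketched as a minimal optimizer class. This is a hypothetical illustration of the pattern (parameter list, `zero_grad`, in-place `step`), not Sorix's actual source:

```python
import numpy as np

class Param:
    """Tiny stand-in for a tensor: holds .data and .grad arrays."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)

class MiniSGD:
    """Minimal SGD sketch: iterates the parameter list and updates
    each parameter's .data in-place from its .grad."""
    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def zero_grad(self):
        # Reset accumulated gradients before the next backward pass.
        for p in self.params:
            p.grad = np.zeros_like(p.data)

    def step(self):
        # theta <- theta - lr * grad, applied in-place to each parameter.
        for p in self.params:
            p.data -= self.lr * p.grad

# One hand-computed step on f(theta) = theta^2 (gradient 2 * theta):
p = Param([5.0])
opt = MiniSGD([p], lr=0.1)
p.grad = 2 * p.data
opt.step()
print(p.data)  # [4.]
```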
In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
In [2]:
import numpy as np
from sorix import tensor
from sorix.optim import SGD
import sorix
In [3]:
# Simple optimization example: minimize the anisotropic parabola f(x, y) = x^2 + 10*y^2
# This surface challenges SGD: at larger learning rates the update overshoots
# along the steeper y-direction and oscillates around the minimum
x = tensor([5.0], requires_grad=True)
y = tensor([5.0], requires_grad=True)
optimizer = SGD([x, y], lr=0.01)

for epoch in range(10):
    # compute loss: f(x, y) = x^2 + 10*y^2
    loss = x * x + tensor([10.0]) * y * y

    optimizer.zero_grad()  # clear stale gradients
    loss.backward()        # backpropagate to fill x.grad and y.grad
    optimizer.step()       # in-place update of each parameter's data

    print(f"Epoch {epoch+1}: x = {x.data[0]:.4f}, y = {y.data[0]:.4f}, loss = {loss.data[0]:.4f}")
Epoch 1: x = 4.9000, y = 4.0000, loss = 275.0000
Epoch 2: x = 4.8020, y = 3.2000, loss = 184.0100
Epoch 3: x = 4.7060, y = 2.5600, loss = 125.4592
Epoch 4: x = 4.6118, y = 2.0480, loss = 87.6821
Epoch 5: x = 4.5196, y = 1.6384, loss = 63.2121
Epoch 6: x = 4.4292, y = 1.3107, loss = 47.2704
Epoch 7: x = 4.3406, y = 1.0486, loss = 36.7978
Epoch 8: x = 4.2538, y = 0.8389, loss = 29.8362
Epoch 9: x = 4.1687, y = 0.6711, loss = 25.1318
Epoch 10: x = 4.0854, y = 0.5369, loss = 21.8820
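With `lr=0.01` both coordinates simply contract, but the oscillation mentioned in the code comment can be seen directly by iterating the y-update by hand. A plain-Python sketch, independent of Sorix: for f(x, y) = x² + 10·y², the y-update is y ← y − lr·20·y = (1 − 20·lr)·y, so any lr above 0.05 makes the factor negative.

```python
# With lr = 0.09 the factor (1 - 20 * lr) is -0.8: y overshoots the
# minimum and flips sign every step while still shrinking toward 0.
# Beyond lr = 0.1 the factor's magnitude exceeds 1 and y diverges.
lr = 0.09
y = 5.0
history = []
for step in range(5):
    y -= lr * 20 * y
    history.append(y)
    print(f"step {step+1}: y = {y:.4f}")
```

This prints y = -4.0000, 3.2000, -2.5600, 2.0480, -1.6384: converging in magnitude, but oscillating in sign.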