RMSprop¶
RMSprop adaptively adjusts the learning rate for each parameter: it divides the global learning rate by an exponentially decaying average of squared gradients. Because the average decays rather than accumulating indefinitely (as AdaGrad's sum does), the effective learning rate does not shrink toward zero too quickly.
Mathematical definition¶
The update rules for RMSprop are:
$$ v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot (\nabla \mathcal{L}(\theta_t))^2 $$ $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot \nabla \mathcal{L}(\theta_t) $$
where:
- $v_t$: Moving average of the squared gradients at time $t$.
- $\rho$: Decay rate (often 0.9).
- $\epsilon$: Small constant for numerical stability.
- $\eta$: Learning rate ($lr$).
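Before turning to the sorix API below, the two update rules can be sketched directly in NumPy. This is a minimal illustration of the math, not sorix's actual implementation; the function name `rmsprop_step` and the hyperparameter values are chosen here for the example:

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.1, rho=0.9, eps=1e-8):
    """One RMSprop update. v is the decaying average of squared gradients."""
    v = rho * v + (1 - rho) * grad**2           # v_t = rho*v_{t-1} + (1-rho)*g^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 5.0
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(50):
    grad = 2 * theta
    theta, v = rmsprop_step(theta, grad, v)
print(theta)  # approaches 0
```

Note that the step size is roughly `lr * g / sqrt(avg(g^2))`, so its magnitude stays on the order of `lr` regardless of how large or small the raw gradient is.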
Implementation details¶
Sorix's RMSprop keeps a running average of the squared gradients for each parameter in its vts list. This adaptive method is particularly useful for recurrent neural networks and for non-stationary objectives.
In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
In [2]:
from sorix import tensor
from sorix.optim import RMSprop
In [3]:
# Minimize an anisotropic function: f(x, y) = x^2 + 10*y^2
# RMSprop normalizes the update using the moving average of squared gradients,
# effectively equalizing the step sizes across parameters with different gradient magnitudes.
x = tensor([5.0], requires_grad=True)
y = tensor([5.0], requires_grad=True)
optimizer = RMSprop([x, y], lr=0.1)
for epoch in range(10):
    loss = x * x + tensor([10.0]) * y * y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}: x = {x.data[0]:.4f}, y = {y.data[0]:.4f}, loss = {loss.data[0]:.4f}")
Epoch 1: x = 4.0000, y = 4.0000, loss = 275.0000
Epoch 2: x = 3.3734, y = 3.3734, loss = 176.0000
Epoch 3: x = 2.9043, y = 2.9043, loss = 125.1775
Epoch 4: x = 2.5283, y = 2.5283, loss = 92.7866
Epoch 5: x = 2.2157, y = 2.2157, loss = 70.3128
Epoch 6: x = 1.9503, y = 1.9503, loss = 54.0031
Epoch 7: x = 1.7217, y = 1.7217, loss = 41.8402
Epoch 8: x = 1.5230, y = 1.5230, loss = 32.6073
Epoch 9: x = 1.3489, y = 1.3489, loss = 25.5132
Epoch 10: x = 1.1959, y = 1.1959, loss = 20.0162
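Notice that x and y track each other exactly, even though $\partial f / \partial y$ is ten times $\partial f / \partial x$ at the start. That is the normalization at work: with $v_0 = 0$, the first step is $\eta \cdot g / \sqrt{(1-\rho) g^2} = \eta / \sqrt{1-\rho}$, independent of the gradient magnitude $g$. A quick NumPy cross-check using generic hyperparameter values (not necessarily sorix's defaults, so the step size differs from the run above, but the equality holds for any $\rho$):

```python
import numpy as np

lr, rho, eps = 0.1, 0.9, 1e-8   # generic values; sorix's defaults may differ

steps = []
for g in (10.0, 100.0):          # |df/dx| = 2*5 = 10, |df/dy| = 20*5 = 100
    v = (1 - rho) * g**2         # v_1, starting from v_0 = 0
    steps.append(lr * g / (np.sqrt(v) + eps))

print(steps)  # both ~ lr / sqrt(1 - rho) = 0.3162..., equal despite the 10x gradient gap
```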