RMSprop¶
RMSprop adaptively adjusts the learning rate for each parameter: it divides the global learning rate by an exponentially decaying average of squared gradients. Because the average decays rather than accumulating indefinitely (as AdaGrad's sum does), the effective learning rate does not shrink toward zero too quickly.
Mathematical definition¶
The update rules for RMSprop are:
$$ v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot (\nabla \mathcal{L}(\theta_t))^2 $$ $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot \nabla \mathcal{L}(\theta_t) $$
where:
- $v_t$: Moving average of the squared gradients at time $t$.
- $\rho$: Decay rate (often 0.9).
- $\epsilon$: Small constant for numerical stability.
- $\eta$: Learning rate ($lr$).
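Before turning to the sorix API below, the two update rules can be sketched directly in NumPy. This is a minimal illustration of the math, not sorix's actual implementation; the function name `rmsprop_step` and the hyperparameter values are chosen here for the example:

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.1, rho=0.9, eps=1e-8):
    """One RMSprop update. v is the decaying average of squared gradients."""
    v = rho * v + (1 - rho) * grad**2           # v_t = rho*v_{t-1} + (1-rho)*g^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 5.0
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(50):
    grad = 2 * theta
    theta, v = rmsprop_step(theta, grad, v)
print(theta)  # approaches 0
```

Note that the step size is roughly `lr * g / sqrt(avg(g^2))`, so its magnitude stays on the order of `lr` regardless of how large or small the raw gradient is.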
Implementation details¶
Sorix's RMSprop keeps a running average of the squared gradients for each parameter in its vts list. This adaptive method is particularly useful for recurrent neural networks and for non-stationary objectives.
In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
In [2]:
from sorix import tensor
from sorix.optim import RMSprop
In [3]:
# Minimize an anisotropic function: f(x, y) = x^2 + 10*y^2
# RMSprop normalizes the update using the moving average of squared gradients,
# effectively equalizing the step sizes across parameters with different gradient magnitudes.
x = tensor([5.0], requires_grad=True)
y = tensor([5.0], requires_grad=True)
optimizer = RMSprop([x, y], lr=0.1)
for epoch in range(10):
    loss = x * x + tensor([10.0]) * y * y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}: x = {x.data[0]:.4f}, y = {y.data[0]:.4f}, loss = {loss.data[0]:.4f}")
Epoch 1: x = 4.0000, y = 4.0000, loss = 275.0000
Epoch 2: x = 3.3734, y = 3.3734, loss = 176.0000
Epoch 3: x = 2.9043, y = 2.9043, loss = 125.1775
Epoch 4: x = 2.5283, y = 2.5283, loss = 92.7866
Epoch 5: x = 2.2157, y = 2.2157, loss = 70.3128
Epoch 6: x = 1.9503, y = 1.9503, loss = 54.0031
Epoch 7: x = 1.7217, y = 1.7217, loss = 41.8402
Epoch 8: x = 1.5230, y = 1.5230, loss = 32.6073
Epoch 9: x = 1.3489, y = 1.3489, loss = 25.5132
Epoch 10: x = 1.1959, y = 1.1959, loss = 20.0162
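Notice that x and y track each other exactly, even though $\partial f / \partial y$ is ten times $\partial f / \partial x$ at the start. That is the normalization at work: with $v_0 = 0$, the first step is $\eta \cdot g / \sqrt{(1-\rho) g^2} = \eta / \sqrt{1-\rho}$, independent of the gradient magnitude $g$. A quick NumPy cross-check using generic hyperparameter values (not necessarily sorix's defaults, so the step size differs from the run above, but the equality holds for any $\rho$):

```python
import numpy as np

lr, rho, eps = 0.1, 0.9, 1e-8   # generic values; sorix's defaults may differ

steps = []
for g in (10.0, 100.0):          # |df/dx| = 2*5 = 10, |df/dy| = 20*5 = 100
    v = (1 - rho) * g**2         # v_1, starting from v_0 = 0
    steps.append(lr * g / (np.sqrt(v) + eps))

print(steps)  # both ~ lr / sqrt(1 - rho) = 0.3162..., equal despite the 10x gradient gap
```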