Optimizers #
Optimizers for TorchLean runtime training.
This file implements the core math of common gradient-based optimizers as pure functions on
typed tensors Tensor α s.
Why “pure functions”?
In PyTorch, optimizers mutate parameters in-place and keep state in Python objects. In TorchLean, we want the update rule itself to be explicit and easy to reuse:
- eager demos can call the update directly,
- the runtime training engine can store state in maps keyed by parameter ids,
- and proofs can refer to the same update equations.
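To make the calling convention concrete, here is a minimal sketch of the pure-step shape, specialized to plain types (the name `StepFn` and the simplified types are illustrative only, not the library's actual signatures):

```lean
/-- Illustrative sketch of a pure per-tensor optimizer step:
`(state, param, grad) ↦ (state', param')`. The library's real `step`
functions work on `Spec.Tensor α s` rather than bare type parameters. -/
def StepFn (State Param Grad : Type) : Type :=
  State → Param → Grad → State × Param
```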
The intent is to mimic the standard textbook formulas closely. We do not try to reproduce every
implementation detail of torch.optim.* (e.g. foreach kernels, fused updates, or every optional
flag); those live at a different layer than the math we specify here.
Where this file sits in the stack:
- this file owns the scalar-polymorphic, per-tensor update equations;
- `NN.Runtime.Autograd.TorchLean.Optim` lifts those equations to runtime parameter lists; and
- `NN.API.Runtime` exposes ergonomic `optim.sgd`, `optim.adam`, and related configuration helpers.
That separation is deliberate: the formula appears once, while runtime adapters and public API configuration can evolve independently around it.
Why each optimizer has its own State structure:
- Lean structures do not inherit from one another the way Python classes do.
- More importantly, optimizer state is not uniform: SGD stores only `lr`, momentum SGD stores a buffer, Adam/AdamW store two moment buffers and a step counter, Adadelta stores gradient/update EMAs, and Muon/GaLore carry backend functions.
- Keeping these as separate typed states makes impossible states unrepresentable. For example, an SGD state cannot accidentally contain a stale Adam `v` buffer, and AdamW cannot forget its decoupled `weight_decay` coefficient.
The generic abstraction lives one layer up:
- `Runtime.Autograd.TorchLean.Optim.Optimizer` packages `init`/`step` for shape-indexed parameter lists, like a typed analogue of a PyTorch optimizer object.
- `Runtime.Autograd.Train.OptimizerState` handles dynamic parameter groups and checkpoint-style maps for the training-loop API.
So this file intentionally favors small canonical state records over an inheritance hierarchy.
References (original algorithms / common variants):
- AdaGrad (Duchi–Hazan–Singer, 2011): https://jmlr.org/papers/v12/duchi11a.html
- RMSProp (Hinton lecture notes; widely used variant): https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Adam (Kingma–Ba, 2015): https://arxiv.org/abs/1412.6980
- AdamW / decoupled weight decay (Loshchilov–Hutter, 2019): https://arxiv.org/abs/1711.05101
- Adadelta (Zeiler, 2012): https://arxiv.org/abs/1212.5701
- SGD + momentum in deep learning practice (Sutskever et al., 2013): https://arxiv.org/abs/1301.4083
- GaLore / low-rank gradient projection (Zhao et al., 2024): https://arxiv.org/abs/2403.03507
- Muon-style momentum with orthogonalized matrix updates (Jordan et al., 2024): https://kellerjordan.github.io/posts/muon/
PyTorch references (for API/parameter naming):
- `torch.optim` overview: https://pytorch.org/docs/stable/optim.html
- `torch.optim.SGD`: https://pytorch.org/docs/stable/generated/torch.optim.SGD.html
- `torch.optim.Adagrad`: https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html
- `torch.optim.RMSprop`: https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html
- `torch.optim.Adam`: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
- `torch.optim.AdamW`: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
- `torch.optim.Adadelta`: https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html
Scalar exponentiation starts at 1.
Successor case for scalar exponentiation.
Shared utilities #
Momentum-style buffer update μ * buf + g.
Elementwise “adaptive learning rate” tensor lr / (sqrt(denom) + ε).
This is shared by AdaGrad/RMSProp/Adam-style optimizers.
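A minimal scalar sketch of this helper (the name `adaptiveLR` is illustrative; the library's definition is elementwise over tensors):

```lean
-- Scalar sketch of the shared adaptive-learning-rate helper:
-- lr / (sqrt(denom) + ε). The real version applies this elementwise.
def adaptiveLR (lr eps denom : Float) : Float :=
  lr / (Float.sqrt denom + eps)
```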
SGD #
SGD state (per parameter tensor).
We only store the learning rate here.
- lr : α
Learning rate.
Initialize SGD state.
The parameter tensor is unused; we keep it in the signature so optimizers share the same “init from parameters” calling convention.
SGD initialization records exactly the requested learning rate.
One SGD step: p ← p - lr * g.
PyTorch analogy: the core of torch.optim.SGD without momentum/weight-decay extras.
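A scalar sketch with a tiny usage check (illustrative name, `Float` instead of tensors):

```lean
-- Scalar sketch of the SGD update p ← p - lr * g.
def sgdStep (lr p g : Float) : Float :=
  p - lr * g

#eval sgdStep 0.1 1.0 0.5  -- ≈ 0.95
```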
Momentum SGD #
Momentum SGD state (per parameter tensor).
We store a momentum buffer buf and a momentum coefficient μ.
Update rule:
- buf ← μ * buf + g
- p ← p - lr * buf
This matches PyTorch's SGD momentum behavior when dampening = 0 and nesterov = false.
- lr : α
Learning rate.
- momentum : α
Momentum coefficient `μ`.
- buf : Spec.Tensor α s
Momentum buffer `buf`.
Initialize momentum SGD with a zero buffer.
Momentum-SGD starts with a zero momentum buffer.
One momentum-SGD step (returns updated state and parameters).
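A scalar sketch of the two-step rule above (illustrative names, `Float` instead of tensors):

```lean
-- Scalar sketch: buf ← μ * buf + g, then p ← p - lr * buf.
structure MomentumState where
  lr momentum buf : Float

def momentumSgdStep (st : MomentumState) (p g : Float) :
    MomentumState × Float :=
  let buf' := st.momentum * st.buf + g
  ({ st with buf := buf' }, p - st.lr * buf')
```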
AdaGrad #
AdaGrad state (per parameter tensor).
We store an accumulator G of squared gradients (same shape as the parameters). The effective
step size is scaled by 1 / (sqrt(G) + ε).
- lr : α
Base learning rate.
- epsilon : α
Numerical stability constant `ε`.
- accumulator : Spec.Tensor α s
Accumulated squared gradients.
Initialize AdaGrad with zero accumulator.
AdaGrad starts with a zero squared-gradient accumulator.
One AdaGrad step (returns updated state and parameters).
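For reference, the elementwise equations implied by the description above:
- G ← G + g²
- p ← p - lr * g / (sqrt(G) + ε)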
RMSProp #
RMSProp state (per parameter tensor).
We store an EMA of squared gradients (accumulator), often called square_avg in PyTorch code.
- lr : α
Learning rate.
- decay : α
Decay coefficient for the EMA of `g²` (often called `alpha`).
- epsilon : α
Numerical stability constant `ε`.
- accumulator : Spec.Tensor α s
EMA of squared gradients.
Initialize RMSProp with zero accumulator.
RMSProp starts with a zero running average of squared gradients.
One RMSProp step (returns updated state and parameters).
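For reference, the elementwise equations, writing the `decay` field as α:
- v ← α v + (1-α) g²
- p ← p - lr * g / (sqrt(v) + ε)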
Adam #
Adam state (per parameter tensor).
We store first/second moment EMAs (m, v) and a step counter t used for bias correction.
- lr : α
Learning rate.
- beta1 : α
First moment decay `β₁`.
- beta2 : α
Second moment decay `β₂`.
- epsilon : α
Numerical stability constant `ε`.
- m : Spec.Tensor α s
First moment EMA.
- v : Spec.Tensor α s
Second moment EMA.
- t : ℕ
Step counter (used for bias correction).
Initialize Adam with m = 0, v = 0, and t = 0.
Adam starts at step 0.
Adam starts with a zero first-moment buffer.
Adam starts with a zero second-moment buffer.
One Adam step (returns updated state and parameters).
Equations (elementwise):
- m ← β₁ m + (1-β₁) g
- v ← β₂ v + (1-β₂) g²
- m̂ ← m / (1-β₁ᵗ)
- v̂ ← v / (1-β₂ᵗ)
- p ← p - lr * m̂ / (sqrt(v̂) + ε)
The ε placement matches Kingma and Ba: it is added after sqrt(v̂).
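The same equations as an executable scalar sketch (the names `AdamState`/`adamStep` are illustrative; the library's step works on `Spec.Tensor α s`):

```lean
structure AdamState where
  lr beta1 beta2 epsilon : Float
  m v : Float
  t : Nat

def adamStep (st : AdamState) (p g : Float) : AdamState × Float :=
  let t'   := st.t + 1
  let m'   := st.beta1 * st.m + (1 - st.beta1) * g
  let v'   := st.beta2 * st.v + (1 - st.beta2) * g * g
  let mhat := m' / (1 - Float.pow st.beta1 t'.toFloat)  -- bias correction
  let vhat := v' / (1 - Float.pow st.beta2 t'.toFloat)
  ({ st with m := m', v := v', t := t' },
   p - st.lr * mhat / (Float.sqrt vhat + st.epsilon))   -- ε after sqrt(v̂)
```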
Adam increments its step counter by one on every update.
AdamW #
AdamW state (per parameter tensor).
AdamW is “Adam + decoupled weight decay”. The key point is that weight decay is applied as a separate parameter decay term rather than being folded into the gradient that feeds the moments.
- lr : α
Learning rate.
- beta1 : α
First moment decay `β₁`.
- beta2 : α
Second moment decay `β₂`.
- epsilon : α
Numerical stability constant `ε`.
- weight_decay : α
Weight decay coefficient `wd`.
- m : Spec.Tensor α s
First moment EMA.
- v : Spec.Tensor α s
Second moment EMA.
- t : ℕ
Step counter (used for bias correction).
Initialize AdamW state for a parameter tensor (moments start at 0).
AdamW initialization records the requested decoupled weight-decay coefficient.
AdamW starts at step 0.
One AdamW step (returns updated state and parameters).
We implement the decoupled form from the AdamW paper:
- update Adam moments using the raw gradient `g`,
- apply weight decay directly to the parameters (`p ← p - lr * wd * p`),
- then apply the Adam update.
This is the same single-step ordering used by torch.optim.AdamW.
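A scalar sketch of that ordering, reusing the illustrative `AdamState`/`adamStep` from the Adam section above (`adamwStep` is likewise a hypothetical name):

```lean
-- Decoupled weight decay: shrink p directly, then apply the Adam step.
-- The moments are fed the raw gradient g, not a decayed one.
def adamwStep (wd : Float) (st : AdamState) (p g : Float) :
    AdamState × Float :=
  let pDecayed := p - st.lr * wd * p   -- p ← p - lr * wd * p
  adamStep st pDecayed g               -- moments use raw g
```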
AdamW increments its step counter by one on every update.
Adadelta #
Adadelta state (per parameter tensor).
We store two EMAs: an EMA of squared gradients (`v`) and an EMA of squared updates (`u`).
- lr : α
Learning rate (often set to `1` in some presentations; we keep it explicit).
- rho : α
Decay coefficient `ρ`.
- epsilon : α
Numerical stability constant `ε`.
- v : Spec.Tensor α s
EMA of squared gradients.
- u : Spec.Tensor α s
EMA of squared updates.
Initialize Adadelta state for a parameter tensor (EMAs start at 0).
Adadelta starts with a zero squared-gradient EMA.
Adadelta starts with a zero squared-update EMA.
One Adadelta step (returns updated state and parameters).
Elementwise equations:
- v ← ρ v + (1-ρ) g²
- Δ ← (sqrt(u + ε) / sqrt(v + ε)) * g
- u ← ρ u + (1-ρ) Δ²
- p ← p - lr * Δ
The ε placement is inside the RMS terms, matching Zeiler's Adadelta update.
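A scalar sketch of these equations (illustrative names, `Float` instead of tensors):

```lean
structure AdadeltaState where
  lr rho epsilon : Float
  v u : Float

def adadeltaStep (st : AdadeltaState) (p g : Float) :
    AdadeltaState × Float :=
  let v'    := st.rho * st.v + (1 - st.rho) * g * g
  -- ε sits inside both RMS terms, per Zeiler's update
  let delta := (Float.sqrt (st.u + st.epsilon)
                / Float.sqrt (v' + st.epsilon)) * g
  let u'    := st.rho * st.u + (1 - st.rho) * delta * delta
  ({ st with v := v', u := u' }, p - st.lr * delta)
```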
Projected / low-rank optimizers #
A shape-safe gradient projector.
GaLore-style training periodically builds a low-rank subspace for a large matrix parameter, projects the gradient into that subspace, runs a base optimizer there, and lifts the update back to the original parameter shape. This record is deliberately only the algebraic interface: the expensive policy that computes or refreshes the projector belongs to the runtime layer.
- project : Spec.Tensor α full → Spec.Tensor α low
Project a full gradient into the low-rank optimizer space.
- lift : Spec.Tensor α low → Spec.Tensor α full
Lift a low-rank update back to the full parameter shape.
Identity projector, useful for tests and for the theorem that projected SGD reduces to SGD.
GaLore-style projected SGD state for one tensor.
This is not a full GaLore implementation by itself: it specifies the update once a projector is available. A practical trainer still needs a refresh schedule and a way to build projectors for large matrix parameters.
- lr : α
Learning rate used after the gradient has been projected and lifted.
- projector : Projector α full low
Current gradient projector.
One projected-SGD update: p ← p - lr * lift(project(g)).
With the identity projector, projected SGD is exactly ordinary SGD.
This is the main sanity check for the GaLore extension point: adding a projection backend cannot silently change the base optimizer when the backend is the identity.
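A scalar model of the projector interface and this sanity check (all names here are illustrative; the library's record is shape-indexed over `full` and `low`):

```lean
-- Scalar stand-in for the shape-safe projector record.
structure Projector where
  project : Float → Float
  lift    : Float → Float

def Projector.identity : Projector := ⟨fun x => x, fun x => x⟩

-- One projected-SGD update: p ← p - lr * lift(project(g)).
def projectedSgdStep (lr : Float) (P : Projector) (p g : Float) : Float :=
  p - lr * P.lift (P.project g)

-- With the identity projector this is definitionally plain SGD.
theorem projectedSgd_identity (lr p g : Float) :
    projectedSgdStep lr Projector.identity p g = p - lr * g := rfl
```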
Muon-style orthogonalized momentum #
Orthogonalization backend for a matrix-shaped update.
Muon uses a momentum buffer and then replaces the raw momentum direction by an approximately orthogonalized update, commonly via Newton-Schulz iterations. TorchLean keeps this as an explicit backend so the pure update rule is testable before CUDA kernels are introduced.
- apply : Spec.Tensor α s → Spec.Tensor α s
Convert a momentum buffer into the direction used for the parameter update.
The identity orthogonalizer; with this backend Muon reduces to momentum SGD.
Per-parameter state for Muon-style momentum with an explicit orthogonalization backend.
- lr : α
Learning rate.
- momentum : α
Momentum coefficient.
- buf : Spec.Tensor α s
Momentum buffer.
- orthogonalizer : Orthogonalizer α s
Backend that turns the momentum buffer into the update direction.
Initialize Muon-style state with a zero momentum buffer.
One Muon-style update:
- update the momentum buffer,
- orthogonalize the buffer,
- subtract the scaled orthogonalized direction.
For actual Muon, use a matrix-shaped s and a Newton-Schulz orthogonalizer. The generic shape here
keeps the definition reusable for tests and for future batched matrix layouts.
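A scalar sketch of the three steps (illustrative names; real Muon uses a matrix-shaped tensor and a Newton-Schulz backend):

```lean
structure MuonState where
  lr momentum buf : Float
  orthogonalize : Float → Float  -- backend; stands in for Newton-Schulz

def muonStep (st : MuonState) (p g : Float) : MuonState × Float :=
  let buf' := st.momentum * st.buf + g        -- 1. update momentum buffer
  let dir  := st.orthogonalize buf'           -- 2. orthogonalize the buffer
  ({ st with buf := buf' }, p - st.lr * dir)  -- 3. subtract scaled direction
```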
With the identity orthogonalizer, Muon's parameter update is exactly momentum SGD's parameter update.
The state records are different because Muon carries an orthogonalizer backend, but the parameter
direction is the same when that backend is id.