Runtime Optimizer Equations
Small executable theorems about TorchLean's optimizer equations.
These are intentionally modest. They prove properties of the update rules that TorchLean actually executes, rather than broad convergence claims that would require assumptions about convexity, smoothness, stochastic gradients, and floating-point error. Larger optimization theory can build on these equations.
This is the tensor-facing layer: the statements are phrased over `Spec.Tensor` and the executable
operator dictionary `Spec.Context`. When we want ordinary algebraic simplification, such as proving
that a zero weight-decay AdamW update is the same parameter update as Adam, we specialize to ℝ,
where Mathlib provides the ring laws.
SGD's update is definitionally `p - lr * g`.
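A minimal scalar sketch over ℝ of what "definitionally" buys us; the definitions here are hypothetical stand-ins, not TorchLean's actual `Spec.Tensor`-level statement:

```lean
import Mathlib

-- Hypothetical scalar SGD step (illustrative name, not TorchLean's).
def sgdStep (lr p g : ℝ) : ℝ := p - lr * g

-- The equation is definitional, so `rfl` closes the goal.
example (lr p g : ℝ) : sgdStep lr p g = p - lr * g := rfl
```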
Momentum SGD updates its buffer to `momentum * old_buffer + gradient`.
Momentum SGD updates parameters using the freshly updated buffer.
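The two momentum facts above can be sketched together over ℝ; the state type and step function below are hypothetical illustrations, assuming the classic buffer-then-parameter ordering:

```lean
import Mathlib

-- Hypothetical one-parameter momentum-SGD state (not TorchLean's definition).
structure MomState where
  buf : ℝ

def momStep (momentum lr : ℝ) (s : MomState) (p g : ℝ) : MomState × ℝ :=
  let buf' := momentum * s.buf + g   -- buffer update
  (⟨buf'⟩, p - lr * buf')            -- parameter uses the fresh buffer

-- Both documented equations hold definitionally.
example (m lr : ℝ) (s : MomState) (p g : ℝ) :
    (momStep m lr s p g).1.buf = m * s.buf + g := rfl

example (m lr : ℝ) (s : MomState) (p g : ℝ) :
    (momStep m lr s p g).2 = p - lr * (m * s.buf + g) := rfl
```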
Adam increments its bias-correction counter by exactly one every step.
AdamW increments its bias-correction counter by exactly one every step.
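The counter theorems for Adam and AdamW have the same shape; a hypothetical sketch (illustrative state record, not TorchLean's):

```lean
-- Hypothetical fragment of an Adam/AdamW state carrying the
-- bias-correction step counter.
structure AdamState where
  step : Nat

def adamTick (s : AdamState) : AdamState := { s with step := s.step + 1 }

-- The counter increases by exactly one per update, definitionally.
example (s : AdamState) : (adamTick s).step = s.step + 1 := rfl
```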
`Context α` gives us executable operators, but not algebraic laws like `x * 0 = 0`.
Optimizer-relationship theorems (AdamW → Adam, L2 vs weight decay, etc.) therefore live most
naturally over proof backends like ℝ, where the laws are available from Mathlib.
AdamW reduces to Adam when `weight_decay = 0` (parameter-update equality), over ℝ.
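A scalar sketch over ℝ of why this holds, assuming decoupled weight decay of the form `p - lr * wd * p`; the update definitions are hypothetical simplifications, not TorchLean's tensor-level statement:

```lean
import Mathlib

-- Hypothetical scalar updates; mhat/vhat are the bias-corrected moments.
def adamUpdate (lr mhat vhat eps p : ℝ) : ℝ :=
  p - lr * mhat / (Real.sqrt vhat + eps)

def adamwUpdate (lr wd mhat vhat eps p : ℝ) : ℝ :=
  p - lr * wd * p - lr * mhat / (Real.sqrt vhat + eps)

-- With weight_decay = 0, the decay term vanishes by the ring laws
-- (`mul_zero`, `zero_mul`, `sub_zero`) that ℝ provides and Context α does not.
theorem adamw_zero_wd_eq_adam (lr mhat vhat eps p : ℝ) :
    adamwUpdate lr 0 mhat vhat eps p = adamUpdate lr mhat vhat eps p := by
  simp [adamwUpdate, adamUpdate]
```

This is exactly the point made above: the proof needs `x * 0 = 0`, which is a ring law of ℝ rather than something the executable operator dictionary supplies.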
Adadelta's gradient accumulator follows the documented EMA equation.
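A scalar sketch of the documented EMA equation, assuming the standard Adadelta form `acc' = ρ * acc + (1 - ρ) * g²`; the definition is a hypothetical stand-in:

```lean
import Mathlib

-- Hypothetical scalar Adadelta gradient accumulator (rho is the decay rate).
def adadeltaAcc (rho acc g : ℝ) : ℝ := rho * acc + (1 - rho) * g ^ 2

-- The EMA equation holds definitionally.
example (rho acc g : ℝ) :
    adadeltaAcc rho acc g = rho * acc + (1 - rho) * g ^ 2 := rfl
```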