TorchLean API

NN.MLTheory.Optimization.FirstOrder

Runtime Optimizer Equations

Small executable theorems about TorchLean's optimizer equations.

These are intentionally modest. They prove properties of the update rules that TorchLean actually executes, rather than broad convergence claims that would require assumptions about convexity, smoothness, stochastic gradients, and floating-point error. Larger optimization theory can build on these equations.

This is the tensor-facing layer: the statements are phrased over Spec.Tensor and the executable operator dictionary Spec.Context. When we want ordinary algebraic simplification, such as proving that an AdamW update with zero weight decay performs the same parameter update as Adam, we specialize to ℝ, where Mathlib provides the ring laws.
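
As a quick illustration (a hedged sketch, not a stated part of the library's API: it assumes a Context ℝ instance and the decidability instance for > on ℝ are available, and that the fully qualified SGD names below resolve as written), specializing is just instantiating α:

-- Hypothetical setup: `Context ℝ` and `DecidableRel (· > ·)` on ℝ are taken
-- here as local binders rather than resolved from TorchLean's own instances.
example {s : Spec.Shape} [Context ℝ]
    [DecidableRel fun (x1 x2 : ℝ) => x1 > x2]
    (state : Optim.SGD.State ℝ s) (params grads : Spec.Tensor ℝ s) :
    Optim.SGD.update state params grads
      = params.subSpec (grads.scaleSpec state.lr) :=
  Optim.SGD.update_eq state params grads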

theorem Optim.SGD.update_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
update state params grads = params.subSpec (grads.scaleSpec state.lr)

SGD is definitionally p - lr * g.
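
A usage sketch (assuming only the signature printed above): the equation lets downstream proofs swap the opaque update for its algebraic form.

-- Transport any property of `p - lr * g` to the SGD update via one rewrite.
example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.SGD.State α s) (params grads : Spec.Tensor α s)
    (P : Spec.Tensor α s → Prop)
    (h : P (params.subSpec (grads.scaleSpec state.lr))) :
    P (Optim.SGD.update state params grads) := by
  rw [Optim.SGD.update_eq]
  exact h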

theorem Optim.MomentumSGD.update_buffer_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.buf = (state.buf.scaleSpec state.momentum).addSpec grads

Momentum SGD updates its buffer to momentum * old_buffer + gradient.

theorem Optim.MomentumSGD.update_params_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).2 = params.subSpec (((state.buf.scaleSpec state.momentum).addSpec grads).scaleSpec state.lr)

Momentum SGD updates parameters using the freshly updated buffer.
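
Together, the two lemmas make "freshly updated buffer" literal (a hedged sketch over the printed signatures): the new parameters are the old ones minus lr times the new buffer.

example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.MomentumSGD.State α s)
    (params grads : Spec.Tensor α s) :
    (Optim.MomentumSGD.update state params grads).2
      = params.subSpec
          ((Optim.MomentumSGD.update state params grads).1.buf.scaleSpec
            state.lr) := by
  -- Rewrite both sides down to the same explicit form.
  rw [Optim.MomentumSGD.update_params_eq, Optim.MomentumSGD.update_buffer_eq]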

theorem Optim.Adam.update_time_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.t = state.t + 1

Adam increments its bias-correction counter by exactly one every step.
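
Iterating the lemma gives step-count facts directly; for example, two updates advance the counter by two. A hedged sketch, assuming the counter is a Nat so the final equality closes definitionally:

example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.Adam.State α s)
    (params grads params' grads' : Spec.Tensor α s) :
    (Optim.Adam.update (Optim.Adam.update state params grads).1
        params' grads').1.t = state.t + 2 := by
  -- Each rewrite peels one step; `state.t + 1 + 1 = state.t + 2` is then
  -- `rfl` when `t` is a Nat (an assumption of this sketch).
  rw [Optim.Adam.update_time_eq, Optim.Adam.update_time_eq]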

theorem Optim.AdamW.update_time_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.t = state.t + 1

AdamW increments its bias-correction counter by exactly one every step.

Context α gives us executable operators, but not algebraic laws like x * 0 = 0.

Optimizer-relationship theorems (AdamW → Adam, L2 vs weight decay, etc.) therefore live most naturally over proof backends like ℝ, where the laws are available from Mathlib.
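
Concretely, once the entries are in ℝ these laws need nothing beyond Mathlib; for instance:

import Mathlib.Data.Real.Basic

-- The law Context α does not provide, available for free over ℝ.
example (x : ℝ) : x * 0 = 0 := mul_zero x
-- The kind of fact a weight-decay-zero argument needs.
example (lr g : ℝ) : lr * g - 0 = lr * g := sub_zero (lr * g)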

theorem Optim.AdamW.update_weight_decay_zero_params_eq_adam_real {s : Spec.Shape} (state : State ℝ s) (params grads : Spec.Tensor ℝ s) (hwd : state.weight_decay = 0) :
(update state params grads).2 = (Adam.update { lr := state.lr, beta1 := state.beta1, beta2 := state.beta2, epsilon := state.epsilon, m := state.m, v := state.v, t := state.t } params grads).2

AdamW reduces to Adam when weight_decay = 0 (parameter-update equality), over ℝ.
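
Applying it is direct (a hedged sketch mirroring the signature above, assuming the instances that update needs over ℝ are in scope); note the Adam state is just the AdamW state with the weight_decay field dropped:

example {s : Spec.Shape} (state : Optim.AdamW.State ℝ s)
    (params grads : Spec.Tensor ℝ s) (hwd : state.weight_decay = 0) :
    (Optim.AdamW.update state params grads).2
      = (Optim.Adam.update
          { lr := state.lr, beta1 := state.beta1, beta2 := state.beta2,
            epsilon := state.epsilon, m := state.m, v := state.v,
            t := state.t } params grads).2 :=
  Optim.AdamW.update_weight_decay_zero_params_eq_adam_real state params grads hwd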

theorem Optim.Adadelta.update_v_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.v = (state.v.scaleSpec state.rho).addSpec (grads.squareSpec.scaleSpec (1 - state.rho))

Adadelta's gradient accumulator follows the documented EMA equation.
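
In conventional notation this is the elementwise EMA v' = ρ·v + (1 − ρ)·g². A usage sketch, just the lemma invoked over the printed signature:

example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.Adadelta.State α s) (params grads : Spec.Tensor α s) :
    (Optim.Adadelta.update state params grads).1.v
      = (state.v.scaleSpec state.rho).addSpec
          (grads.squareSpec.scaleSpec (1 - state.rho)) :=
  Optim.Adadelta.update_v_eq state params grads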