TorchLean API

NN.MLTheory.Optimization.FirstOrder

Runtime Optimizer Equations

Small executable theorems about TorchLean's optimizer equations.

These are intentionally modest. They prove properties of the update rules that TorchLean actually executes, rather than broad convergence claims that would require assumptions about convexity, smoothness, stochastic gradients, and floating-point error. Larger optimization theory can build on these equations.

This is the tensor-facing layer: the statements are phrased over Spec.Tensor and the executable operator dictionary Spec.Context. When we want ordinary algebraic simplification, such as proving that an AdamW update with zero weight decay performs the same parameter update as Adam, we specialize to ℝ, where Mathlib provides the ring laws.
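
As a quick illustration (a hedged sketch, not a stated part of the library's API: it assumes a Context ℝ instance and the decidability instance for > on ℝ are available, and that the fully qualified SGD names below resolve as written), specializing is just instantiating α:

-- Hypothetical setup: `Context ℝ` and `DecidableRel (· > ·)` on ℝ are taken
-- here as local binders rather than resolved from TorchLean's own instances.
example {s : Spec.Shape} [Context ℝ]
    [DecidableRel fun (x1 x2 : ℝ) => x1 > x2]
    (state : Optim.SGD.State ℝ s) (params grads : Spec.Tensor ℝ s) :
    Optim.SGD.update state params grads
      = params.subSpec (grads.scaleSpec state.lr) :=
  Optim.SGD.update_eq state params grads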

theorem Optim.SGD.update_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
update state params grads = params.subSpec (grads.scaleSpec state.lr)

SGD is definitionally p - lr * g.
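
A usage sketch (assuming only the signature printed above): the equation lets downstream proofs swap the opaque update for its algebraic form.

-- Transport any property of `p - lr * g` to the SGD update via one rewrite.
example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.SGD.State α s) (params grads : Spec.Tensor α s)
    (P : Spec.Tensor α s → Prop)
    (h : P (params.subSpec (grads.scaleSpec state.lr))) :
    P (Optim.SGD.update state params grads) := by
  rw [Optim.SGD.update_eq]
  exact h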

theorem Optim.MomentumSGD.update_buffer_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.buf = (state.buf.scaleSpec state.momentum).addSpec grads

Momentum SGD updates its buffer to momentum * old_buffer + gradient.

theorem Optim.MomentumSGD.update_params_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).2 = params.subSpec (((state.buf.scaleSpec state.momentum).addSpec grads).scaleSpec state.lr)

Momentum SGD updates parameters using the freshly updated buffer.
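
Together, the two lemmas make "freshly updated buffer" literal (a hedged sketch over the printed signatures): the new parameters are the old ones minus lr times the new buffer.

example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.MomentumSGD.State α s)
    (params grads : Spec.Tensor α s) :
    (Optim.MomentumSGD.update state params grads).2
      = params.subSpec
          ((Optim.MomentumSGD.update state params grads).1.buf.scaleSpec
            state.lr) := by
  -- Rewrite both sides down to the same explicit form.
  rw [Optim.MomentumSGD.update_params_eq, Optim.MomentumSGD.update_buffer_eq]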

theorem Optim.Adam.update_time_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.t = state.t + 1

Adam increments its bias-correction counter by exactly one every step.
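
Iterating the lemma gives step-count facts directly; for example, two updates advance the counter by two. A hedged sketch, assuming the counter is a Nat so the final equality closes definitionally:

example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.Adam.State α s)
    (params grads params' grads' : Spec.Tensor α s) :
    (Optim.Adam.update (Optim.Adam.update state params grads).1
        params' grads').1.t = state.t + 2 := by
  -- Each rewrite peels one step; `state.t + 1 + 1 = state.t + 2` is then
  -- `rfl` when `t` is a Nat (an assumption of this sketch).
  rw [Optim.Adam.update_time_eq, Optim.Adam.update_time_eq]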

theorem Optim.AdamW.update_time_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.t = state.t + 1

AdamW increments its bias-correction counter by exactly one every step.

Context α gives us executable operators, but not algebraic laws like x * 0 = 0.

Optimizer-relationship theorems (AdamW → Adam, L2 vs weight decay, etc.) therefore live most naturally over proof backends like ℝ, where the laws are available from Mathlib.
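
Concretely, once the entries are in ℝ these laws need nothing beyond Mathlib; for instance:

import Mathlib.Data.Real.Basic

-- The law Context α does not provide, available for free over ℝ.
example (x : ℝ) : x * 0 = 0 := mul_zero x
-- The kind of fact a weight-decay-zero argument needs.
example (lr g : ℝ) : lr * g - 0 = lr * g := sub_zero (lr * g)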

theorem Optim.AdamW.update_weight_decay_zero_params_eq_adam_real {s : Spec.Shape} (state : State ℝ s) (params grads : Spec.Tensor ℝ s) (hwd : state.weight_decay = 0) :
(update state params grads).2 = (Adam.update { lr := state.lr, beta1 := state.beta1, beta2 := state.beta2, epsilon := state.epsilon, m := state.m, v := state.v, t := state.t } params grads).2

AdamW reduces to Adam when weight_decay = 0 (parameter-update equality), over ℝ.
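
Applying it is direct (a hedged sketch mirroring the signature above, assuming the instances that update needs over ℝ are in scope); note the Adam state is just the AdamW state with the weight_decay field dropped:

example {s : Spec.Shape} (state : Optim.AdamW.State ℝ s)
    (params grads : Spec.Tensor ℝ s) (hwd : state.weight_decay = 0) :
    (Optim.AdamW.update state params grads).2
      = (Optim.Adam.update
          { lr := state.lr, beta1 := state.beta1, beta2 := state.beta2,
            epsilon := state.epsilon, m := state.m, v := state.v,
            t := state.t } params grads).2 :=
  Optim.AdamW.update_weight_decay_zero_params_eq_adam_real state params grads hwd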

theorem Optim.Adadelta.update_v_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :
(update state params grads).1.v = (state.v.scaleSpec state.rho).addSpec (grads.squareSpec.scaleSpec (1 - state.rho))

Adadelta's gradient accumulator follows the documented EMA equation.
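
In conventional notation this is the elementwise EMA v' = ρ·v + (1 − ρ)·g². A usage sketch, just the lemma invoked over the printed signature:

example {α : Type} [Context α]
    [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape}
    (state : Optim.Adadelta.State α s) (params grads : Spec.Tensor α s) :
    (Optim.Adadelta.update state params grads).1.v
      = (state.v.scaleSpec state.rho).addSpec
          (grads.squareSpec.scaleSpec (1 - state.rho)) :=
  Optim.Adadelta.update_v_eq state params grads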