Low-rank and Orthogonalized Optimizer Facts #

This file records invariants for the optimizer extension points:

a GaLore-style projected gradient update reduces to SGD when the projector is identity;
a Muon-style orthogonalized momentum update reduces to momentum SGD when the orthogonalizer is identity.

That gives us a clean proof boundary: projector construction and matrix orthogonalization can be optimized later, while the surrounding update semantics already have a checked fallback case.

source

theorem Optim.GaLore.projectedSGD_identity_eq_sgd {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (lr : α) (params grads : Spec.Tensor α s) :

projectedSGDUpdate { lr := lr, projector := identityProjector } params grads = SGD.update { lr := lr } params grads

With the identity projector, projected SGD is exactly ordinary SGD.

source

theorem Optim.Muon.update_identity_params_eq_momentumSGD {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (lr momentum : α) (buf params grads : Spec.Tensor α s) :

(update { lr := lr, momentum := momentum, buf := buf, orthogonalizer := identityOrthogonalizer } params grads).2 = (MomentumSGD.update { lr := lr, momentum := momentum, buf := buf } params grads).2

With the identity orthogonalizer, Muon and momentum SGD use the same parameter update.

source

theorem Optim.Muon.update_identity_buffer_eq_momentumSGD {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (lr momentum : α) (buf params grads : Spec.Tensor α s) :

(update { lr := lr, momentum := momentum, buf := buf, orthogonalizer := identityOrthogonalizer } params grads).1.buf = (MomentumSGD.update { lr := lr, momentum := momentum, buf := buf } params grads).1.buf

With the identity orthogonalizer, Muon and momentum SGD store the same next buffer.

source

theorem Optim.Muon.update_buffer_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (state : State α s) (params grads : Spec.Tensor α s) :

(update state params grads).1.buf = (state.buf.scaleSpec state.momentum).addSpec grads

Expanded form of the Muon buffer update.

source

theorem Optim.Muon.update_identity_params_eq {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (lr momentum : α) (buf params grads : Spec.Tensor α s) :

(update { lr := lr, momentum := momentum, buf := buf, orthogonalizer := identityOrthogonalizer } params grads).2 = params.subSpec (((buf.scaleSpec momentum).addSpec grads).scaleSpec lr)

Expanded form of the identity-backend Muon parameter update.

TorchLean API

NN.MLTheory.Optimization.LowRank

Low-rank and Orthogonalized Optimizer Facts #