Gradient Descent Linear Convergence (Operator Form) #
This file contains reusable gradient-descent convergence theorems.
The main theorems are stated for an operator g : E → E on a real inner product space. This is the
right abstraction boundary for TorchLean:
If g is
- μ-strongly monotone (a.k.a. μ-strongly accretive), and
- L-Lipschitz,
then the fixed-point iteration
x_{k+1} = x_k - η g(x_k)
contracts distances to any root x⋆ of g, i.e. a point with g x⋆ = 0.
For gradients, the usual instantiation is g = ∇f. When f is μ-strongly convex and L-smooth,
one can prove (in calculus/convex-analysis land) that ∇f is μ-strongly monotone and L-Lipschitz.
This file focuses on the convergence argument itself, keeping the assumptions minimal and reusable.
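As orientation, here is a minimal sketch of that abstraction boundary, written with hypothetical names (`step`, `StronglyMonotone`, `LipschitzBound`); the file's actual declarations and statement forms may differ.

```lean
import Mathlib

open scoped RealInnerProductSpace

variable {E : Type*} [NormedAddCommGroup E] [InnerProductSpace ℝ E]

/-- One step of the fixed-point iteration `x ↦ x - η • g x` (hypothetical name). -/
def step (η : ℝ) (g : E → E) (x : E) : E := x - η • g x

/-- μ-strong monotonicity: `⟪g x - g y, x - y⟫ ≥ μ * ‖x - y‖²` (hypothetical name). -/
def StronglyMonotone (μ : ℝ) (g : E → E) : Prop :=
  ∀ x y : E, μ * ‖x - y‖ ^ 2 ≤ ⟪g x - g y, x - y⟫_ℝ

/-- L-Lipschitz bound: `‖g x - g y‖ ≤ L * ‖x - y‖` (hypothetical name). -/
def LipschitzBound (L : ℝ) (g : E → E) : Prop :=
  ∀ x y : E, ‖g x - g y‖ ≤ L * ‖x - y‖
```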
The final ScalarGD namespace keeps the one-dimensional quadratic facts as a compact sanity check:
they show the same contraction mechanism in the smallest possible setting and connect plain SGD,
L2 regularization, and decoupled weight decay algebraically.
The squared-distance contraction factor from step_norm_sq_le.
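For orientation, the usual form of this factor under the hypotheses above (with g x⋆ = 0) is sketched below with a hypothetical name; the exact packaging in step_norm_sq_le may differ.

```lean
import Mathlib

/-- The standard one-step contraction factor (hypothetical definition; the file's
own `q` may be stated differently). -/
def qFactor (η μ L : ℝ) : ℝ := 1 - 2 * η * μ + η ^ 2 * L ^ 2

-- Where it comes from, informally, using g x⋆ = 0:
--   ‖x - η • g x - x⋆‖² = ‖x - x⋆‖² - 2 * η * ⟪g x, x - x⋆⟫ + η² * ‖g x‖²
--                       ≤ (1 - 2*η*μ + η²*L²) * ‖x - x⋆‖²
-- by μ-strong monotonicity on the inner-product term and the L-Lipschitz bound on ‖g x‖.
```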
Iterated contraction bound in squared norm.
If q η μ L ≥ 0, then after k steps we have
‖(step η g)^[k] x - x⋆‖² ≤ (q η μ L)^k * ‖x - x⋆‖².
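The arithmetic behind iterating the one-step bound is the following elementary induction; a self-contained sketch, not the file's statement:

```lean
import Mathlib

-- If every step multiplies the squared distance by at most q ≥ 0,
-- then k steps multiply it by at most q ^ k.
example (q : ℝ) (hq : 0 ≤ q) (a : ℕ → ℝ) (ha : ∀ k, a (k + 1) ≤ q * a k) :
    ∀ k, a k ≤ q ^ k * a 0 := by
  intro k
  induction k with
  | zero => simp
  | succ n ih =>
    calc a (n + 1) ≤ q * a n := ha n
      _ ≤ q * (q ^ n * a 0) := mul_le_mul_of_nonneg_left ih hq
      _ = q ^ (n + 1) * a 0 := by ring
```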
Linear convergence phrased as an exponentially decaying upper bound.
This is just dist_sq_iterate_le_of_q_nonneg plus the assumption q < 1, which makes the RHS
actually shrink with k.
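A self-contained sketch of the "actually shrinking" part: with 0 ≤ q < 1 the right-hand side decreases at every step (and q ^ k tends to 0 in the limit).

```lean
import Mathlib

-- The upper bound q ^ k * C is monotonically decreasing in k when 0 ≤ q < 1.
example (q C : ℝ) (h₀ : 0 ≤ q) (h₁ : q < 1) (hC : 0 ≤ C) (k : ℕ) :
    q ^ (k + 1) * C ≤ q ^ k * C := by
  have hpow : q ^ (k + 1) ≤ q ^ k := by
    calc q ^ (k + 1) = q * q ^ k := by ring
      _ ≤ 1 * q ^ k := mul_le_mul_of_nonneg_right h₁.le (pow_nonneg h₀ k)
      _ = q ^ k := one_mul _
  exact mul_le_mul_of_nonneg_right hpow hC
```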
Scalar quadratic warm-up #
These facts are deliberately small but not merely definitional. They establish the algebraic behavior of gradient descent on the one-dimensional quadratic objective
L(x) = 1/2 * (x - target)^2,
whose gradient is x - target. This is the simplest executable bridge from TorchLean's optimizer
equations to familiar convergence facts; the Hilbert-space operator theorem above is the reusable
version for tensor/vector models.
Gradient of 1/2 * (x - target)^2.
One scalar gradient-descent step on the quadratic objective.
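With hypothetical names, the two definitions described above could look like the following sketch; the file's actual identifiers may differ.

```lean
import Mathlib

/-- The scalar quadratic objective 1/2 * (x - target)^2 (hypothetical name). -/
def quadLoss (target x : ℝ) : ℝ := (1 / 2) * (x - target) ^ 2

/-- Its gradient, x - target (hypothetical name). -/
def quadGrad (target x : ℝ) : ℝ := x - target

/-- One gradient-descent step with learning rate lr (hypothetical name). -/
def quadStep (lr target x : ℝ) : ℝ := x - lr * quadGrad target x
```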
One scalar quadratic gradient-descent step multiplies the current error by 1 - lr.
For ordered fields, this is the usual starting point for contraction proofs when 0 < lr < 2.
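Written without the file's definitions, the identity itself is a one-line ring computation:

```lean
import Mathlib

-- One step x ↦ x - lr * (x - target) maps the error x - target to (1 - lr) * (x - target).
example (lr target x : ℝ) :
    (x - lr * (x - target)) - target = (1 - lr) * (x - target) := by
  ring
```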
For plain SGD, L2 regularization and decoupled weight decay coincide at the update level.
This scalar statement is the common fact behind the regularization note: adding λ x to the
gradient produces the same update as multiplying parameters by (1 - lr*λ) and then taking the
plain gradient step. Adaptive optimizers need separate treatment; AdamW is checked in
Optimization.FirstOrder.
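The update-level identity is pure algebra; a self-contained sketch:

```lean
import Mathlib

-- Adding lam * x to the gradient g (L2 regularization) gives the same SGD update as
-- first scaling the parameter by (1 - lr * lam) and then taking the plain gradient
-- step (decoupled weight decay).
example (lr lam x g : ℝ) :
    x - lr * (g + lam * x) = (1 - lr * lam) * x - lr * g := by
  ring
```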
On the 1D quadratic, if 0 < lr < 2 then one GD step contracts the error in absolute value.
This is the scalar version of the operator-level contraction theorem above.
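A self-contained sketch of the scalar contraction, phrased directly in terms of the factor 1 - lr:

```lean
import Mathlib

-- For 0 < lr < 2 the factor 1 - lr lies strictly inside (-1, 1) ...
example (lr : ℝ) (h₀ : 0 < lr) (h₂ : lr < 2) : |1 - lr| < 1 := by
  rw [abs_lt]
  constructor <;> linarith

-- ... so the post-step error (1 - lr) * (x - target) is no larger in absolute value.
example (lr target x : ℝ) (h : |1 - lr| < 1) :
    |(1 - lr) * (x - target)| ≤ |x - target| := by
  rw [abs_mul]
  exact mul_le_of_le_one_left (abs_nonneg _) h.le
```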