Gradient Descent Linear Convergence (Operator Form) #
This file contains reusable gradient-descent convergence theorems.
The main theorems are stated for an operator g : E → E on a real inner product space. This is the
right abstraction boundary for TorchLean:
If g is
- μ-strongly monotone (a.k.a. μ-strongly accretive), and
- L-Lipschitz,
then the fixed-point iteration
x_{k+1} = x_k - η g(x_k)
contracts distances to any root x⋆ of g, i.e. a point with g x⋆ = 0.
For gradients, the usual instantiation is g = ∇f. When f is μ-strongly convex and L-smooth,
one can prove (in calculus/convex-analysis land) that ∇f is μ-strongly monotone and L-Lipschitz.
This file focuses on the convergence argument itself, keeping the assumptions minimal and reusable.
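As orientation, here is a minimal sketch of that abstraction boundary, written with hypothetical names (`step`, `StronglyMonotone`, `LipschitzBound`); the file's actual declarations and statement forms may differ.

```lean
import Mathlib

open scoped RealInnerProductSpace

variable {E : Type*} [NormedAddCommGroup E] [InnerProductSpace ℝ E]

/-- One step of the fixed-point iteration `x ↦ x - η • g x` (hypothetical name). -/
def step (η : ℝ) (g : E → E) (x : E) : E := x - η • g x

/-- μ-strong monotonicity: `⟪g x - g y, x - y⟫ ≥ μ * ‖x - y‖²` (hypothetical name). -/
def StronglyMonotone (μ : ℝ) (g : E → E) : Prop :=
  ∀ x y : E, μ * ‖x - y‖ ^ 2 ≤ ⟪g x - g y, x - y⟫_ℝ

/-- L-Lipschitz bound: `‖g x - g y‖ ≤ L * ‖x - y‖` (hypothetical name). -/
def LipschitzBound (L : ℝ) (g : E → E) : Prop :=
  ∀ x y : E, ‖g x - g y‖ ≤ L * ‖x - y‖
```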
The final ScalarGD namespace keeps the one-dimensional quadratic facts as a compact sanity check:
they show the same contraction mechanism in the smallest possible setting and connect plain SGD,
L2 regularization, and decoupled weight decay algebraically.
The squared-distance contraction factor from step_norm_sq_le.
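For orientation, the usual form of this factor under the hypotheses above (with g x⋆ = 0) is sketched below with a hypothetical name; the exact packaging in step_norm_sq_le may differ.

```lean
import Mathlib

/-- The standard one-step contraction factor (hypothetical definition; the file's
own `q` may be stated differently). -/
def qFactor (η μ L : ℝ) : ℝ := 1 - 2 * η * μ + η ^ 2 * L ^ 2

-- Where it comes from, informally, using g x⋆ = 0:
--   ‖x - η • g x - x⋆‖² = ‖x - x⋆‖² - 2 * η * ⟪g x, x - x⋆⟫ + η² * ‖g x‖²
--                       ≤ (1 - 2*η*μ + η²*L²) * ‖x - x⋆‖²
-- by μ-strong monotonicity on the inner-product term and the L-Lipschitz bound on ‖g x‖.
```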
Iterated contraction bound in squared norm.
If q η μ L ≥ 0, then after k steps we have
‖(step η g)^[k] x - x⋆‖² ≤ (q η μ L)^k * ‖x - x⋆‖².
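The arithmetic behind iterating the one-step bound is the following elementary induction; a self-contained sketch, not the file's statement:

```lean
import Mathlib

-- If every step multiplies the squared distance by at most q ≥ 0,
-- then k steps multiply it by at most q ^ k.
example (q : ℝ) (hq : 0 ≤ q) (a : ℕ → ℝ) (ha : ∀ k, a (k + 1) ≤ q * a k) :
    ∀ k, a k ≤ q ^ k * a 0 := by
  intro k
  induction k with
  | zero => simp
  | succ n ih =>
    calc a (n + 1) ≤ q * a n := ha n
      _ ≤ q * (q ^ n * a 0) := mul_le_mul_of_nonneg_left ih hq
      _ = q ^ (n + 1) * a 0 := by ring
```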
Linear convergence phrased as an exponentially decaying upper bound.
This is just dist_sq_iterate_le_of_q_nonneg plus the assumption q < 1, which makes the RHS
actually shrink with k.
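A self-contained sketch of the "actually shrinking" part: with 0 ≤ q < 1 the right-hand side decreases at every step (and q ^ k tends to 0 in the limit).

```lean
import Mathlib

-- The upper bound q ^ k * C is monotonically decreasing in k when 0 ≤ q < 1.
example (q C : ℝ) (h₀ : 0 ≤ q) (h₁ : q < 1) (hC : 0 ≤ C) (k : ℕ) :
    q ^ (k + 1) * C ≤ q ^ k * C := by
  have hpow : q ^ (k + 1) ≤ q ^ k := by
    calc q ^ (k + 1) = q * q ^ k := by ring
      _ ≤ 1 * q ^ k := mul_le_mul_of_nonneg_right h₁.le (pow_nonneg h₀ k)
      _ = q ^ k := one_mul _
  exact mul_le_mul_of_nonneg_right hpow hC
```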
Scalar quadratic warm-up #
These facts are deliberately small but not merely definitional. They establish the algebraic behavior of gradient descent on the one-dimensional quadratic objective
L(x) = 1/2 * (x - target)^2,
whose gradient is x - target. This is the simplest executable bridge from TorchLean's optimizer
equations to familiar convergence facts; the Hilbert-space operator theorem above is the reusable
version for tensor/vector models.
Gradient of 1/2 * (x - target)^2.
One scalar gradient-descent step on the quadratic objective.
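With hypothetical names, the two definitions described above could look like the following sketch; the file's actual identifiers may differ.

```lean
import Mathlib

/-- The scalar quadratic objective 1/2 * (x - target)^2 (hypothetical name). -/
def quadLoss (target x : ℝ) : ℝ := (1 / 2) * (x - target) ^ 2

/-- Its gradient, x - target (hypothetical name). -/
def quadGrad (target x : ℝ) : ℝ := x - target

/-- One gradient-descent step with learning rate lr (hypothetical name). -/
def quadStep (lr target x : ℝ) : ℝ := x - lr * quadGrad target x
```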
One scalar quadratic gradient-descent step multiplies the current error by 1 - lr.
For ordered fields, this is the usual starting point for contraction proofs when 0 < lr < 2.
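Written without the file's definitions, the identity itself is a one-line ring computation:

```lean
import Mathlib

-- One step x ↦ x - lr * (x - target) maps the error x - target to (1 - lr) * (x - target).
example (lr target x : ℝ) :
    (x - lr * (x - target)) - target = (1 - lr) * (x - target) := by
  ring
```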
For plain SGD, L2 regularization and decoupled weight decay coincide at the update level.
This scalar statement is the common fact behind the regularization note: adding λ x to the
gradient produces the same update as multiplying parameters by (1 - lr*λ) and then taking the
plain gradient step. Adaptive optimizers need separate treatment; AdamW is checked in
Optimization.FirstOrder.
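The update-level identity is pure algebra; a self-contained sketch:

```lean
import Mathlib

-- Adding lam * x to the gradient g (L2 regularization) gives the same SGD update as
-- first scaling the parameter by (1 - lr * lam) and then taking the plain gradient
-- step (decoupled weight decay).
example (lr lam x g : ℝ) :
    x - lr * (g + lam * x) = (1 - lr * lam) * x - lr * g := by
  ring
```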
On the 1D quadratic, if 0 < lr < 2 then one GD step contracts the error in absolute value.
This is the scalar version of the operator-level contraction theorem above.
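A self-contained sketch of the scalar contraction, phrased directly in terms of the factor 1 - lr:

```lean
import Mathlib

-- For 0 < lr < 2 the factor 1 - lr lies strictly inside (-1, 1) ...
example (lr : ℝ) (h₀ : 0 < lr) (h₂ : lr < 2) : |1 - lr| < 1 := by
  rw [abs_lt]
  constructor <;> linarith

-- ... so the post-step error (1 - lr) * (x - target) is no larger in absolute value.
example (lr target x : ℝ) (h : |1 - lr| < 1) :
    |(1 - lr) * (x - target)| ≤ |x - target| := by
  rw [abs_mul]
  exact mul_le_of_le_one_left (abs_nonneg _) h.le
```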