TorchLean API

NN.MLTheory.Optimization.StronglyConvexGD

Gradient Descent Linear Convergence (Operator Form) #

This file contains reusable gradient-descent convergence theorems.

The main theorems are stated for an operator g : E → E on a real inner product space. This is the right abstraction boundary for TorchLean:

If g is μ-strongly monotone and L-Lipschitz, then for suitable step sizes η the fixed-point iteration

x_{k+1} = x_k - η g(x_k)

contracts distances to any root x⋆ of g, i.e. a point with g x⋆ = 0.

For gradients, the usual instantiation is g = ∇f. When f is μ-strongly convex and L-smooth, one can prove (in calculus/convex-analysis land) that ∇f is μ-strongly monotone and L-Lipschitz. This file focuses on the convergence argument itself, keeping the assumptions minimal and reusable.
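For reference, the one-step expansion behind these bounds runs as follows (a sketch; the precise contraction factor is the definition of q below and is not restated here):

‖x - η g x - x⋆‖² = ‖x - x⋆‖² - 2η ⟪g x, x - x⋆⟫ + η² ‖g x‖²
                  ≤ (1 - 2ημ + η²L²) ‖x - x⋆‖²,

using g x⋆ = 0 together with strong monotonicity (⟪g x - g x⋆, x - x⋆⟫ ≥ μ ‖x - x⋆‖²), the Lipschitz bound (‖g x - g x⋆‖ ≤ L ‖x - x⋆‖), and 0 ≤ η so the cross term keeps its sign. This suggests the familiar factor 1 - 2ημ + η²L²; the definition of q in this file is the authoritative form.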

The final ScalarGD namespace keeps the one-dimensional quadratic facts as a compact sanity check: they show the same contraction mechanism in the smallest possible setting and connect plain SGD, L2 regularization, and decoupled weight decay algebraically.

def Optim.GD.q (η μ : ℝ) (L : NNReal) :
ℝ

The squared-distance contraction factor from step_norm_sq_le.

Instances For
    theorem Optim.GD.step_dist_sq_le {E : Type} [NormedAddCommGroup E] [InnerProductSpace ℝ E] (η μ : ℝ) (hη : 0 ≤ η) {L : NNReal} (g : E → E) (hmono : StrongMonotone μ g) (hlip : LipschitzWith L g) {xStar x : E} (hxStar : g xStar = 0) :
    ‖step η g x - xStar‖ ^ 2 ≤ q η μ L * ‖x - xStar‖ ^ 2
    theorem Optim.GD.dist_sq_iterate_le_of_q_nonneg {E : Type} [NormedAddCommGroup E] [InnerProductSpace ℝ E] (η μ : ℝ) (hη : 0 ≤ η) {L : NNReal} (g : E → E) (hmono : StrongMonotone μ g) (hlip : LipschitzWith L g) {xStar x : E} (hxStar : g xStar = 0) (hq : 0 ≤ q η μ L) (k : ℕ) :
    ‖(step η g)^[k] x - xStar‖ ^ 2 ≤ q η μ L ^ k * ‖x - xStar‖ ^ 2

    Iterated contraction bound in squared norm.

    If q η μ L ≥ 0, then after k steps we have

    ‖(step η g)^[k] x - x⋆‖² ≤ (q η μ L)^k * ‖x - x⋆‖².

    theorem Optim.GD.dist_sq_iterate_le_of_q_lt_one {E : Type} [NormedAddCommGroup E] [InnerProductSpace ℝ E] (η μ : ℝ) (hη : 0 ≤ η) {L : NNReal} (g : E → E) (hmono : StrongMonotone μ g) (hlip : LipschitzWith L g) {xStar x : E} (hxStar : g xStar = 0) (hq : 0 ≤ q η μ L) (hq1 : q η μ L < 1) (k : ℕ) :
    ‖(step η g)^[k] x - xStar‖ ^ 2 ≤ q η μ L ^ k * ‖x - xStar‖ ^ 2

    Linear convergence phrased as an exponentially decaying upper bound.

    This is just dist_sq_iterate_le_of_q_nonneg plus the assumption q < 1 which makes the RHS actually shrink with k.
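As a purely numerical illustration (the value q η μ L = 0.9 is made up for the example, not derived from this file), the bound translates directly into an iteration count:

-- hypothetical: suppose q η μ L = 0.9 for some admissible η, μ, L
#eval Float.log 1e-3 / Float.log 0.9
-- ≈ 65.6, so about 66 steps drive the squared-distance bound below 10⁻³ of its initial value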

    Scalar quadratic warm-up #

    These facts are deliberately small but not merely definitional. They prove algebraic behavior of gradient descent on the one-dimensional quadratic objective

    L(x) = 1/2 * (x - target)^2,

    whose gradient is x - target. This is the simplest executable bridge from TorchLean's optimizer equations to familiar convergence facts; the Hilbert-space operator theorem above is the reusable version for tensor/vector models.
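As a quick executable sketch (standalone Lean over Float, mirroring the update described by error_after_step below rather than using the library definitions):

-- mock scalar objective L(x) = 1/2 * (x - target)^2 and its gradient-descent step
def quadraticGradMock (target x : Float) : Float := x - target
def stepMock (lr target x : Float) : Float := x - lr * quadraticGradMock target x

-- with lr = 0.5 and target = 1.0, the error halves each step: 3.0, 2.0, 1.5, 1.25, 1.125
#eval (List.range 5).map (fun k => Nat.iterate (stepMock 0.5 1.0) k 3.0)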

    def Optim.ScalarGD.quadraticGrad {α : Type} [Sub α] (target x : α) :
    α

    Gradient of 1/2 * (x - target)^2.

    Instances For
      def Optim.ScalarGD.step {α : Type} [Sub α] [Mul α] (lr target x : α) :
      α

      One scalar gradient-descent step on the quadratic objective.

      Instances For
        theorem Optim.ScalarGD.target_fixed {α : Type} [CommRing α] (lr target : α) :
        step lr target target = target

        The optimum is a fixed point of the scalar quadratic gradient-descent update.

        theorem Optim.ScalarGD.error_after_step {α : Type} [CommRing α] (lr target x : α) :
        step lr target x - target = (1 - lr) * (x - target)

        One scalar quadratic gradient-descent step multiplies the current error by 1 - lr.

        For ordered fields, this is the usual starting point for contraction proofs when 0 < lr < 2.
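Concretely, taking absolute values in error_after_step gives

|step lr target x - target| = |1 - lr| * |x - target|,

and 0 < lr < 2 is exactly the condition |1 - lr| < 1, so the error strictly shrinks whenever x ≠ target. This is the content of error_abs_contract_real below.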

        def Optim.ScalarGD.stepL2 {α : Type} [Sub α] [Mul α] [Add α] (lr lambda grad x : α) :
        α

        One SGD step on a loss with an L2 regularization term λ/2 * x^2; the regularizer adds λ*x to the gradient.

        Instances For
          theorem Optim.ScalarGD.stepL2_eq_decoupledWeightDecay {α : Type} [CommRing α] (lr lambda grad x : α) :
          stepL2 lr lambda grad x = (1 - lr * lambda) * x - lr * grad

          For plain SGD, L2 regularization and decoupled weight decay coincide at the update level.

          This scalar statement is the common fact behind the regularization note: adding λ x to the gradient produces the same update as multiplying parameters by (1 - lr*λ) and then taking the plain gradient step. Adaptive optimizers need separate treatment; AdamW is checked in Optimization.FirstOrder.
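A small standalone check of the identity (Float mocks that mirror the statement of stepL2_eq_decoupledWeightDecay; they are not the library definitions):

-- L2-regularized SGD step: the regularizer λ/2 * x^2 contributes λ*x to the gradient
def stepL2Mock (lr lambda grad x : Float) : Float := x - lr * (grad + lambda * x)
-- decoupled weight decay: shrink the parameter, then take the plain gradient step
def decoupledMock (lr lambda grad x : Float) : Float := (1 - lr * lambda) * x - lr * grad

-- both evaluate to ≈ 1.968 (agreeing up to Float rounding) for lr = 0.1, λ = 0.01, grad = 0.3, x = 2.0
#eval stepL2Mock 0.1 0.01 0.3 2.0
#eval decoupledMock 0.1 0.01 0.3 2.0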

          theorem Optim.ScalarGD.error_abs_contract_real (lr target x : ℝ) (h0 : 0 < lr) (h2 : lr < 2) (hne : x ≠ target) :
          |step lr target x - target| < |x - target|

          On the 1D quadratic, if 0 < lr < 2 then one GD step contracts the error in absolute value.

          This is the scalar version of the operator-level contraction theorem above.
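For completeness, a standalone sketch of the algebra behind this theorem, stated for the unfolded update x - lr * (x - target) (consistent with error_after_step) and assuming a Mathlib import; the proof steps are illustrative, not the file's actual proof:

import Mathlib

-- one GD step on 1/2 * (x - target)^2 with 0 < lr < 2 strictly contracts the error
example (lr target x : ℝ) (h0 : 0 < lr) (h2 : lr < 2) (hne : x ≠ target) :
    |x - lr * (x - target) - target| < |x - target| := by
  have hstep : x - lr * (x - target) - target = (1 - lr) * (x - target) := by ring
  have hfac : |1 - lr| < 1 := by
    rw [abs_lt]; constructor <;> linarith
  have hpos : 0 < |x - target| := abs_pos.mpr (sub_ne_zero.mpr hne)
  calc |x - lr * (x - target) - target|
      = |1 - lr| * |x - target| := by rw [hstep, abs_mul]
    _ < 1 * |x - target| := mul_lt_mul_of_pos_right hfac hpos
    _ = |x - target| := one_mul _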