TorchLean API

Docs Home Guide Examples Graphs

NN.Proofs.Autograd.Tape.Ops.Norm.LayerNorm

LayerNorm #

Pointwise analytic correctness for a LayerNorm graph.

This is spec-level over ℝ. It is the proof-tape counterpart of the runtime/spec LayerNorm in Spec.layerNorm: a seqLen × embedDim tensor is normalized across the last axis, the row-wise normalizer is broadcast back over each token, and affine parameters gamma/beta are broadcast over the sequence dimension. The runtime API and compiled IR path both route through that spec definition; this file proves the corresponding reverse-mode graph rule.

Because the proof graph uses the differentiable scalar nodes sqrt (max x 0) and inv, the main theorem is pointwise (GraphFDerivCorrectAt) with explicit domain assumptions. Those hypotheses are the honest mathematical boundary: away from the clamp kink and zero denominator, backprop is the adjoint of the Fréchet derivative. The executable Spec.layerNorm additionally clamps the raw variance before adding epsilon as a numerical guard; over exact real variance this is the same contract on the positive branch used by the proof.

PyTorch correspondence / citations #

Conceptually corresponds to torch.nn.LayerNorm (without batching/running stats): normalize along the last dimension, then apply affine parameters (gamma,beta). https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.MatShape (m n : ℕ) :

Matrix shape m×n.

Instances For

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.VecShape (k : ℕ) :

Vector shape k.

Instances For

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.ΓLN (m n : ℕ) :

List Spec.Shape

Input context shapes: [X, gamma, beta] for layer norm over the last axis.

Instances For

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.ssPrefix6 (m n : ℕ) :

List Spec.Shape

First 6 intermediates in the LayerNorm computation (up to var_eps).

Instances For

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.ssPrefix7 (m n : ℕ) :

List Spec.Shape

Prefix intermediates up to std (adds one more vector).

Instances For

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.ssLayerNorm (m n : ℕ) :

List Spec.Shape

Full list of intermediates for the LayerNorm graph in this file.

Instances For

def Proofs.Autograd.LayerNorm.idxX {m n : ℕ} {ss : List Spec.Shape} :

Idx (ΓLN m n ++ ss) (MatShape m n)

Index of the input matrix X in the base LayerNorm context ΓLN m n ++ ss.

Instances For

def Proofs.Autograd.LayerNorm.idxGamma {m n : ℕ} {ss : List Spec.Shape} :

Idx (ΓLN m n ++ ss) (VecShape n)

Index of the scale vector gamma in the base LayerNorm context ΓLN m n ++ ss.

Instances For

def Proofs.Autograd.LayerNorm.idxBeta {m n : ℕ} {ss : List Spec.Shape} :

Idx (ΓLN m n ++ ss) (VecShape n)

Index of the shift vector beta in the base LayerNorm context ΓLN m n ++ ss.

Instances For

def Proofs.Autograd.LayerNorm.idxLast {Γ ss : List Spec.Shape} {τ : Spec.Shape} :

Idx (Γ ++ ss ++ [τ]) τ

Index helper for the last element of an extended context Γ ++ ss ++ [τ].

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeMean {m n : ℕ} :

Node (ΓLN m n) (VecShape m)

Mean over the last axis: mean : ℝ^{m×n} → ℝ^{m}.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g1 {m n : ℕ} :

Graph (ΓLN m n) [VecShape m]

Graph prefix producing [mean].

Instances For

def Proofs.Autograd.LayerNorm.idxMean {m n : ℕ} :

Idx (ΓLN m n ++ [VecShape m]) (VecShape m)

Index of mean in the extended context ΓLN ++ [mean].

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeMeanB {m n : ℕ} :

Node (ΓLN m n ++ [VecShape m]) (MatShape m n)

Broadcast mean back to m×n (row-wise).

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g2 {m n : ℕ} :

Graph (ΓLN m n) [VecShape m, MatShape m n]

Graph prefix producing [mean, mean_b].

Instances For

def Proofs.Autograd.LayerNorm.idxMeanB {m n : ℕ} :

Idx (ΓLN m n ++ [VecShape m, MatShape m n]) (MatShape m n)

Index of mean_b in ΓLN ++ [mean, mean_b].

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeCentered {m n : ℕ} :

Node (ΓLN m n ++ [VecShape m, MatShape m n]) (MatShape m n)

Center: centered := X - mean_b.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g3 {m n : ℕ} :

Graph (ΓLN m n) [VecShape m, MatShape m n, MatShape m n]

Graph prefix producing [mean, mean_b, centered].

Instances For

def Proofs.Autograd.LayerNorm.idxCentered {m n : ℕ} :

Idx (ΓLN m n ++ [VecShape m, MatShape m n, MatShape m n]) (MatShape m n)

Index of centered in ΓLN ++ [mean, mean_b, centered].

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeCenteredSq {m n : ℕ} :

Node (ΓLN m n ++ [VecShape m, MatShape m n, MatShape m n]) (MatShape m n)

Square centered: centered_sq := centered ⊙ centered.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g4 {m n : ℕ} :

Graph (ΓLN m n) [VecShape m, MatShape m n, MatShape m n, MatShape m n]

Graph prefix producing [mean, mean_b, centered, centered_sq].

Instances For

def Proofs.Autograd.LayerNorm.idxCenteredSq {m n : ℕ} :

Idx (ΓLN m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n]) (MatShape m n)

Index of centered_sq in the extended context.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeVar {m n : ℕ} :

Node (ΓLN m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n]) (VecShape m)

Variance per row: var := mean(centered_sq) producing a length-m vector.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g5 {m n : ℕ} :

Graph (ΓLN m n) [VecShape m, MatShape m n, MatShape m n, MatShape m n, VecShape m]

Graph prefix producing [mean, mean_b, centered, centered_sq, var].

Instances For

def Proofs.Autograd.LayerNorm.idxVar {m n : ℕ} :

Idx (ΓLN m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, VecShape m]) (VecShape m)

Index of var in the extended context.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeVarEps {m n : ℕ} (ε : ℝ) :

Node (ΓLN m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, VecShape m]) (VecShape m)

Add epsilon: var_eps := var + ε.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.layerNormPrefix6 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix6 m n)

Graph prefix computing the first 6 intermediates (ssPrefix6).

Instances For

def Proofs.Autograd.LayerNorm.idxVarEps {m n : ℕ} :

Idx (ΓLN m n ++ ssPrefix6 m n) (VecShape m)

Index of var_eps in ΓLN ++ ssPrefix6.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeStd {m n : ℕ} :

Node (ΓLN m n ++ ssPrefix6 m n) (VecShape m)

Standard deviation: std := sqrt_clamp(var_eps).

This is where the development becomes pointwise: differentiability depends on the (clamped) input.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.layerNormPrefix7 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n)

Graph prefix computing ssPrefix7 (adds std).

Instances For

def Proofs.Autograd.LayerNorm.idxStd {m n : ℕ} :

Idx (ΓLN m n ++ ssPrefix7 m n) (VecShape m)

Index of std in ΓLN ++ ssPrefix7.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeInvStd {m n : ℕ} :

Node (ΓLN m n ++ ssPrefix7 m n) (VecShape m)

Inverse standard deviation: inv_std := 1/std.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g8 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n ++ [VecShape m])

Graph prefix adding inv_std.

Instances For

def Proofs.Autograd.LayerNorm.idxInvStd {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m])) (VecShape m)

Index of inv_std in the extended context.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeInvStdB {m n : ℕ} :

Node (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m])) (MatShape m n)

Broadcast inv_std back to m×n (row-wise).

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g9 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n ++ [VecShape m, MatShape m n])

Graph prefix adding inv_std_b.

Instances For

def Proofs.Autograd.LayerNorm.idxCentered9 {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n])) (MatShape m n)

Index of centered in the stage-g9 context.

Instances For

def Proofs.Autograd.LayerNorm.idxInvStdB9 {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n])) (MatShape m n)

Index of inv_std_b in the stage-g9 context.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeNorm {m n : ℕ} :

Node (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n])) (MatShape m n)

Node computing normalized := centered ⊙ inv_std_b.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g10 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n])

Graph prefix producing normalized := centered ⊙ inv_std_b.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeGammaB {m n : ℕ} :

Node (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n])) (MatShape m n)

Broadcast gamma to m×n (column-wise).

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g11 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n])

Graph prefix adding gamma_b.

Instances For

def Proofs.Autograd.LayerNorm.idxNorm11 {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Index of normalized in the context at stage g11.

Instances For

def Proofs.Autograd.LayerNorm.idxGammaB11 {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Index of gamma_b at stage g11.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeScaled {m n : ℕ} :

Node (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Scale: scaled := normalized ⊙ gamma_b.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g12 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, MatShape m n])

Graph prefix adding scaled.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeBetaB {m n : ℕ} :

Node (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Broadcast beta to m×n (column-wise).

Instances For

noncomputable def Proofs.Autograd.LayerNorm.g13 {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, MatShape m n, MatShape m n])

Graph prefix adding beta_b.

Instances For

def Proofs.Autograd.LayerNorm.idxScaled13 {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Index of scaled at stage g13.

Instances For

def Proofs.Autograd.LayerNorm.idxBetaB13 {m n : ℕ} :

Idx (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Index of beta_b at stage g13.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.nodeY {m n : ℕ} :

Node (ΓLN m n ++ (ssPrefix7 m n ++ [VecShape m, MatShape m n, MatShape m n, MatShape m n, MatShape m n, MatShape m n])) (MatShape m n)

Output: y := scaled + beta_b.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.layerNormGraph {m n : ℕ} (ε : ℝ) :

Graph (ΓLN m n) (ssLayerNorm m n)

Full LayerNorm graph (as an explicit snoc chain).

Instances For

noncomputable def Proofs.Autograd.LayerNorm.layerNormGraphFderivCorrectAt {m n : ℕ} (ε : ℝ) (xV : CtxVec (ΓLN m n)) (hVarEpsPos : ∀ (i : Fin (VecShape m).size), 0 < (CtxVec.get idxVarEps ((layerNormPrefix6 ε).evalVec xV)).ofLp i) (hStdNe0 : ∀ (i : Fin (VecShape m).size), (CtxVec.get idxStd ((layerNormPrefix7 ε).evalVec xV)).ofLp i ≠ 0) :

GraphFDerivCorrectAt (layerNormGraph ε) xV

Pointwise proof that layerNormGraph satisfies GraphFDerivCorrectAt.

The hypotheses hVarEpsPos and hStdNe0 are explicit domain assumptions ensuring that sqrt and inv are differentiable at the execution point.

Instances For

theorem Proofs.Autograd.LayerNorm.backprop_eq_adjoint_fderiv_layerNorm_at {m n : ℕ} (ε : ℝ) (xV : CtxVec (ΓLN m n)) (seedV : CtxVec (ΓLN m n ++ ssLayerNorm m n)) (hVarEpsPos : ∀ (i : Fin (VecShape m).size), 0 < (CtxVec.get idxVarEps ((layerNormPrefix6 ε).evalVec xV)).ofLp i) (hStdNe0 : ∀ (i : Fin (VecShape m).size), (CtxVec.get idxStd ((layerNormPrefix7 ε).evalVec xV)).ofLp i ≠ 0) :

(layerNormGraph ε).backpropVec xV seedV = (ContinuousLinearMap.adjoint (fderiv ℝ (layerNormGraph ε).evalVec xV)) seedV

Pointwise end-to-end result: backprop equals (fderiv eval)† for layerNormGraph.

The hypotheses hVarEpsPos and hStdNe0 are the explicit domain assumptions needed for differentiability of sqrt (after clamp) and inv at the actual execution point.

structure Proofs.Autograd.LayerNorm.Inputs (Γ : List Spec.Shape) (m n : ℕ) :

LayerNorm inputs inside an arbitrary tape context.

This is the model-level interface we use once LayerNorm is no longer the root graph. For example, in a post-norm Transformer block, x is the residual stream produced by an earlier SSA node, while gamma and beta are carried parameters in the surrounding context.

x : Idx Γ (MatShape m n)
Sequence/residual matrix normalized across its last axis.
gamma : Idx Γ (VecShape n)
Affine scale vector.
beta : Idx Γ (VecShape n)
Affine shift vector.

Instances For

@[reducible, inline]

abbrev Proofs.Autograd.LayerNorm.ssBeforeY (m n : ℕ) :

List Spec.Shape

Saved tensors before the final LayerNorm output y.

Instances For

def Proofs.Autograd.LayerNorm.idxY {m n : ℕ} :

Idx (ΓLN m n ++ ssLayerNorm m n) (MatShape m n)

Index of the final LayerNorm output in ΓLN ++ ssLayerNorm.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.packInputsCLM {Γ : List Spec.Shape} {m n : ℕ} (inputs : Inputs Γ m n) :

CtxVec Γ →L[ℝ ] CtxVec (ΓLN m n)

Linear map that packs arbitrary-context LayerNorm inputs into the canonical context [X, gamma, beta].

Instances For

noncomputable def Proofs.Autograd.LayerNorm.outputCLM {m n : ℕ} :

CtxVec (ΓLN m n ++ ssLayerNorm m n) →L[ℝ ] Vec (MatShape m n).size

Project the final LayerNorm output from the full canonical graph context.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.wholeNode {Γ : List Spec.Shape} {m n : ℕ} (inputs : Inputs Γ m n) (ε : ℝ) :

Node Γ (MatShape m n)

LayerNorm as one reusable pointwise node over arbitrary context indices.

Internally this node runs the already-proved detailed LayerNorm graph. Its JVP is defined as the Fréchet derivative of that composed map at the current point, and its VJP is the adjoint of that derivative. This is exactly the block-level abstraction needed for large model proofs: the detailed LayerNorm proof remains in this file, while Transformer/GPT/ViT proofs can treat LayerNorm as a single pointwise node with explicit domain assumptions.

Instances For

noncomputable def Proofs.Autograd.LayerNorm.wholeNodeFDerivCorrectAt {Γ : List Spec.Shape} {m n : ℕ} (inputs : Inputs Γ m n) (ε : ℝ) (xV : CtxVec Γ) (hVarEpsPos : ∀ (i : Fin (VecShape m).size), 0 < (CtxVec.get idxVarEps ((layerNormPrefix6 ε).evalVec ((packInputsCLM inputs) xV))).ofLp i) (hStdNe0 : ∀ (i : Fin (VecShape m).size), (CtxVec.get idxStd ((layerNormPrefix7 ε).evalVec ((packInputsCLM inputs) xV))).ofLp i ≠ 0) :

NodeFDerivCorrectAt (wholeNode inputs ε) xV

Pointwise derivative certificate for wholeNode.

The hypotheses are the same LayerNorm domain conditions as the detailed graph theorem, but evaluated after packing the arbitrary context into [X, gamma, beta].

Instances For