LayerNorm #
Pointwise analytic correctness for a LayerNorm graph.
This is spec-level over ℝ. It is the proof-tape counterpart of the runtime/spec LayerNorm in
Spec.layerNorm: a seqLen × embedDim tensor is normalized across the last axis, the row-wise
normalizer is broadcast back over each token, and affine parameters gamma/beta are broadcast
over the sequence dimension. The runtime API and compiled IR path both route through that spec
definition; this file proves the corresponding reverse-mode graph rule.
Because the proof graph uses the differentiable scalar nodes sqrt (max x 0) and inv, the main
theorem is pointwise (GraphFDerivCorrectAt) with explicit domain assumptions. Those hypotheses
are the honest mathematical boundary: away from the clamp kink and the zero denominator, backprop
is the adjoint of the Fréchet derivative. The executable Spec.layerNorm additionally clamps the
raw variance before adding epsilon as a numerical guard; over exact reals the variance is already
nonnegative, so the clamp is inactive and the spec coincides with the positive branch the proof
works on.
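Concretely, for one token row x ∈ ℝⁿ with parameters γ, β ∈ ℝⁿ, the spec-level map is (a sketch over exact reals; the 1/n biased variance follows the PyTorch convention, and Spec.layerNorm itself is authoritative on the clamp/epsilon order):

$$
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
v = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2, \qquad
y_i = \gamma_i \cdot \frac{x_i - \mu}{\sqrt{\max(v, 0) + \varepsilon}} + \beta_i,
$$

applied independently to each of the seqLen rows, with γ and β broadcast over the sequence dimension.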
PyTorch correspondence / citations #
- Conceptually corresponds to torch.nn.LayerNorm (without batching/running stats): normalize along the last dimension, then apply the affine parameters (gamma, beta). https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
Matrix shape m×n.
Vector shape k.
Input context shapes: [X, gamma, beta] for layer norm over the last axis.
First 6 intermediates in the LayerNorm computation (up to var_eps).
Prefix intermediates up to std (adds one more vector).
Full list of intermediates for the LayerNorm graph in this file.
Index of the input matrix X in the base LayerNorm context ΓLN m n ++ ss.
Index helper for the last element of an extended context Γ ++ ss ++ [τ].
Full LayerNorm graph (as an explicit snoc chain).
Pointwise proof that layerNormGraph satisfies GraphFDerivCorrectAt.
The hypotheses hVarEpsPos and hStdNe0 are explicit domain assumptions ensuring that sqrt and
inv are differentiable at the execution point.
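A minimal standalone sketch of the two analytic facts these hypotheses feed, stated with Mathlib's lemma names (the `example`s below are illustrations, not this file's theorems):

```lean
import Mathlib

-- `Real.sqrt` is differentiable at any nonzero point; `hVarEpsPos` supplies
-- positivity of `var_eps`, which also keeps the `max · 0` clamp inactive.
example (t : ℝ) (ht : 0 < t) : DifferentiableAt ℝ Real.sqrt t :=
  Real.differentiableAt_sqrt ht.ne'

-- `inv` is differentiable away from zero; `hStdNe0` supplies the
-- nonvanishing denominator at the actual execution point.
example (t : ℝ) (ht : t ≠ 0) : DifferentiableAt ℝ (fun s : ℝ => s⁻¹) t :=
  differentiableAt_id'.inv ht
```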
Pointwise end-to-end result: backprop equals (fderiv eval)† for layerNormGraph.
The hypotheses hVarEpsPos and hStdNe0 are the explicit domain assumptions needed for
differentiability of sqrt (after clamp) and inv at the actual execution point.
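In inner-product terms, backprop = (fderiv eval)† says that at the execution point p, for every input perturbation u and output cotangent w,

$$
\big\langle \mathrm{fderiv}\,\mathrm{eval}\,(p)\,u,\; w \big\rangle
\;=\;
\big\langle u,\; \mathrm{backprop}\,(p)\,w \big\rangle,
$$

which is exactly the defining property of the adjoint.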
LayerNorm inputs inside an arbitrary tape context.
This is the model-level interface we use once LayerNorm is no longer the root graph. For example,
in a post-norm Transformer block, x is the residual stream produced by an earlier SSA node, while
gamma and beta are carried parameters in the surrounding context.
- x: Sequence/residual matrix normalized across its last axis.
- gamma: Affine scale vector.
- beta: Affine shift vector.
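A self-contained sketch of the shape of this interface, using stand-in `Ty`/`Idx` types (the real file's context and index machinery differ):

```lean
-- Stand-in tensor types and typed de Bruijn indices, for illustration only.
inductive Ty
  | matrix (m n : Nat)
  | vector (k : Nat)

def Idx (Γ : List Ty) (τ : Ty) : Type :=
  { i : Fin Γ.length // Γ.get i = τ }

/-- LayerNorm inputs inside an arbitrary tape context (sketch). -/
structure LayerNormInputs (Γ : List Ty) (m n : Nat) where
  x     : Idx Γ (.matrix m n)  -- sequence/residual matrix
  gamma : Idx Γ (.vector n)    -- affine scale
  beta  : Idx Γ (.vector n)    -- affine shift
```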
Saved tensors before the final LayerNorm output y.
Index of the final LayerNorm output in ΓLN ++ ssLayerNorm.
LayerNorm as one reusable pointwise node over arbitrary context indices.
Internally this node runs the already-proved detailed LayerNorm graph. Its JVP is defined as the Fréchet derivative of that composed map at the current point, and its VJP is the adjoint of that derivative.

This is exactly the block-level abstraction needed for large-model proofs: the detailed LayerNorm proof remains in this file, while Transformer/GPT/ViT proofs can treat LayerNorm as a single pointwise node with explicit domain assumptions.
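For the contract itself, a hedged Mathlib-style sketch (`blockJVP`/`blockVJP` are illustrative names; `E`, `F` stand in for the packed input/output spaces):

```lean
import Mathlib

variable {E F : Type*}
  [NormedAddCommGroup E] [InnerProductSpace ℝ E] [CompleteSpace E]
  [NormedAddCommGroup F] [InnerProductSpace ℝ F] [CompleteSpace F]

/-- JVP of a block node: the Fréchet derivative of its forward map at `p`. -/
noncomputable def blockJVP (f : E → F) (p : E) : E →L[ℝ] F :=
  fderiv ℝ f p

/-- VJP of a block node: the adjoint of that derivative. -/
noncomputable def blockVJP (f : E → F) (p : E) : F →L[ℝ] E :=
  ContinuousLinearMap.adjoint (fderiv ℝ f p)
```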
Pointwise derivative certificate for wholeNode.
The hypotheses are the same LayerNorm domain conditions as in the detailed graph theorem, but now
evaluated after packing the arbitrary context down to [X, gamma, beta].