Post-Norm Transformer Sublayers #
This file packages the LayerNorm theorem at the interface used by post-norm Transformer blocks.
The preceding files prove the smooth residual components:
x ↦ x + MultiHeadSelfAttention(x);
x ↦ x + W₂ GELU(W₁x + b₁) + b₂.
This module proves the next runtime-facing boundary:
residual_stream ↦ LayerNorm(residual_stream, gamma, beta).
That is the exact post-norm sublayer shape used by classical Transformer encoder blocks
(LayerNorm(x + Sublayer(x))). We deliberately keep this proof factored at the residual-stream
interface. Doing so avoids promoting LayerNorm's pointwise domain hypotheses into global
smoothness assumptions, and it gives later full-block proofs a clean seam: compose a globally
smooth residual graph with this pointwise post-norm graph once the context-threading adapter
for unused parameters is in place.
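For concreteness, the boundary proved here is the standard elementwise LayerNorm over the feature dimension, following the torch.nn.LayerNorm documentation cited below (d is the normalized width, μ and σ² the per-row mean and biased variance, ε the positive stability constant):

```latex
\mathrm{LayerNorm}(x;\gamma,\beta)_i
  = \gamma_i \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta_i,
\qquad
\mu = \frac{1}{d}\sum_{j=1}^{d} x_j,
\qquad
\sigma^2 = \frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2 .
```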
References:
- Vaswani et al., "Attention Is All You Need", NeurIPS 2017.
- PyTorch, torch.nn.TransformerEncoderLayer. https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html
- PyTorch, torch.nn.LayerNorm. https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
A post-norm Transformer stage normalizes a sequence-shaped residual stream.
Saved tensors for the LayerNorm part of a post-norm Transformer stage.
MHA context extended with the affine parameters for the following LayerNorm.
Residual-MHA plus one whole-node post-norm output.
Gamma parameter for the post-norm LayerNorm, weakened through any saved tensors.
Beta parameter for the post-norm LayerNorm, weakened through any saved tensors.
The residual stream x + MHA(x) after the residual-attention prefix graph.
LayerNorm input triple after the residual-MHA prefix has run.
Residual-MHA graph while carrying the following LayerNorm's affine parameters.
The carried gamma/beta are not read by attention; DGraph.weakenContext ensures their gradients
from the attention prefix are zero, while still keeping them available to the appended LayerNorm
node.
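In gradient terms the claim is simply that the attention prefix is constant in the carried parameters (illustrative notation, with res(x) = x + MHA(x)):

```latex
\frac{\partial\,\mathrm{res}(x)}{\partial \gamma} = 0,
\qquad
\frac{\partial\,\mathrm{res}(x)}{\partial \beta} = 0,
```

so every gamma/beta cotangent emitted by the full graph comes from the appended LayerNorm node.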
Single SSA graph for the first post-norm Transformer encoder sublayer:
LayerNorm(x + MultiHeadSelfAttention(x), gamma, beta).
LayerNorm is appended as a whole pointwise node backed by the detailed LayerNorm graph theorem.
Pointwise correctness for the single-graph residual-MHA post-norm sublayer.
End-to-end VJP theorem for the single-graph residual-MHA post-norm sublayer.
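Written as a transposed chain rule in the residual-stream input alone (illustrative notation; J denotes a Jacobian, and the gamma/beta cotangents come only from the LayerNorm node, as noted above):

```latex
\mathrm{vjp}_{x}(u)
  = J_{\mathrm{res}}(x)^{\top}\,
    J_{\mathrm{LN}}\bigl(\mathrm{res}(x), \gamma, \beta\bigr)^{\top}\, u,
\qquad
\mathrm{res}(x) = x + \mathrm{MHA}(x),
```

where J_LN is taken in the residual-stream argument of LayerNorm.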
Sequence feed-forward plus post-norm #
This is the second sublayer shape in a post-norm Transformer encoder block:
LayerNorm(X + FFN(X), gamma, beta).
The FFN residual is sequence-shaped, not merely one-token-shaped. Its affine maps are supplied over flattened sequence tensors, so either a shared position-wise implementation or a fused backend can instantiate the theorem by exposing its fixed affine maps.
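One concrete reading of the flattened-sequence phrasing (a sketch, not this module's definition; vec flattens the sequence tensor X into a single vector):

```latex
\operatorname{vec}\bigl(\mathrm{FFN}(X)\bigr)
  = W_2\,\mathrm{GELU}\bigl(W_1 \operatorname{vec}(X) + b_1\bigr) + b_2 .
```

A shared position-wise FFN realizes this with block-diagonal W₁, W₂ and position-tiled b₁, b₂, while a fused backend can expose its own fixed affine maps directly.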
Sequence-FFN context extended with the affine parameters for the following LayerNorm.
Sequence-FFN residual plus one whole-node post-norm output.
Gamma parameter for the FFN post-norm LayerNorm.
Beta parameter for the FFN post-norm LayerNorm.
The residual stream X + FFN(X) after the sequence-FFN prefix graph.
LayerNorm input triple after the residual-FFN prefix has run.
Sequence-FFN graph while carrying the following LayerNorm's affine parameters.
Single SSA graph for the second post-norm Transformer encoder sublayer:
LayerNorm(X + FFN(X), gamma, beta).
Pointwise correctness for the single-graph residual-FFN post-norm sublayer.
End-to-end VJP theorem for the single-graph residual-FFN post-norm sublayer.
The post-norm graph itself.
The context is [residual_stream, gamma, beta]. For attention this residual stream is
x + MHA(x); for the feed-forward half it is x + FFN(x). The residual computation is proved in
ResidualAttention/FeedForward; this graph proves the LayerNorm boundary that follows it.
Pointwise correctness for the post-norm Transformer boundary.
The two hypotheses are exactly LayerNorm's differentiability side conditions at the runtime point: the variance-plus-epsilon branch is positive, and the standard deviation denominator is nonzero.
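A minimal Lean/Mathlib sketch of why both side conditions hold at any runtime point once the epsilon hyperparameter is positive (v stands for the per-row variance, ε for the hyperparameter; this is illustrative, not this module's statement):

```lean
import Mathlib

-- Illustrative only: with a nonnegative variance `v` and a positive `ε`,
-- the variance-plus-epsilon branch is positive and the standard-deviation
-- denominator is nonzero, which are the two LayerNorm side conditions.
example (v ε : ℝ) (hv : 0 ≤ v) (hε : 0 < ε) :
    0 < v + ε ∧ Real.sqrt (v + ε) ≠ 0 := by
  have hpos : 0 < v + ε := by linarith
  exact ⟨hpos, ne_of_gt (Real.sqrt_pos.mpr hpos)⟩
```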
VJP theorem for the post-norm Transformer boundary.
This is the model-level theorem used after either residual attention or a residual feed-forward block has produced its sequence-shaped residual stream.
Calculus bridge for a residual block followed by post-norm LayerNorm.
This is the theorem we use to move from separately proved pieces to a whole Transformer sublayer. Suppose some residual-producing map
residualPack : E → [residual_stream, gamma, beta]
is differentiable at x. It may come from residual attention, residual feed-forward, or any future
block that produces the same LayerNorm context. If the LayerNorm domain hypotheses hold at
residualPack x, then the composed post-norm map
x ↦ LayerNorm(residualPack x)
is differentiable, with derivative given by the usual chain rule.
This is intentionally more general than MHA: the same theorem covers Transformer, ViT, GPT-style blocks, and future residual modules once they expose the residual stream plus affine LayerNorm parameters.
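A minimal Mathlib-level sketch of this bridge, with placeholder maps standing in for the residual pack and the LayerNorm boundary (the names and types below are illustrative, not this module's declarations):

```lean
import Mathlib

variable {E F G : Type*}
  [NormedAddCommGroup E] [NormedSpace ℝ E]
  [NormedAddCommGroup F] [NormedSpace ℝ F]
  [NormedAddCommGroup G] [NormedSpace ℝ G]

-- If the residual pack is differentiable at `x` and LayerNorm is
-- differentiable at `residualPack x` (its local domain hypotheses hold
-- there), then the composed post-norm map is differentiable at `x` ...
example (residualPack : E → F) (layerNorm : F → G) (x : E)
    (hpack : DifferentiableAt ℝ residualPack x)
    (hnorm : DifferentiableAt ℝ layerNorm (residualPack x)) :
    DifferentiableAt ℝ (layerNorm ∘ residualPack) x :=
  hnorm.comp x hpack

-- ... and its Fréchet derivative is the chain-rule composite.
example (residualPack : E → F) (layerNorm : F → G) (x : E)
    (hpack : DifferentiableAt ℝ residualPack x)
    (hnorm : DifferentiableAt ℝ layerNorm (residualPack x)) :
    fderiv ℝ (layerNorm ∘ residualPack) x
      = (fderiv ℝ layerNorm (residualPack x)).comp (fderiv ℝ residualPack x) :=
  fderiv_comp x hnorm hpack
```

The same two-hypothesis shape is what the graph-level theorems above discharge; only the witnesses for hpack and hnorm change between the attention and feed-forward sublayers.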
Calculus bridge for a full two-sublayer post-norm Transformer encoder block.
This theorem is deliberately stated at the map level rather than for one concrete SSA graph. A post-norm encoder block has two domain-sensitive LayerNorms:
- attnPack builds [x + MHA(x), gamma₁, beta₁];
- after the first LayerNorm evaluation, ffnPack builds [norm₁ + FFN(norm₁), gamma₂, beta₂];
- the second LayerNorm produces the block output.
The theorem says that if the two residual-pack maps are differentiable and both LayerNorm calls satisfy their local denominator hypotheses, then the whole two-sublayer block is differentiable by ordinary Fréchet chain rule.
The concrete graph-level VJP theorems for each sublayer are mhaPostNorm_* and seqFfnPostNorm_*.
This bridge is the public mathematical composition point for Transformer, ViT, and GPT-style
post-norm blocks while the final monolithic SSA graph is assembled.
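A matching Mathlib-level sketch of the two-sublayer composition, again with placeholder maps and generic spaces standing in for the concrete sequence shapes:

```lean
import Mathlib

-- Illustrative only: attnPack and ffnPack stand for the two residual-pack
-- maps, norm₁ and norm₂ for the two LayerNorm evaluations; the four
-- hypotheses are the local ones named in the docstring above.
example {E F G F' H : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E]
    [NormedAddCommGroup F] [NormedSpace ℝ F]
    [NormedAddCommGroup G] [NormedSpace ℝ G]
    [NormedAddCommGroup F'] [NormedSpace ℝ F']
    [NormedAddCommGroup H] [NormedSpace ℝ H]
    (attnPack : E → F) (norm₁ : F → G) (ffnPack : G → F') (norm₂ : F' → H) (x : E)
    (h₁ : DifferentiableAt ℝ attnPack x)
    (h₂ : DifferentiableAt ℝ norm₁ (attnPack x))
    (h₃ : DifferentiableAt ℝ ffnPack (norm₁ (attnPack x)))
    (h₄ : DifferentiableAt ℝ norm₂ (ffnPack (norm₁ (attnPack x)))) :
    DifferentiableAt ℝ (norm₂ ∘ ffnPack ∘ norm₁ ∘ attnPack) x :=
  h₄.comp x (h₃.comp x (h₂.comp x h₁))
```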
Named theorem for the post-normalized residual-attention interface.
The residual attention graph proves that it produces the first entry of this context,
residual_stream = x + MHA(x). This theorem proves the LayerNorm pass once that residual stream is
the current tensor.
Named theorem for the post-normalized residual feed-forward interface.
The position-wise FFN proof establishes the smooth residual update. This theorem is the common LayerNorm boundary used after that update in post-norm encoder blocks.