Post-Norm Transformer Encoder Block #
This module names the full two-sublayer post-norm encoder-block theorem.
The theorem is stated at the block boundary:
x ↦ LN₂(FFN(LN₁(MHA(x) + x)) + LN₁(MHA(x) + x)).
The two sublayers already have graph-level VJP theorems in PostNorm.lean. The theorem here is the
composition result that a model proof imports when it needs the whole encoder block as one
differentiable map. A future lowering pass can still build one concrete SSA graph for the exact
runtime layout; this theorem is the mathematical block contract that such a graph must implement.
Concrete SSA encoder block #
The definitions below assemble one executable-style proof graph for a post-norm encoder block. The context is:
[x, Wq, Wk, Wv, Wo, gamma₁, beta₁, gamma₂, beta₂].
The FFN affine maps are supplied as fixed sequence-level linear maps, matching the interface used by
seqFfnResidualDGraph.
Full encoder-block context: MHA parameters, first LayerNorm affine parameters, second LayerNorm affine parameters.
Instances For
Saved tensors for one post-norm encoder block.
Instances For
First LayerNorm scale parameter in the full encoder-block context.
Instances For
First LayerNorm shift parameter in the full encoder-block context.
Instances For
Second LayerNorm scale parameter in the full encoder-block context.
Instances For
Second LayerNorm shift parameter in the full encoder-block context.
Instances For
Residual-attention prefix while carrying both LayerNorm parameter pairs.
Instances For
Residual stream x + MHA(x) after the attention prefix.
Instances For
First LayerNorm input triple in the concrete encoder block.
Instances For
Encoder graph through the first post-norm attention sublayer.
Instances For
First post-norm sublayer output.
Instances For
First post-norm output weakened through the FFN intermediates.
Instances For
First FFN affine output in the full encoder graph.
Instances For
FFN activation output in the full encoder graph.
Instances For
FFN projection output in the full encoder graph.
Instances For
FFN residual output after the second sublayer residual add.
Instances For
Graph through the FFN residual part after the first LayerNorm.
Instances For
Second LayerNorm input triple in the concrete encoder block.
Instances For
Concrete SSA graph for one full post-norm Transformer encoder block.
Instances For
Pointwise analytic correctness for the graph through the FFN residual.
Instances For
Pointwise analytic correctness for the complete concrete encoder-block graph.
Instances For
End-to-end VJP theorem for the concrete post-norm Transformer encoder-block graph.
Fréchet differentiability of a complete post-norm Transformer encoder block.
attnPack builds the first LayerNorm input triple from the outer model context. ffnPack builds
the second LayerNorm input triple after the first post-norm sublayer has evaluated. The LayerNorm
side conditions are local to the two concrete normalization calls.