GPT-Style Decoder Block #
This module packages the post-norm GPT decoder-block composition theorem. The masked attention
front half is supplied as a differentiable residual-pack map; the concrete finite-mask attention
core and its projection/merge composition theorem live in
NN.Proofs.Autograd.Tape.Ops.Attention.MaskedMultiHeadSelfAttention.
Concrete finite-mask decoder-core SSA graph #
The concrete graph below starts after the Q/K/V projection split:
[Q_heads, Kᵀ_heads, V_heads, residual_stream, gamma₁, beta₁, gamma₂, beta₂].
It then runs finite-mask split-head attention, merges the head output through a supplied affine map, adds the residual stream, and applies the two post-norm sublayers. A separate projection theorem can feed this graph from a full token/parameter context.
Concrete decoder-core context: masked attention core inputs, residual stream, and two LayerNorm parameter pairs.
Instances For
Saved tensors for the concrete finite-mask decoder-core block.
Instances For
Residual-stream input in the decoder-core context.
Instances For
First LayerNorm scale parameter in the decoder-core context.
Instances For
First LayerNorm shift parameter in the decoder-core context.
Instances For
Second LayerNorm scale parameter in the decoder-core context.
Instances For
Second LayerNorm shift parameter in the decoder-core context.
Instances For
Masked attention core while carrying residual and LayerNorm parameters.
Instances For
Split-head masked attention output after the masked core.
Instances For
Merged attention output after the supplied output projection.
Instances For
Residual input weakened past the merged attention output.
Instances For
Residual attention stream x + masked_attention(x) before the first LayerNorm.
Instances For
First LayerNorm input triple for the concrete decoder-core graph.
Instances For
Decoder graph through the first post-norm masked-attention sublayer.
Instances For
First decoder post-norm output.
Instances For
First decoder post-norm output weakened through FFN intermediates.
Instances For
First FFN affine output in the concrete decoder graph.
Instances For
FFN activation output in the concrete decoder graph.
Instances For
FFN projection output in the concrete decoder graph.
Instances For
FFN residual output before the second decoder LayerNorm.
Instances For
Decoder graph through the FFN residual.
Instances For
Second LayerNorm input triple for the concrete decoder-core graph.
Instances For
Concrete SSA graph for one finite-mask GPT-style decoder-core block.
Instances For
Pointwise analytic correctness for the decoder graph through the FFN residual.
Instances For
Pointwise analytic correctness for the complete concrete decoder-core graph.
Instances For
End-to-end VJP theorem for the concrete finite-mask GPT-style decoder-core graph.
Projection-to-residual bridge for a GPT-style masked decoder attention sublayer.
The concrete decoder-core graph above starts from already split Q, Kᵀ, and V heads. This
theorem is the reusable front-end hook for full GPT blocks: any differentiable projection/split
stage may build those heads, and any differentiable merge/residual pack may turn the masked
attention trace into the first LayerNorm input triple [x + MaskedMHA(x), gamma₁, beta₁].
Combine this theorem with postNormGptDecoderBlock_hasFDerivAt below to get the full projected
finite-mask decoder-block differentiability statement.
Fréchet differentiability of a GPT-style post-norm decoder block.
maskedAttentionPack builds the first LayerNorm input triple
[x + MaskedMHA(x), gamma₁, beta₁]. Instantiate its differentiability hypothesis with
MultiHeadAttention.projectedMaskedAttention_hasFDerivAt when the attention sublayer is built from
the proved finite-mask split-head core.