Residual Attention Blocks #
This file proves the next composition step after full multi-head self-attention:
x ↦ x + MHA(x).
That residual add is the first half of a post-norm Transformer encoder sublayer:
LayerNorm(x + MultiHeadSelfAttention(x)).
The existing MHA theorem already proves that the attention graph's reverse pass is the adjoint of its Fréchet derivative. Here we append the residual add as one more proved tape node, yielding a reusable graph theorem for the residual stream that the runtime Transformer blocks then feed into post-norm LayerNorm.
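The residual add contributes only an identity term to the derivative, which is why it can be appended as a single cheap tape node. Writing g for the MHA map:

$$
f(x) = x + g(x)
\;\Longrightarrow\;
Df(x)[h] = h + Dg(x)[h]
\;\Longrightarrow\;
(Df(x))^{*}[v] = v + (Dg(x))^{*}[v].
$$

In the reverse pass, the incoming cotangent v therefore flows unchanged along the skip connection, flows through the already-proved MHA reverse pass, and the two contributions are summed.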
References:
- Vaswani et al., "Attention Is All You Need", NeurIPS 2017.
- PyTorch torch.nn.MultiheadAttention: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
- PyTorch torch.nn.TransformerEncoderLayer: https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html
The list of MHA intermediates, followed by one residual-add output.
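A minimal sketch of that shape, using plain lists rather than this file's tape types (`ssResidual` and the argument names below are stand-ins, not this repo's definitions):

```lean
/-- Illustrative only: the residual block's intermediates are the MHA
saved tensors followed by the single residual-add output. -/
def ssResidual {T : Type} (ssMHA : List T) (residualOut : T) : List T :=
  ssMHA ++ [residualOut]
```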
Original sequence input x, weakened into the context after the MHA intermediates.
The final output of mhaDGraph, i.e. the projected attention result.
The literal index is intentional: MultiHeadAttention.ssMHA is a fixed 14-entry saved-tensor list, and the final
entry is the attention output with the same shape as the input sequence.
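As a toy illustration of the same indexing fact, assuming nothing beyond core Lean: the last valid index into a 14-entry list is the literal 13.

```lean
-- Illustrative only: index 13 selects the last of 14 entries; the bound
-- 13 < 14 is discharged automatically for a literal list.
example : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13][13] = 13 := rfl
```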
Proof-carrying graph for x + MHA(x).
Context layout is inherited from MHA:
[x, Wq, Wk, Wv, Wo].
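A hedged plain-function sketch of what this graph computes, with `g` standing in for the MHA map and `gVJP x` for its proved reverse pass at `x` (none of these names are this repo's API):

```lean
variable {α : Type} [Add α]

/-- Forward pass of the sublayer: `x + g x`, with `g` standing in for MHA. -/
def residual (g : α → α) (x : α) : α :=
  x + g x

/-- Reverse pass: the cotangent `v` flows unchanged through the skip
connection and through `g`'s VJP at `x`; the residual node sums the two. -/
def residualVJP (gVJP : α → α → α) (x v : α) : α :=
  v + gVJP x v
```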
End-to-end VJP theorem for the residual-attention sublayer x + MHA(x).
This is the proved residual-stream component used by post-norm Transformer blocks. LayerNorm itself
has its own current-spec VJP theorem in NN.Proofs.Autograd.Tape.Ops.Norm.LayerNorm; the composed
post-norm attention sublayer and the two-sublayer post-norm bridge are packaged in
NN.Proofs.Autograd.Tape.Ops.Transformer.PostNorm.
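For orientation, here is the composition those modules assemble, again as a plain-function sketch (`layerNorm` and `mha` are stand-ins for the proved components, not this repo's definitions):

```lean
variable {α : Type} [Add α]

/-- Post-norm attention sublayer: this page's theorem covers the inner
`x + mha x`; `layerNorm` has its own VJP theorem, and the composition is
packaged in the PostNorm module. -/
def postNormAttn (layerNorm mha : α → α) (x : α) : α :=
  layerNorm (x + mha x)
```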