Masked Multi-Head Attention Core #
This module proves the head-wise finite-mask attention core used inside causal Transformer blocks.
The surrounding projection/split/merge graph lives in MultiHeadSelfAttention.lean; the theorem here
is the reusable masked replacement for the score/probability/value part of that graph:
softmax(c · QKᵀ + bias) V.
The mask is a fixed finite score bias with shape (heads × seq × seq). This covers the
differentiable finite-mask convention used by GPT-style causal attention. A true -∞ hard mask is
a separate semantic theorem, not the differentiable runtime computation proved here.
Context for a masked multi-head attention core: [Q_heads, Kᵀ_heads, V_heads].
Instances For
Saved tensors for the fixed-bias multi-head attention core.
Instances For
Query-head index in the masked attention core context.
Instances For
Transposed-key index in the masked attention core context.
Instances For
Value-head index in the masked attention core context.
Instances For
Proof-carrying masked multi-head attention core.
The fixed bias is added after scaling the score tensor and before the row-wise softmax.
Instances For
Reverse-mode theorem for the fixed-bias multi-head attention core.
Composition theorem for the masked core after a projection/split stage.
projectPack is the mathematical interface exposed by a full MHA front half: it builds
[Q_heads, Kᵀ_heads, V_heads] from an outer context. The theorem composes that front half with the
proved finite-mask attention core.
Full masked-attention composition contract.
The theorem separates the already-proved pieces of a GPT-style attention block:
- a differentiable front half that projects/splits inputs into
Q,Kᵀ, andV; - the proved finite-mask split-head attention core;
- a differentiable back half that merges/project outputs or packages residual data for the caller.
Instantiating projectPack and mergePack with the concrete projection/split/merge graphs gives
the full masked-MHA differentiability statement without changing the core proof.