MultiHeadSelfAttention #
End-to-end fderiv/backprop correctness for a Multi-Head Self-Attention graph,
decomposed into proven tape nodes:
- linear projections via matmul,
- head split/merge via reshape + swap_first_two3d,
- attention core via batched matmul + transpose3d_last_two + scale + batchedsoftmax_last.

This is spec-level over ℝ. It is a corollary of the general graph theorem once each node
used by the graph has a NodeFDerivCorrect instance.
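As an informal illustration of this decomposition (a minimal NumPy sketch, not the Lean
development; the helper names are hypothetical stand-ins for the tape nodes listed above):

```python
import numpy as np

def softmax_last(t):
    # Softmax over the last axis (the role played by batchedsoftmax_last).
    e = np.exp(t - t.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_core(q, k, v, c):
    # Batched attention core: softmax(c * (Q Kᵀ)) V,
    # for q, k, v of shape (numHeads, n, headDim).
    kt = np.swapaxes(k, -1, -2)      # transpose3d_last_two: (numHeads, headDim, n)
    scores = c * (q @ kt)            # batched matmul + scale: (numHeads, n, n)
    return softmax_last(scores) @ v  # softmax + batched matmul
```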
PyTorch correspondence / citations #
- The construction matches the usual “project → split heads → scaled dot-product attention →
  concat heads → output projection” pipeline used by torch.nn.MultiheadAttention.
  https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
- The core attention step corresponds to torch.nn.functional.scaled_dot_product_attention.
  https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
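The correspondence can be spot-checked numerically. The following is an illustrative
sketch (the weight and dimension names are not from the formalization), comparing the
manual pipeline against the fused PyTorch kernel:

```python
import torch
import torch.nn.functional as F

n, num_heads, head_dim = 5, 4, 8
d_model = num_heads * head_dim
x = torch.randn(n, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

def split_heads(t):
    # project → split heads: (n, d_model) → (num_heads, n, head_dim)
    return t.view(n, num_heads, head_dim).transpose(0, 1)

q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

manual = F.softmax((q @ k.transpose(-1, -2)) / head_dim**0.5, dim=-1) @ v
fused = F.scaled_dot_product_attention(q, k, v)  # defaults to the same 1/√headDim scale
assert torch.allclose(manual, fused, atol=1e-5)
```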
Sequence input shape n×dModel.
Concatenated-head representation n×(numHeads*headDim).
Split-head representation (numHeads)×n×headDim.
Key-transposed shape (numHeads)×headDim×n used for Q Kᵀ.
Attention scores shape (numHeads)×n×n.
Intermediate shape after swapping axes for concatenation n×numHeads×headDim.
Intermediate node output shapes (tape “saved tensors”) for the MHA graph.
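The tape “saved tensors” play the same role as ctx.save_for_backward in a custom PyTorch
torch.autograd.Function: the forward pass records whatever the backward pass will need.
A generic toy illustration (not the Lean tape):

```python
import torch

class Scale(torch.autograd.Function):
    # Toy tape node y = c * x: the forward pass saves what backward needs.
    @staticmethod
    def forward(ctx, x, c):
        ctx.c = c  # scalar saved for backward (tensors would use ctx.save_for_backward)
        return c * x

    @staticmethod
    def backward(ctx, grad_out):
        # The VJP of the linear map x ↦ c*x is multiplication by c (its adjoint).
        return ctx.c * grad_out, None

x = torch.randn(3, requires_grad=True)
Scale.apply(x, 2.0).sum().backward()
assert torch.allclose(x.grad, torch.full((3,), 2.0))
```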
Projection weight shape dModel×(numHeads*headDim) (used for Q/K/V).
Output projection weight shape (numHeads*headDim)×dModel.
Input context shapes: [x, Wq, Wk, Wv, Wo].
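Putting the shape abbreviations together, here is an illustrative NumPy shape trace with
made-up sizes (note that dModel need not equal numHeads*headDim; the trace reuses one
projected tensor for both matmul operands since only shapes are tracked):

```python
import numpy as np

n, num_heads, head_dim, d_model = 5, 4, 8, 10
x = np.random.randn(n, d_model)                        # sequence input: n×dModel
Wq = np.random.randn(d_model, num_heads * head_dim)    # projection weight (Q/K/V)
Wo = np.random.randn(num_heads * head_dim, d_model)    # output projection weight

q = x @ Wq                                             # n×(numHeads*headDim)
qh = q.reshape(n, num_heads, head_dim).swapaxes(0, 1)  # numHeads×n×headDim
kt = np.swapaxes(qh, -1, -2)                           # numHeads×headDim×n (for Q Kᵀ)
scores = qh @ kt                                       # attention scores: numHeads×n×n
merged = (scores @ qh).swapaxes(0, 1)                  # n×numHeads×headDim
out = merged.reshape(n, num_heads * head_dim) @ Wo     # back to n×dModel
assert out.shape == (n, d_model)
```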
Context index of the sequence input x in ΓMHA.
Context index of Wq in ΓMHA.
Context index of Wk in ΓMHA.
Context index of Wv in ΓMHA.
Context index of Wo in ΓMHA.
Index of the most-recently appended tensor in a DGraph context.
Multi-head self-attention as a proof-carrying graph.
This implements
  (x, Wq, Wk, Wv, Wo) ↦ concat_heads (softmax (c * (Q Kᵀ)) V) Wo,
with Q/K/V projected from x and Wo applied as the final output projection (right
multiplication, matching its (numHeads*headDim)×dModel shape).
The graph is laid out to match typical runtime implementations:
view(...).transpose(...) is modeled by reshape + swap_first_two3d.
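That modeling choice is easy to sanity-check in PyTorch (an illustrative check, not
project code):

```python
import torch

n, num_heads, head_dim = 5, 4, 8
q = torch.randn(n, num_heads * head_dim)

# The runtime idiom: view, then transpose the first two axes ...
a = q.view(n, num_heads, head_dim).transpose(0, 1)
# ... agrees with reshape followed by swapping the first two axes,
# i.e. the reshape + swap_first_two3d decomposition used by the graph.
b = q.reshape(n, num_heads, head_dim).permute(1, 0, 2)
assert torch.equal(a, b) and a.shape == (num_heads, n, head_dim)
```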
Corollary of the general DAG theorem: backprop equals (fderiv eval)† for the MHA graph.
This is the formal “VJP correctness” statement for the full MHA computation (as laid out by
mhaDGraph).
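Concretely, “backprop equals (fderiv eval)†” is the adjoint identity ⟨J u, w⟩ = ⟨u, Jᵀ w⟩.
The theorem establishes it exactly over ℝ; a finite-size numerical spot-check of the same
identity in PyTorch, with a toy map standing in for the MHA evaluation, looks like this:

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    # Toy smooth stand-in for the graph's evaluation map.
    return torch.softmax(x @ x.T, dim=-1)

x = torch.randn(3, 3)
u = torch.randn(3, 3)  # input (tangent) direction
w = torch.randn(3, 3)  # output (cotangent) direction

_, Ju = jvp(f, x, u)   # forward mode: J u
_, JTw = vjp(f, x, w)  # reverse mode (backprop): Jᵀ w

# Adjoint identity: ⟨J u, w⟩ = ⟨u, Jᵀ w⟩
assert torch.allclose((Ju * w).sum(), (u * JTw).sum(), atol=1e-5)
```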