ScaledDotProduct #
End-to-end fderiv/backprop correctness for a scaled dot-product attention graph,
built out of the proven tape nodes (matmul, matrix_transpose, scale, softmax_last).
This is spec-level over ℝ. It is a corollary of the general graph theorem once each node
used by the graph has a NodeFDerivCorrect instance.
PyTorch correspondence / citations #
- This file matches the usual mathematical definition of scaled dot-product attention:
softmax(c * Q Kᵀ) V. In PyTorch this corresponds to building the same computation with torch.matmul + torch.softmax, or using the dedicated helper: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
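A minimal numerical sketch of this correspondence (an illustration only, not part of the Lean development; it assumes PyTorch >= 2.0 and relies on the helper's default scale c = 1/√d):

```python
import torch
import torch.nn.functional as F

m, d = 4, 8
Q, K, V = (torch.randn(1, m, d) for _ in range(3))   # one batch, m×d each
c = 1.0 / d ** 0.5                                    # the helper's default scale

manual = torch.softmax(c * Q @ K.transpose(-2, -1), dim=-1) @ V   # softmax(c * Q Kᵀ) V
helper = F.scaled_dot_product_attention(Q, K, V)                   # dedicated fused helper
assert torch.allclose(manual, helper, atol=1e-5)
```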
Matrix shape for m×d Q/K/V inputs.
Input context shapes: [Q, K, V], each m×d.
Context index of Q in ΓQKV.
Context index of K in ΓQKV.
Context index of V in ΓQKV.
Index of the most-recently appended tensor in a DGraph context.
Scaled dot-product attention as a proved-correct DGraph.
Computes Q K V ↦ softmax(c * (Q * Kᵀ)) * V and records intermediate values needed by backprop.
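For orientation, a small PyTorch sketch of the same node decomposition; the function name and the intermediate names S and P are illustrative, not taken from the Lean development:

```python
import torch

def attention_forward(Q, K, V, c):
    """Forward pass through the graph's nodes, also returning the
    intermediates that a reverse (backprop) pass would read off the tape."""
    S = c * Q @ K.transpose(-2, -1)   # matmul, matrix_transpose, scale nodes
    P = torch.softmax(S, dim=-1)      # softmax_last node (row-wise softmax)
    O = P @ V                         # final matmul node
    return O, (S, P)
```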
Corollary of the general DAG theorem: backprop equals (fderiv eval)† for the attention graph.
This is the formal statement that the tape reverse pass computes the vector-Jacobian product (VJP) for the full attention computation.
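As an informal numerical analogue of the statement, the hand-written reverse pass below reproduces PyTorch autograd's VJP for the same computation (the variable names and the comparison against torch.autograd are illustrative assumptions, not part of the formal development):

```python
import torch

torch.manual_seed(0)
m, d = 4, 3
c = 1.0 / d ** 0.5
Q, K, V = (torch.randn(m, d, dtype=torch.double, requires_grad=True) for _ in range(3))
dO = torch.randn(m, d, dtype=torch.double)            # cotangent fed to the reverse pass

# Forward pass, recording the intermediates the tape keeps.
S = c * Q @ K.T                                       # scaled scores, m×m
P = torch.softmax(S, dim=-1)                          # attention weights
O = P @ V                                             # output, m×d

# Manual reverse pass: the VJP through matmul, softmax_last, scale, transpose.
dV = P.T @ dO
dP = dO @ V.T
dS = P * (dP - (dP * P).sum(dim=-1, keepdim=True))    # softmax VJP, row by row
dQ = c * dS @ K
dK = c * dS.T @ Q

# Autograd computes the same vector-Jacobian product.
gQ, gK, gV = torch.autograd.grad(O, (Q, K, V), grad_outputs=dO)
assert all(torch.allclose(a, b) for a, b in zip((dQ, dK, dV), (gQ, gK, gV)))
```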