Masked Scaled-Dot-Product Attention

This file proves reverse-mode correctness for the differentiable fixed-mask form of attention used by GPT-style attention blocks:

softmax(c · QKᵀ + bias) V.

The bias tensor is fixed data, not a differentiated input. A causal mask can instantiate it with 0 on allowed entries and a large negative finite value on blocked entries, matching the finite-mask convention used by many runtimes. This theorem does not cover the hard -∞ masking limit; it is the exact reverse-mode theorem for the finite additive-mask computation.
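As an informal illustration of the finite-mask convention described above, here is a minimal pure-Python sketch of softmax(c · QKᵀ + bias) V with a causal bias. It is not part of the Lean development: the helper names (`softmax_row`, `causal_bias`, `masked_attention`), the blocked value `-1e9`, and the sample matrices are all assumptions chosen for the example.

```python
import math

def softmax_row(row):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_bias(m, blocked=-1e9):
    # 0 where key index j <= query index i, a large negative finite value elsewhere.
    # The value -1e9 is an illustrative choice, not one fixed by the source.
    return [[0.0 if j <= i else blocked for j in range(m)] for i in range(m)]

def masked_attention(Q, K, V, c, bias):
    # Computes softmax(c * Q K^T + bias) V for m x d inputs given as nested lists.
    m, d = len(Q), len(Q[0])
    out = []
    for i in range(m):
        logits = [c * sum(Q[i][k] * K[j][k] for k in range(d)) + bias[i][j]
                  for j in range(m)]
        p = softmax_row(logits)
        out.append([sum(p[j] * V[j][k] for j in range(m)) for k in range(d)])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = masked_attention(Q, K, V, c=1 / math.sqrt(2), bias=causal_bias(3))
print(out[0])  # the first query can only attend to the first key
```

With this bias the first query row puts all of its weight on the first key, so the first output row reproduces the first row of V. Every entry of the bias stays finite, which is the setting the theorem addresses.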

@[reducible, inline]
abbrev Proofs.Autograd.Attention.ssMaskedScaledDotProduct (m d : ℕ) : List Spec.Shape

Saved tensors for fixed-bias scaled-dot-product attention.

noncomputable def Proofs.Autograd.Attention.maskedScaledDotProductDGraph {m d : ℕ} (c : ℝ) (bias : Vec (Spec.Shape.dim m (Spec.Shape.dim m Spec.Shape.scalar)).size := 0) :
    DGraph (ΓQKV m d) (ssMaskedScaledDotProduct m d)

Scaled dot-product attention with a fixed additive score bias.

The proof follows the unmasked graph, with one extra affine node that adds the bias between scaling and softmax. Because the bias is fixed, the derivative of this node with respect to the scaled logits is the identity.
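The chain of adjoints that this remark refers to can be written out as follows. This is a sketch in notation introduced here for illustration (S, Z, P, O for the intermediate values, B for the bias, and bars for cotangents); none of these symbols come from the Lean source, and the output cotangent is restricted to the final attention output for brevity.

```latex
\begin{aligned}
&\text{Forward:} & S &= c\,QK^{\top}, \quad Z = S + B, \quad P = \operatorname{softmax}_{\text{rows}}(Z), \quad O = PV \\
&\text{Value and weights:} & \bar V &= P^{\top}\bar O, \quad \bar P = \bar O\,V^{\top} \\
&\text{Softmax:} & \bar Z_{ij} &= P_{ij}\Bigl(\bar P_{ij} - \sum_{k} P_{ik}\,\bar P_{ik}\Bigr) \\
&\text{Bias node:} & \bar S &= \bar Z \qquad \text{since } \partial Z / \partial S = I \text{ for fixed } B \\
&\text{Scaled scores:} & \bar Q &= c\,\bar S\,K, \quad \bar K = c\,\bar S^{\top} Q
\end{aligned}
```

The only line that differs from the unmasked case is the bias node, and it passes the cotangent through unchanged.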

theorem Proofs.Autograd.Attention.backprop_eq_adjoint_fderiv_maskedScaledDotProduct {m d : ℕ} (c : ℝ) (bias : Vec (Spec.Shape.dim m (Spec.Shape.dim m Spec.Shape.scalar)).size := 0) (xV : CtxVec (ΓQKV m d)) (seedV : CtxVec (ΓQKV m d ++ ssMaskedScaledDotProduct m d)) :
    (maskedScaledDotProductDGraph c bias).g.backpropVec xV seedV = (ContinuousLinearMap.adjoint (fderiv ℝ (maskedScaledDotProductDGraph c bias).g.evalVec xV)) seedV

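Reverse-mode theorem for finite additive-mask scaled-dot-product attention: backpropVec applied to a seed at xV equals the adjoint of the Fréchet derivative of evalVec at xV applied to the same seed.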
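A numerical analogue of this statement can be checked informally: for a direction v and a seed, the inner product of J·v with the seed should match the inner product of v with the backpropagated seed, where J is the derivative of the forward map. The pure-Python sketch below is only an illustration under stated assumptions. It seeds the final attention output alone, whereas the theorem's seed ranges over the whole context including saved tensors. It approximates J·v by central finite differences, and all function names and sample data are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(Q, K, V, c, bias):
    # S = c * Q K^T, Z = S + bias, P = row-wise softmax(Z), O = P V.
    S = [[c * s for s in row] for row in matmul(Q, transpose(K))]
    Z = [[s + b for s, b in zip(srow, brow)] for srow, brow in zip(S, bias)]
    P = []
    for row in Z:
        top = max(row)
        exps = [math.exp(z - top) for z in row]
        total = sum(exps)
        P.append([e / total for e in exps])
    return P, matmul(P, V)

def backward(Q, K, V, c, bias, seed):
    # Hand-written reverse pass: maps an output seed to (dQ, dK, dV).
    P, _ = forward(Q, K, V, c, bias)
    dV = matmul(transpose(P), seed)
    dP = matmul(seed, transpose(V))
    dZ = []
    for p, g in zip(P, dP):
        dot = sum(pi * gi for pi, gi in zip(p, g))
        dZ.append([pi * (gi - dot) for pi, gi in zip(p, g)])
    dS = dZ  # the fixed-bias node has identity derivative
    dQ = [[c * x for x in row] for row in matmul(dS, K)]
    dK = [[c * x for x in row] for row in matmul(transpose(dS), Q)]
    return dQ, dK, dV

def inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def shifted(A, D, t):
    return [[a + t * d for a, d in zip(ra, rd)] for ra, rd in zip(A, D)]

c = 1 / math.sqrt(2)
bias = [[0.0 if j <= i else -1e9 for j in range(3)] for i in range(3)]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
tQ = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]   # perturbation directions
tK = [[-0.1, 0.2], [0.3, 0.1], [0.2, -0.4]]
tV = [[0.5, 0.1], [-0.2, 0.3], [0.1, -0.1]]
seed = [[1.0, -1.0], [0.5, 2.0], [-0.3, 0.7]]  # output cotangent

h = 1e-5
_, plus = forward(shifted(Q, tQ, h), shifted(K, tK, h), shifted(V, tV, h), c, bias)
_, minus = forward(shifted(Q, tQ, -h), shifted(K, tK, -h), shifted(V, tV, -h), c, bias)
jv = [[(a - b) / (2 * h) for a, b in zip(ra, rb)] for ra, rb in zip(plus, minus)]

lhs = inner(jv, seed)                       # <J v, seed>
dQ, dK, dV = backward(Q, K, V, c, bias, seed)
rhs = inner(dQ, tQ) + inner(dK, tK) + inner(dV, tV)  # <v, J^T seed>
print(abs(lhs - rhs))
```

The two sides agree only up to finite-difference error, so this is a sanity check rather than a proof. The Lean theorem states the equality exactly, for every input and seed.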

TorchLean API

NN.Proofs.Autograd.Tape.Ops.Attention.MaskedScaledDotProduct