TorchLean API

NN.Proofs.Autograd.Tape.Ops.Attention.MaskedScaledDotProduct

Masked Scaled-Dot-Product Attention #

This file proves the differentiable fixed-mask form used by GPT-style attention blocks:

softmax(c · QKᵀ + bias) V.

The bias tensor is fixed data. A causal mask can instantiate it with 0 on allowed entries and a large negative finite value on blocked entries, matching the finite-mask convention used by many runtimes. This theorem is not the hard -∞ masking limit; it is the exact reverse-mode theorem for the finite additive-mask computation.

@[reducible, inline]

Saved tensors for fixed-bias scaled-dot-product attention.

Instances For

    Scaled dot-product attention with a fixed additive score bias.

    The proof follows the unmasked graph with one extra affine identity node between scaling and softmax. Because the bias is fixed, its derivative is the identity on the scaled logits.

    Instances For

      Reverse-mode theorem for finite additive-mask scaled-dot-product attention.