Attention weights sum to 1 (spec layer, ℝ)
For unmasked scaled dot-product attention, the attention weights are computed by applying
Activation.softmaxSpec row-wise to the scaled score matrix.
This file records the normalization theorem that each row of the resulting weight matrix sums to 1.
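Concretely, and assuming Activation.softmaxSpec is the standard row-wise softmax (the
notation \(w_{ij}\), \(s_{ij}\) below is illustrative, not taken from the formalization),
the identity being packaged is

\[
  w_{ij} \;=\; \frac{\exp(s_{ij})}{\sum_{k=1}^{n_K} \exp(s_{ik})},
  \qquad
  \sum_{j=1}^{n_K} w_{ij}
  \;=\; \frac{\sum_{j=1}^{n_K} \exp(s_{ij})}{\sum_{k=1}^{n_K} \exp(s_{ik})}
  \;=\; 1,
\]

where the denominator is nonzero because exp is strictly positive and \(n_K \neq 0\).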
Masked attention uses hard-mask semantics; its normalization theorem needs the additional
well-formedness assumption that each query row admits at least one unmasked key.
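For contrast, a sketch of why the hard-mask case needs that assumption (the mask predicate
\(m_{ik}\) and index set \(U_i\) are illustrative notation, not names from this development):
the row sum becomes

\[
  \sum_{j \in U_i} \frac{\exp(s_{ij})}{\sum_{k \in U_i} \exp(s_{ik})},
  \qquad U_i = \{\, k \mid m_{ik} \,\},
\]

which equals 1 exactly when \(U_i \neq \emptyset\); an all-masked row makes the denominator
an empty sum, so no normalization statement is possible without the extra hypothesis.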
theorem NN.Proofs.Models.Attention.scaledDotProductAttention_unmasked_weights_row_sum_one
    {nQ nK dModel : ℕ} {hQ : nQ ≠ 0} {hK : nK ≠ 0}
    (ctx : Spec.AttentionContext ℝ nQ nK dModel hQ hK) (i : Fin nQ) :
    have scale := MathFunctions.sqrt ↑dModel;
    have scores := Spec.matMulSpec ctx.Q ctx.K.matrixTransposeSpec;
    have scaledScores := scores.scaleSpec (1 / scale);
    have attentionWeights := Activation.softmaxSpec scaledScores;
    (Spec.get attentionWeights i).sumSpec = 1
In unmasked scaled dot-product attention, each query row of the attention-weight matrix sums to 1.
This is purely a property of Activation.softmaxSpec; it does not depend on the particular
scores or on the scaling.
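As a usage sketch only: the have bindings in the statement are definitionally transparent,
so the theorem can be applied directly to the unfolded goal. The surrounding example is
hypothetical and assumes the project's imports are in scope; only the theorem itself and
the names appearing in its statement come from this file.

example {nQ nK dModel : ℕ} {hQ : nQ ≠ 0} {hK : nK ≠ 0}
    (ctx : Spec.AttentionContext ℝ nQ nK dModel hQ hK) (i : Fin nQ) :
    (Spec.get
        (Activation.softmaxSpec
          ((Spec.matMulSpec ctx.Q ctx.K.matrixTransposeSpec).scaleSpec
            (1 / MathFunctions.sqrt ↑dModel)))
        i).sumSpec = 1 :=
  -- the `have` chain in the theorem statement unfolds to exactly this goal
  NN.Proofs.Models.Attention.scaledDotProductAttention_unmasked_weights_row_sum_one ctx i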