Attention weights sum to 1 (spec layer, ℝ)
For unmasked scaled dot-product attention, the attention weights are computed by applying
Activation.softmaxSpec row-wise to the scaled score matrix.
This file records the normalization theorem that each row of the resulting weight matrix sums to 1.
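Concretely, and assuming Activation.softmaxSpec is the standard row-wise softmax (the
notation \(w_{ij}\), \(s_{ij}\) below is illustrative, not taken from the formalization),
the identity being packaged is

\[
  w_{ij} \;=\; \frac{\exp(s_{ij})}{\sum_{k=1}^{n_K} \exp(s_{ik})},
  \qquad
  \sum_{j=1}^{n_K} w_{ij}
  \;=\; \frac{\sum_{j=1}^{n_K} \exp(s_{ij})}{\sum_{k=1}^{n_K} \exp(s_{ik})}
  \;=\; 1,
\]

where the denominator is nonzero because exp is strictly positive and \(n_K \neq 0\).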
Masked attention uses hard-mask semantics; its normalization theorem needs the additional
well-formedness assumption that each query row admits at least one unmasked key.
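For contrast, a sketch of why the hard-mask case needs that assumption (the mask predicate
\(m_{ik}\) and index set \(U_i\) are illustrative notation, not names from this development):
the row sum becomes

\[
  \sum_{j \in U_i} \frac{\exp(s_{ij})}{\sum_{k \in U_i} \exp(s_{ik})},
  \qquad U_i = \{\, k \mid m_{ik} \,\},
\]

which equals 1 exactly when \(U_i \neq \emptyset\); an all-masked row makes the denominator
an empty sum, so no normalization statement is possible without the extra hypothesis.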
theorem NN.Proofs.Models.Attention.scaledDotProductAttention_unmasked_weights_row_sum_one
    {nQ nK dModel : ℕ} {hQ : nQ ≠ 0} {hK : nK ≠ 0}
    (ctx : Spec.AttentionContext ℝ nQ nK dModel hQ hK) (i : Fin nQ) :
    have scale := MathFunctions.sqrt ↑dModel;
    have scores := Spec.matMulSpec ctx.Q ctx.K.matrixTransposeSpec;
    have scaledScores := scores.scaleSpec (1 / scale);
    have attentionWeights := Activation.softmaxSpec scaledScores;
    (Spec.get attentionWeights i).sumSpec = 1
In unmasked scaled dot-product attention, each query row of the attention-weight matrix sums to 1.
This is purely a property of Activation.softmaxSpec; it does not depend on the particular
scores or on the scaling.
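As a usage sketch only: the have bindings in the statement are definitionally transparent,
so the theorem can be applied directly to the unfolded goal. The surrounding example is
hypothetical and assumes the project's imports are in scope; only the theorem itself and
the names appearing in its statement come from this file.

example {nQ nK dModel : ℕ} {hQ : nQ ≠ 0} {hK : nK ≠ 0}
    (ctx : Spec.AttentionContext ℝ nQ nK dModel hQ hK) (i : Fin nQ) :
    (Spec.get
        (Activation.softmaxSpec
          ((Spec.matMulSpec ctx.Q ctx.K.matrixTransposeSpec).scaleSpec
            (1 / MathFunctions.sqrt ↑dModel)))
        i).sumSpec = 1 :=
  -- the `have` chain in the theorem statement unfolds to exactly this goal
  NN.Proofs.Models.Attention.scaledDotProductAttention_unmasked_weights_row_sum_one ctx i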