Causal attention mask laws #
This file proves the exact Boolean semantics of TorchLean's causal and future masks and connects those mask facts to the exact hard-masked attention primitive.
TorchLean's main attention spec uses the proof-facing semantics corresponding to
scores.masked_fill(~mask, -torch.inf): blocked entries receive zero softmax numerator, hence zero
attention mass.
References:
- Vaswani et al., “Attention Is All You Need”, 2017.
- PyTorch scaled_dot_product_attention, whose mask interface is the runtime analogue of this spec: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
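The masked_fill semantics above can be sketched in plain Python. The helper hard_masked_softmax below is illustrative only (not a TorchLean or PyTorch name): a blocked entry's softmax numerator is set to exactly zero, which is what exp of the -inf fill value evaluates to.

```python
import math

def hard_masked_softmax(scores, mask):
    """Softmax where blocked entries (mask[j] == False) get numerator 0,
    mirroring scores.masked_fill(~mask, -inf) followed by softmax:
    math.exp(float('-inf')) == 0.0, so blocked entries add nothing."""
    num = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(num)
    return [x / total for x in num]

# The last key is blocked, so its weight is exactly 0.0, not merely tiny.
weights = hard_masked_softmax([1.0, 2.0, 3.0], [True, True, False])
```

The remaining mass renormalizes over the admitted entries, so the weights still sum to one.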
Pointwise access #
The definitions are deliberately simple lower/upper-triangular Boolean tensors, so the access lemmas
are definitional. Keeping them as named [simp] theorems lets larger attention proofs use the mask
without unfolding the tensor constructors each time.
Reading causalMask n at row i, column j returns exactly j ≤ i.
Reading futureMask n at row i, column j returns exactly i < j.
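The two access lemmas pin down the masks as pure index predicates. A minimal Python model of those semantics (the function names are illustrative stand-ins for the Lean definitions):

```python
def causal_mask(n):
    # Lower-triangular: entry (i, j) is True exactly when j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def future_mask(n):
    # Strict upper-triangular: entry (i, j) is True exactly when i < j.
    return [[i < j for j in range(n)] for i in range(n)]

m = causal_mask(3)
# Row 1 may attend to keys 0 and 1, but not to the future key 2.
assert m[1][0] and m[1][1] and not m[1][2]
```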
Elementwise binary maps commute with matrix indexing.
This small tensor lemma is useful for attention proofs because masking is implemented as
map2Spec over the score matrix and the Boolean mask.
Causal blocking and past visibility #
These are the two user-facing laws: causal attention blocks strict future columns and allows every past-or-present column.
A causal mask rejects every strict future key position.
A causal mask admits every past or current key position.
The future mask is the pointwise Boolean complement of the causal mask: futureMask n selects exactly the strict upper triangle that the causal lower triangle excludes.
A future mask rejects every past or current key position.
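All four laws reduce to the index predicates j ≤ i and i < j, so they can be checked exhaustively on small sizes. A sketch, with causal and future as illustrative stand-ins for the pointwise mask reads:

```python
def causal(i, j):
    # causalMask at (i, j): admit past-or-present key positions.
    return j <= i

def future(i, j):
    # futureMask at (i, j): select strict-future key positions.
    return i < j

for n in range(1, 7):
    for i in range(n):
        for j in range(n):
            # Complement law: causal rejects exactly what future selects.
            assert causal(i, j) == (not future(i, j))
            # Causal blocking: every strict-future column is rejected.
            if i < j:
                assert not causal(i, j)
            # Past visibility: every past-or-present column is admitted.
            if j <= i:
                assert causal(i, j)
```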
Exact hard-mask attention weights #
For hard masking, blocked entries are not merely assigned a very small logit. Their softmax numerator is definitionally zero. These lemmas are the attention-level facts needed for causal non-interference proofs.
Any blocked coordinate of a hard-masked softmax vector has exactly zero weight.
Any blocked coordinate of a row-wise hard-masked softmax matrix has exactly zero weight.
In exact hard-masked causal softmax, every strict-future attention weight is exactly zero.
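The attention-level consequence can be checked numerically: with hard masking, every strict-future weight is exactly 0.0 in floating point, not just below some epsilon. A sketch using the same illustrative model as above (a zero numerator for blocked entries, row-wise softmax over a causal mask):

```python
import math

def masked_softmax_row(scores, mask):
    # Hard mask: blocked entries get numerator exactly 0, then renormalize.
    num = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(num)
    return [x / total for x in num]

n = 4
scores = [[float(i * n + j) for j in range(n)] for i in range(n)]
mask = [[j <= i for j in range(n)] for i in range(n)]  # causal lower triangle
weights = [masked_softmax_row(scores[i], mask[i]) for i in range(n)]

# Every strict-future attention weight is exactly zero.
assert all(weights[i][j] == 0.0 for i in range(n) for j in range(n) if i < j)
```

Because the numerator is definitionally zero, no choice of score values can leak mass into a blocked position, which is the fact the causal non-interference proofs rest on.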