# BugZoo: attention-mask semantics
Attention code has its own failure modes: mask polarity, head reshaping, Q/K/V layout, and KV-cache mismatches are easy to get wrong and hard to notice from accuracy tests alone. This file focuses on the mask part, because TorchLean already has a precise theorem stack for it.
Here is the bug-shaped PyTorch pattern we want to rule out:
```python
# Wrong for causal attention: the polarity is flipped. masked_fill writes -inf
# where its mask argument is True, and `mask == False` is True on present/past
# entries, so those get blocked while future tokens are allowed through.
scores = q @ k.transpose(-2, -1) / math.sqrt(d)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(mask == False, -torch.inf), dim=-1)
```
The intended PyTorch version uses true negative infinity on blocked future entries:
```python
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(future, -torch.inf), dim=-1)
# Every strict-future weight is exactly zero: weights[i, j] == 0.0 for all j > i.
assert torch.all(weights.triu(diagonal=1) == 0.0)
```
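To see both behaviors concretely, here is a minimal self-contained contrast (toy sizes; the variable names are illustrative, not TorchLean API):

```python
import math
import torch

T, d = 4, 8
torch.manual_seed(0)
q, k = torch.randn(T, d), torch.randn(T, d)
scores = q @ k.transpose(-2, -1) / math.sqrt(d)
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

buggy = torch.softmax(scores.masked_fill(future == False, -torch.inf), dim=-1)
fixed = torch.softmax(scores.masked_fill(future, -torch.inf), dim=-1)

assert torch.all(fixed.triu(diagonal=1) == 0.0)   # fixed: zero future mass
assert torch.all(buggy[:-1].tril() == 0.0)        # buggy: zero present/past mass
assert torch.isnan(buggy[-1]).all()               # buggy: last row fully masked -> NaN
```

Note that the buggy version's last row is fully masked, so softmax returns NaN across it; this echoes the NaN failure mode reported in PyTorch issue #160064, cited below.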
Lean's ordinary `ℝ` does not contain a literal `-∞`, but mathlib does provide the extended reals `EReal`, where `⊥` is negative infinity and `EReal.exp ⊥ = 0`. We record that exact `-∞` fact first. TorchLean's ordinary tensor softmax then uses the computationally convenient equivalent: blocked logits get a zero numerator before normalization. Both views lead to the exact theorem below.
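In symbols (our notation, with s the scaled score matrix), the zero-numerator view reads:

$$
\text{weights}_{i,j} \;=\; \frac{\mathbf{1}[\,j \le i\,]\,\exp(s_{i,j})}{\sum_{k \le i} \exp(s_{i,k})},
\qquad\text{so}\quad \text{weights}_{i,j} = 0 \ \text{for all}\ j > i.
$$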
References:
- Vaswani et al., “Attention Is All You Need”, NeurIPS 2017. https://arxiv.org/abs/1706.03762
- PyTorch `scaled_dot_product_attention` documentation, for the runtime-style mask interface: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
- PyTorch issue #99282, where `MultiheadAttention(is_causal=True)` was reported ignored when `need_weights=True`: https://github.com/pytorch/pytorch/issues/99282
- PyTorch issue #160064, where fully masked attention heads were reported to produce NaNs when attention weights were requested: https://github.com/pytorch/pytorch/issues/160064
Blocking a logit really means assigning -∞ in the extended-real presentation.
The key -∞ softmax fact: exp(-∞) = 0.
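As a sketch, that fact can be checked against mathlib directly; `EReal.exp_bot` is our guess at the lemma name, so treat it as an assumption:

```lean
import Mathlib

-- Mathlib's extended-real exponential sends ⊥ (negative infinity) to 0.
-- `EReal.exp_bot` is an assumed lemma name; adjust it to your mathlib revision.
example : EReal.exp ⊥ = 0 := EReal.exp_bot
```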
Exact extended-real causal masking of one score-matrix coordinate.
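A hypothetical shape for such a definition, continuing the sketch file started above (`causalMaskE` is our illustrative name, not TorchLean's declaration):

```lean
-- Hypothetical sketch: mask one coordinate of a T×T score matrix exactly,
-- sending strict-future entries (i < j) to ⊥ = -∞ in EReal.
def causalMaskE {T : ℕ} (scores : Fin T → Fin T → ℝ) (i j : Fin T) : EReal :=
  if i < j then ⊥ else (scores i j : EReal)
```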
For a strict-future position, exact causal masking assigns literal -∞.
This is the formal version of the PyTorch operation `scores.masked_fill(future, -torch.inf)` at one matrix coordinate.
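Against the sketch definition above, that statement is immediate:

```lean
-- Strict-future coordinates of the sketch mask are exactly ⊥.
theorem causalMaskE_strictFuture {T : ℕ} (scores : Fin T → Fin T → ℝ)
    {i j : Fin T} (h : i < j) : causalMaskE scores i j = ⊥ := by
  simp [causalMaskE, h]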
Therefore, the strict-future numerator is exactly zero.
This is why TorchLean's attention spec writes this zero numerator directly.
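Chaining the two sketch facts gives the zero numerator (still assuming the `EReal.exp_bot` name):

```lean
-- The strict-future numerator exp(-∞) is exactly 0.
theorem causalMaskE_exp_eq_zero {T : ℕ} (scores : Fin T → Fin T → ℝ)
    {i j : Fin T} (h : i < j) : EReal.exp (causalMaskE scores i j) = 0 := by
  rw [causalMaskE_strictFuture scores h, EReal.exp_bot]
```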
True `-∞` causal attention gets the exact zero-weight theorem.
This is the statement we want for formal output-causality arguments: every strict-future key has
zero attention mass for the current query row. In TorchLean this is represented by
`hardMaskedSoftmaxSpec`, not by a finite real sentinel treated as `-∞`.
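A self-contained sketch of the shape of that statement. `hardMaskedSoftmax` below is our illustrative stand-in, not TorchLean's `hardMaskedSoftmaxSpec`, whose actual signature we do not reproduce here:

```lean
import Mathlib

-- Illustrative hard-masked row softmax: blocked logits contribute a zero
-- numerator before normalization (the ordinary-ℝ view from above).
noncomputable def hardMaskedSoftmax {T : ℕ} (scores : Fin T → Fin T → ℝ)
    (i j : Fin T) : ℝ :=
  (if j ≤ i then Real.exp (scores i j) else 0) /
    ∑ k : Fin T, if k ≤ i then Real.exp (scores i k) else 0

-- Every strict-future key gets exactly zero attention mass: the numerator is
-- the literal 0, so the quotient is 0 whatever the denominator is.
theorem hardMaskedSoftmax_strictFuture {T : ℕ} (scores : Fin T → Fin T → ℝ)
    {i j : Fin T} (h : i < j) : hardMaskedSoftmax scores i j = 0 := by
  unfold hardMaskedSoftmax
  rw [if_neg (not_le.mpr h), zero_div]
```

The point of the extended-real detour is that this zero is definitional, not a rounding artifact of some large negative finite sentinel.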