Permutation Equivariance of Self-Attention (No Positional Encoding) #
Self-attention without positional information is equivariant to permutations of the token axis:
if we reorder the input tokens, the output is reordered in the same way.
This file formalizes that statement for TorchLean’s spec-layer Spec.selfAttention over ℝ.
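To fix ideas, the shape of the claim can be sketched over a simplified tensor model. The following is a standalone illustration assuming only Mathlib; the name `TokenEquivariant` and the `Fin n → Fin d → ℝ` encoding are illustrative and not part of TorchLean, whose actual theorem is stated against Spec.selfAttention and its own tensor type.

```lean
import Mathlib

/-- Illustrative only: modelling a spec tensor as `Fin n → Fin d → ℝ`, a map on
token sequences is permutation-equivariant when permuting the input tokens
permutes the output tokens in the same way. -/
def TokenEquivariant {n d : ℕ}
    (f : (Fin n → Fin d → ℝ) → (Fin n → Fin d → ℝ)) : Prop :=
  ∀ (σ : Equiv.Perm (Fin n)) (x : Fin n → Fin d → ℝ),
    f (fun i => x (σ i)) = fun i => f x (σ i)
```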
The helper reindexing operations below are intentionally proof-local. They describe how this proof
permutes tensor axes, but they are not part of the general Spec.Tensor API; reusable tensor
operations should live under NN.Spec, while model theorems and their proof scaffolding live here.
Token reindexing #
Because spec tensors are functions out of Fin n, a token permutation is just reindexing the
outer axis.
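A minimal sketch of such a helper, again over the simplified `Fin n → Fin d → ℝ` model and with the hypothetical name `permuteTokens` (the actual proof-local helpers may be phrased differently): reindexing is precomposition with the permutation, and reindexing by σ.symm undoes it.

```lean
import Mathlib

/-- Hypothetical helper: permuting tokens is precomposition with `σ` on the outer axis. -/
def permuteTokens {n d : ℕ} (σ : Equiv.Perm (Fin n))
    (x : Fin n → Fin d → ℝ) : Fin n → Fin d → ℝ :=
  fun i => x (σ i)

-- Reindexing by `σ` is undone by reindexing by `σ.symm`.
example {n d : ℕ} (σ : Equiv.Perm (Fin n)) (x : Fin n → Fin d → ℝ) :
    permuteTokens σ.symm (permuteTokens σ x) = x := by
  funext i
  simp [permuteTokens, Equiv.apply_symm_apply]
```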
Softmax equivariance #
TorchLean’s spec softmax on vectors is implemented in a numerically stabilized form (x ↦ exp(x - m) / Σ exp(x - m)),
but over ℝ it agrees with the plain exp(x) / Σ exp(x) formula: the shift contributes a common factor exp(-m) to the
numerator and to every term of the denominator, and that factor cancels. This lets us prove permutation
equivariance without reasoning about how the stabilizing shift m is chosen.
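The cancellation can be checked in isolation. The following standalone sketch, stated directly with `Real.exp` and Mathlib rather than TorchLean's Spec softmax, shows that the shift m drops out of the coordinate formula:

```lean
import Mathlib

-- Illustrative only (not TorchLean's actual `Spec` lemma): the stabilizing shift `m`
-- cancels, so the stabilized coordinate formula agrees with the plain softmax over `ℝ`.
example {n : ℕ} (x : Fin n → ℝ) (m : ℝ) (i : Fin n) :
    Real.exp (x i - m) / ∑ j, Real.exp (x j - m)
      = Real.exp (x i) / ∑ j, Real.exp (x j) := by
  -- Both denominators are sums of positive terms over the nonempty index type `Fin n`.
  have h₁ : (0 : ℝ) < ∑ j, Real.exp (x j - m) :=
    Finset.sum_pos (fun j _ => Real.exp_pos _) ⟨i, Finset.mem_univ i⟩
  have h₂ : (0 : ℝ) < ∑ j, Real.exp (x j) :=
    Finset.sum_pos (fun j _ => Real.exp_pos _) ⟨i, Finset.mem_univ i⟩
  -- Cross-multiply and compare the two sums term by term.
  rw [div_eq_div_iff h₁.ne' h₂.ne']
  simp only [Finset.mul_sum]
  refine Finset.sum_congr rfl fun j _ => ?_
  -- `exp (xᵢ - m) * exp xⱼ = exp xᵢ * exp (xⱼ - m)`: both equal `exp (xᵢ + xⱼ - m)`.
  rw [← Real.exp_add, ← Real.exp_add]
  have harg : x i - m + x j = x i + (x j - m) := by ring
  rw [harg]
```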