Spec-level gradient identities for the linear layer #
This file states (and in several cases, proves by definitional unfolding) the “obvious” gradient formulas for a linear layer:
y = W x + b
namely:
- ∂L/∂W = δ ⊗ x (outer product),
- ∂L/∂x = Wᵀ δ (matrix-vector multiply), and
- ∂L/∂b = δ.
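Throughout, δ denotes the upstream gradient ∂L/∂y. In coordinate form (the index notation here is only illustrative, not TorchLean notation), the forward map and the chain-rule pattern that each identity instantiates are:

yᵢ = Σⱼ Wᵢⱼ xⱼ + bᵢ,   δᵢ = ∂L/∂yᵢ,   and, for any parameter θ,   ∂L/∂θ = Σᵢ δᵢ · (∂yᵢ/∂θ).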
What these theorems are (and are not) #
- These are spec-level identities over TorchLean’s tensor encodings, not a full calculus layer in terms of Fréchet derivatives.
- Several proofs are `rfl` after unfolding definitions, because the corresponding specs are implemented directly in that form (see the sketch after this list).
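A minimal sketch of why such a proof can be `rfl`, using a toy encoding (the names `Vec`, `Mat`, `outer`, and `linearWeightGradSpec` are hypothetical, not TorchLean’s actual API):

```lean
-- Toy encoding: vectors and matrices as functions, not TorchLean's `Tensor`.
def Vec (n : Nat) := Fin n → Float
def Mat (m n : Nat) := Fin m → Fin n → Float

-- Outer product: (outer δ x) i j = δ i * x j.
def outer {m n : Nat} (δ : Vec m) (x : Vec n) : Mat m n :=
  fun i j => δ i * x j

-- A backward "spec" that is literally defined as the outer product …
def linearWeightGradSpec {m n : Nat} (δ : Vec m) (x : Vec n) : Mat m n :=
  outer δ x

-- … makes the corresponding identity hold by definitional unfolding.
theorem linearWeightGradSpec_eq_outer {m n : Nat} (δ : Vec m) (x : Vec n) :
    linearWeightGradSpec δ x = outer δ x := rfl
```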
PyTorch correspondence / citations #
- `torch.nn.Linear` / `torch.nn.functional.linear` implement `y = x Wᵀ + b` with the weight stored as shape `(out_features, in_features)` (so the math “matrix” is W with output rows). TorchLean’s `LinearSpec` follows the same convention: `weights : Tensor α (.dim outDim (.dim inDim .scalar))`; see the conversion below.
  https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
  https://pytorch.org/docs/stable/generated/torch.nn.functional.linear.html
- The “outer product” view of the weight gradient corresponds to the common vector formula `grad_W = δ ⊗ x` (PyTorch has `torch.outer` for vectors).
  https://pytorch.org/docs/stable/generated/torch.outer.html
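For a single sample the two conventions agree up to transposition; with x a column vector of length in_features and W of shape (out_features, in_features):

y = W x + b   ⟺   yᵀ = xᵀ Wᵀ + bᵀ,

which is PyTorch’s y = x Wᵀ + b once x is stored as a row (or as a batch of rows).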
Why keep this file #
Even when proofs are definitional, having them recorded explicitly helps:
- document the intended math semantics of the “backward specs”,
- provide simple regression checks when refactoring tensor encodings, and
- serve as stepping stones for the more advanced autograd soundness proofs in `NN/Proofs/Autograd/*`.
References #
- Standard matrix calculus / backpropagation identities; no single source is required.
Spec identity: weight gradient for a linear layer.
For y = W x + b, if δ = ∂L/∂y then the weight gradient is
∂L/∂W = δ ⊗ x.
PyTorch mental model: this is the per-sample formula whose batched version becomes a matmul against the input batch.
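Concretely, with [k = i] an Iverson bracket (this index-level derivation is illustrative, not TorchLean notation):

∂yₖ/∂Wᵢⱼ = [k = i] · xⱼ,   so   ∂L/∂Wᵢⱼ = Σₖ δₖ · ∂yₖ/∂Wᵢⱼ = δᵢ xⱼ,   i.e. ∂L/∂W = δ ⊗ x.

For a batch whose samples xᵇ and upstream gradients δᵇ are stacked as rows of X and Δ (hypothetical batch notation), summing the per-sample formula gives grad_W = Σ (over samples b) δᵇ ⊗ xᵇ = Δᵀ X, which is the matmul mentioned above.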
Spec identity: input gradient for a linear layer.
For y = W x + b, the input gradient is
∂L/∂x = Wᵀ δ.
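In index form (illustrative notation):

∂yᵢ/∂xⱼ = Wᵢⱼ,   so   ∂L/∂xⱼ = Σᵢ δᵢ Wᵢⱼ = (Wᵀ δ)ⱼ.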
Spec identity: bias gradient for a linear layer.
For y = W x + b, the bias gradient is ∂L/∂b = δ.
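In index form (illustrative notation, with [k = i] an Iverson bracket):

∂yₖ/∂bᵢ = [k = i],   so   ∂L/∂bᵢ = Σₖ δₖ · [k = i] = δᵢ.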
Shape/typing sanity check: all backward specs return tensors of the expected shapes.
This comes for free from Lean’s dependent types, but it is sometimes convenient to have recorded as a lemma when writing documentation-style proofs.
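A minimal sketch of how a shape-indexed encoding makes the check free (the `Dims`/`TensorF` definitions below are hypothetical and only mimic the shape of TorchLean’s `Tensor α (.dim outDim (.dim inDim .scalar))` convention):

```lean
-- Hypothetical shape-indexed encoding; TorchLean's actual types may differ.
inductive Dims where
  | scalar : Dims
  | dim : Nat → Dims → Dims

-- A tensor indexed by its shape: a `.dim n ds` tensor is `n` tensors of shape `ds`.
def TensorF : Dims → Type
  | .scalar   => Float
  | .dim n ds => Fin n → TensorF ds

-- A bias-gradient spec with this signature is shape-correct by construction:
-- the return type forces a length-`outDim` vector.
def biasGradSpec {outDim : Nat} (δ : TensorF (.dim outDim .scalar)) :
    TensorF (.dim outDim .scalar) :=
  δ
```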
Mathematical correctness theorem: the linear-layer gradients satisfy the chain rule. This formalizes the core mathematical property validating the backward implementation.
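In composite form (standard vector calculus, not a verbatim statement of the Lean theorem): writing f(x) = W x + b and δ = (∇L)(f(x)), the chain rule gives

∇_x (L ∘ f) = Wᵀ δ,   ∇_W (L ∘ f) = δ ⊗ x,   ∇_b (L ∘ f) = δ,

so the backward specs return exactly the gradients the chain rule prescribes for the composite loss.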