MlpMse #
End-to-end analytic soundness for a first training target:
- 2-layer MLP (Linear → ReLU → Linear)
- MSE loss
We prove that the usual backprop formulas (including parameter gradients) coincide with the
adjoint of the Fréchet derivative (fderiv) over ℝ.
Notes:
- This is a spec-level (ℝ) theorem.
- ReLU is not differentiable at 0, so we assume a "no kinks" hypothesis on the pre-activation.
- Reverse-mode is naturally stated as a VJP theorem for vector outputs; scalar loss gradients are
  obtained by choosing `δ = 1` (equivalently `VJP[loss, y] 1 = ∇ loss y`), as sketched below.
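As an informal sketch of what these statements say (writing `†` for the adjoint of a continuous linear map; `VJP[f, x]` is shorthand here, not the exact name used in the formal development):

```latex
\mathrm{VJP}[f, x]\ \delta \;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\ \delta,
\qquad
\nabla(\mathrm{loss} \circ f)(x) \;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\ \nabla\mathrm{loss}\,(f\,x)
\;=\; \mathrm{VJP}[\mathrm{loss} \circ f,\ x]\ 1 .
```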
PyTorch correspondence / citations #
- MLP building blocks: `torch.nn.Linear`, `torch.nn.functional.relu`.
  https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
  https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html
- MSE loss reference behavior: https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
- “Backward equals adjoint of derivative” is exactly the Jacobian-transpose theorem for PyTorch autograd: https://pytorch.org/docs/stable/autograd.html
Convert a matrix-shaped tensor W : Tensor ℝ (m×n) into the Hilbert-space matrix type Mat m n.
This is used for parameter-gradient proofs, where the parameter space is a Hilbert space with the Frobenius/L2 inner product.
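For reference, the Frobenius/L2 inner product on `Mat m n` (the usual identification of matrices with long vectors) is:

```latex
\langle A, B\rangle_F \;=\; \sum_{i,j} A_{ij}\,B_{ij} \;=\; \operatorname{tr}\!\big(A^{\top} B\big).
```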
toMatE agrees with the coordinate-level view tensorToMatrix used by the FDeriv core.
Coordinate formula for reluDerivCLM: it scales each coordinate by relu'(xᵢ).
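Concretely, the claimed coordinate formula is the following (the value assigned to `relu'` at 0 is a project convention and is irrelevant under the “no kinks” hypothesis):

```latex
\big(\mathrm{reluDerivCLM}\ x\ \delta\big)_i \;=\; \mathrm{relu}'(x_i)\,\delta_i,
\qquad
\mathrm{relu}'(t) = \begin{cases} 1 & t > 0 \\ 0 & t < 0 . \end{cases}
```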
The ReLU derivative map is self-adjoint w.r.t. the Euclidean inner product, because it is a diagonal scaling map on ℝⁿ.
The adjoint of reluDerivCLM equals itself (self-adjoint operator).
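The self-adjointness is the usual one-line computation for a diagonal scaling:

```latex
\big\langle \mathrm{reluDerivCLM}\ x\ \delta,\ v\big\rangle
\;=\; \sum_i \mathrm{relu}'(x_i)\,\delta_i\,v_i
\;=\; \big\langle \delta,\ \mathrm{reluDerivCLM}\ x\ v\big\rangle .
```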
The spec-level “input derivative” for a linear layer agrees with the Euclidean adjoint.
In words: the tensor expression for ∂(W x)/∂x applied to an upstream δ is Wᵀ δ, and this is
exactly the adjoint of the CLM x ↦ W x.
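The identity behind this is the familiar transpose rule for the Euclidean inner product:

```latex
\langle W x,\ \delta\rangle
\;=\; \sum_i \Big(\sum_j W_{ij}\,x_j\Big)\delta_i
\;=\; \sum_j x_j \Big(\sum_i W_{ij}\,\delta_i\Big)
\;=\; \langle x,\ W^{\top}\delta\rangle .
```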
2-layer MLP in Euclidean Vec form, parameterized by Mat weights and Vec biases.
This is the same computation as NN.Proofs.Autograd.FDeriv.Core’s mlpVec, but set up for parameter-gradient
proofs where the parameter space is a Hilbert space.
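As a sketch of the intended computation (argument order is illustrative, and the second-layer bias `b2` is shown only for concreteness; it does not affect the gradients stated below):

```latex
\mathrm{mlpVecMat}\ W_1\ b_1\ W_2\ b_2\ x \;=\; W_2\,\mathrm{relu}(W_1 x + b_1) + b_2 .
```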
Convenience lemma: adjoint of the derivative of a scalar composition, applied to seed 1.
For scalar loss g ∘ f, this is the reverse-mode “chain rule” in adjoint form.
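In symbols, for `f : ℝⁿ → ℝᵐ` and a scalar map `g : ℝᵐ → ℝ` (both differentiable at the relevant points):

```latex
\big(\mathrm{fderiv}\ \mathbb{R}\ (g \circ f)\ x\big)^{\dagger}\,1
\;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\Big(\big(\mathrm{fderiv}\ \mathbb{R}\ g\ (f\,x)\big)^{\dagger}\,1\Big)
\;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\,\nabla g\,(f\,x).
```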
We name the intermediate activations so the gradient statements read like textbook backprop:
z1 (pre-activation), a1 (post-ReLU activation), and y (network output).
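With these names (and `b2` as in the sketch above), the forward pass and the scalar loss read:

```latex
z_1 = W_1 x + b_1, \qquad a_1 = \mathrm{relu}(z_1), \qquad y = W_2\, a_1 + b_2, \qquad L = \mathrm{mse}\ t\ y .
```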
Fréchet derivative of the network output with respect to the second-layer weights W2.
Informally: ∂y/∂W2 is the linear map dW2 ↦ dW2 a1.
Closed-form gradient of the scalar loss mse t (mlpVecMat …) with respect to W2.
Result: ∂L/∂W2 = (∂L/∂y) ⊗ a1, i.e. the outer product of the output gradient and the hidden activation.
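A sketch of the adjoint step; here `∇_y L` denotes the gradient of `mse t` at `y` (e.g. `2 (y − t)` for an unnormalized squared error; a mean or ½ factor only rescales the constant):

```latex
\nabla_{W_2} L
\;=\; \big(dW_2 \mapsto dW_2\, a_1\big)^{\dagger}\,\nabla_y L
\;=\; (\nabla_y L)\otimes a_1,
\qquad
\big(\nabla_{W_2} L\big)_{ij} = (\nabla_y L)_i\,(a_1)_j .
```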
Fréchet derivative of the network output with respect to the first-layer bias b1,
under the ReLU “no kinks” hypothesis.
Informally: ∂y/∂b1 = W2 ∘ ReLU'(z1).
Closed-form gradient of the scalar loss with respect to the first-layer bias b1
(under the ReLU “no kinks” hypothesis).
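Unfolding the adjoint of `W2 ∘ ReLU'(z1)` gives the textbook formula (`⊙` is the coordinatewise product, `∇_y L` the output gradient as above):

```latex
\nabla_{b_1} L \;=\; \mathrm{relu}'(z_1)\odot\big(W_2^{\top}\,\nabla_y L\big).
```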
Fréchet derivative of the network output with respect to the input x,
under the ReLU “no kinks” hypothesis.
Informally: ∂y/∂x = W2 ∘ ReLU'(z1) ∘ W1.
Closed-form gradient of the scalar loss with respect to the input x,
under the ReLU “no kinks” hypothesis.
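Taking the adjoint of `W2 ∘ ReLU'(z1) ∘ W1` factor by factor, in reverse order:

```latex
\nabla_{x} L \;=\; W_1^{\top}\Big(\mathrm{relu}'(z_1)\odot\big(W_2^{\top}\,\nabla_y L\big)\Big).
```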
Fréchet derivative of the network output with respect to the first-layer weights W1,
under the ReLU “no kinks” hypothesis.
The derivative is linear in W1 through the slice dW1 ↦ dW1 x, then propagated through
ReLU' and W2.
Closed-form gradient of the scalar loss with respect to W1
(under the ReLU “no kinks” hypothesis).
The result has the expected “outer product” form, built from the backpropagated hidden gradient and the input x.
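Writing `δ1 := relu'(z1) ⊙ (W2ᵀ ∇_y L)` for the backpropagated hidden gradient, the claimed closed form is:

```latex
\nabla_{W_1} L \;=\; \delta_1 \otimes x,
\qquad
\big(\nabla_{W_1} L\big)_{ij} = (\delta_1)_i\, x_j .
```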