MlpMse #
End-to-end analytic soundness for a first training target:
- 2-layer MLP (Linear → ReLU → Linear)
- MSE loss
We prove that the usual backprop formulas (including parameter gradients) coincide with the
adjoint of the Fréchet derivative (fderiv) over ℝ.
Notes:
- This is a spec-level (ℝ) theorem.
- ReLU is not differentiable at 0, so we assume a "no kinks" hypothesis on the pre-activation.
- Reverse-mode is naturally stated as a VJP theorem for vector outputs; scalar loss gradients are
  obtained by choosing `δ = 1` (equivalently `VJP[loss, y] 1 = ∇ loss y`), as sketched below.
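As an informal sketch of what these statements say (writing `†` for the adjoint of a continuous linear map; `VJP[f, x]` is shorthand here, not the exact name used in the formal development):

```latex
\mathrm{VJP}[f, x]\ \delta \;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\ \delta,
\qquad
\nabla(\mathrm{loss} \circ f)(x) \;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\ \nabla\mathrm{loss}\,(f\,x)
\;=\; \mathrm{VJP}[\mathrm{loss} \circ f,\ x]\ 1 .
```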
PyTorch correspondence / citations #
- MLP building blocks: `torch.nn.Linear`, `torch.nn.functional.relu`.
  https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
  https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html
- MSE loss reference behavior: https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
- “Backward equals adjoint of derivative” is exactly the Jacobian-transpose theorem for PyTorch autograd: https://pytorch.org/docs/stable/autograd.html
Convert a matrix-shaped tensor W : Tensor ℝ (m×n) into the Hilbert-space matrix type Mat m n.
This is used for parameter-gradient proofs, where the parameter space is a Hilbert space with the Frobenius/L2 inner product.
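For reference, the Frobenius/L2 inner product on `Mat m n` (the usual identification of matrices with long vectors) is:

```latex
\langle A, B\rangle_F \;=\; \sum_{i,j} A_{ij}\,B_{ij} \;=\; \operatorname{tr}\!\big(A^{\top} B\big).
```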
toMatE agrees with the coordinate-level view tensorToMatrix used by the FDeriv core.
Coordinate formula for reluDerivCLM: it scales each coordinate by relu'(xᵢ).
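Concretely, the claimed coordinate formula is the following (the value assigned to `relu'` at 0 is a project convention and is irrelevant under the “no kinks” hypothesis):

```latex
\big(\mathrm{reluDerivCLM}\ x\ \delta\big)_i \;=\; \mathrm{relu}'(x_i)\,\delta_i,
\qquad
\mathrm{relu}'(t) = \begin{cases} 1 & t > 0 \\ 0 & t < 0 . \end{cases}
```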
The ReLU derivative map is self-adjoint w.r.t. the Euclidean inner product, because it is a diagonal scaling map on ℝⁿ.
The adjoint of reluDerivCLM equals itself (self-adjoint operator).
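The self-adjointness is the usual one-line computation for a diagonal scaling:

```latex
\big\langle \mathrm{reluDerivCLM}\ x\ \delta,\ v\big\rangle
\;=\; \sum_i \mathrm{relu}'(x_i)\,\delta_i\,v_i
\;=\; \big\langle \delta,\ \mathrm{reluDerivCLM}\ x\ v\big\rangle .
```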
The spec-level “input derivative” for a linear layer agrees with the Euclidean adjoint.
In words: the tensor expression for ∂(W x)/∂x applied to an upstream δ is Wᵀ δ, and this is
exactly the adjoint of the CLM x ↦ W x.
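The identity behind this is the familiar transpose rule for the Euclidean inner product:

```latex
\langle W x,\ \delta\rangle
\;=\; \sum_i \Big(\sum_j W_{ij}\,x_j\Big)\delta_i
\;=\; \sum_j x_j \Big(\sum_i W_{ij}\,\delta_i\Big)
\;=\; \langle x,\ W^{\top}\delta\rangle .
```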
2-layer MLP in Euclidean Vec form, parameterized by Mat weights and Vec biases.
This is the same computation as NN.Proofs.Autograd.FDeriv.Core’s mlpVec, but set up for parameter-gradient
proofs where the parameter space is a Hilbert space.
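As a sketch of the intended computation (argument order is illustrative, and the second-layer bias `b2` is shown only for concreteness; it does not affect the gradients stated below):

```latex
\mathrm{mlpVecMat}\ W_1\ b_1\ W_2\ b_2\ x \;=\; W_2\,\mathrm{relu}(W_1 x + b_1) + b_2 .
```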
Convenience lemma: adjoint of the derivative of a scalar composition, applied to seed 1.
For scalar loss g ∘ f, this is the reverse-mode “chain rule” in adjoint form.
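In symbols, for `f : ℝⁿ → ℝᵐ` and a scalar map `g : ℝᵐ → ℝ` (both differentiable at the relevant points):

```latex
\big(\mathrm{fderiv}\ \mathbb{R}\ (g \circ f)\ x\big)^{\dagger}\,1
\;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\Big(\big(\mathrm{fderiv}\ \mathbb{R}\ g\ (f\,x)\big)^{\dagger}\,1\Big)
\;=\; \big(\mathrm{fderiv}\ \mathbb{R}\ f\ x\big)^{\dagger}\,\nabla g\,(f\,x).
```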
We name the intermediate activations so the gradient statements read like textbook backprop:
z1 (pre-activation), a1 (post-ReLU activation), and y (network output).
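With these names (and `b2` as in the sketch above), the forward pass and the scalar loss read:

```latex
z_1 = W_1 x + b_1, \qquad a_1 = \mathrm{relu}(z_1), \qquad y = W_2\, a_1 + b_2, \qquad L = \mathrm{mse}\ t\ y .
```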
Fréchet derivative of the network output with respect to the second-layer weights W2.
Informally: ∂y/∂W2 is the linear map dW2 ↦ dW2 a1.
Closed-form gradient of the scalar loss mse t (mlpVecMat …) with respect to W2.
Result: ∂L/∂W2 = (∂L/∂y) ⊗ a1, i.e. the outer product of the output gradient and the hidden activation.
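A sketch of the adjoint step; here `∇_y L` denotes the gradient of `mse t` at `y` (e.g. `2 (y − t)` for an unnormalized squared error; a mean or ½ factor only rescales the constant):

```latex
\nabla_{W_2} L
\;=\; \big(dW_2 \mapsto dW_2\, a_1\big)^{\dagger}\,\nabla_y L
\;=\; (\nabla_y L)\otimes a_1,
\qquad
\big(\nabla_{W_2} L\big)_{ij} = (\nabla_y L)_i\,(a_1)_j .
```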
Fréchet derivative of the network output with respect to the first-layer bias b1,
under the ReLU “no kinks” hypothesis.
Informally: ∂y/∂b1 = W2 ∘ ReLU'(z1).
Closed-form gradient of the scalar loss with respect to the first-layer bias b1
(under the ReLU “no kinks” hypothesis).
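Unfolding the adjoint of `W2 ∘ ReLU'(z1)` gives the textbook formula (`⊙` is the coordinatewise product, `∇_y L` the output gradient as above):

```latex
\nabla_{b_1} L \;=\; \mathrm{relu}'(z_1)\odot\big(W_2^{\top}\,\nabla_y L\big).
```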
Fréchet derivative of the network output with respect to the input x,
under the ReLU “no kinks” hypothesis.
Informally: ∂y/∂x = W2 ∘ ReLU'(z1) ∘ W1.
Closed-form gradient of the scalar loss with respect to the input x,
under the ReLU “no kinks” hypothesis.
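Taking the adjoint of `W2 ∘ ReLU'(z1) ∘ W1` factor by factor, in reverse order:

```latex
\nabla_{x} L \;=\; W_1^{\top}\Big(\mathrm{relu}'(z_1)\odot\big(W_2^{\top}\,\nabla_y L\big)\Big).
```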
Fréchet derivative of the network output with respect to the first-layer weights W1,
under the ReLU “no kinks” hypothesis.
The derivative is linear in W1 through the slice dW1 ↦ dW1 x, then propagated through
ReLU' and W2.
Closed-form gradient of the scalar loss with respect to W1
(under the ReLU “no kinks” hypothesis).
The result has the expected “outer product” form, built from the backpropagated hidden gradient and the input x.
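Writing `δ1 := relu'(z1) ⊙ (W2ᵀ ∇_y L)` for the backpropagated hidden gradient, the claimed closed form is:

```latex
\nabla_{W_1} L \;=\; \delta_1 \otimes x,
\qquad
\big(\nabla_{W_1} L\big)_{ij} = (\delta_1)_i\, x_j .
```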