TorchLean API

NN.Proofs.Autograd.FDeriv.MlpMse

MlpMse #

End-to-end analytic soundness for a first training target:

We prove that the usual backprop formulas (including parameter gradients) coincide with the adjoint of the Fréchet derivative (fderiv) over ℝ.

Notes:

PyTorch correspondence / citations #

Convert a matrix-shaped tensor W : Tensor ℝ (m×n) into the Hilbert-space matrix type Mat m n.

This is used for parameter-gradient proofs, where the parameter space is a Hilbert space with the Frobenius/L2 inner product.

Instances For

    toMatE agrees with the coordinate-level view tensorToMatrix used by the FDeriv core.

    theorem Proofs.Autograd.reluDerivCLM_apply {n : ℕ} (x dx : Vec n) (i : Fin n) :

    Coordinate formula for reluDerivCLM: it scales each coordinate by relu'(xᵢ).

    theorem Proofs.Autograd.reluDerivCLM_inner {n : ℕ} (x dx δ : Vec n) :
    inner ((reluDerivCLM x) dx) δ = inner dx ((reluDerivCLM x) δ)

    ReLU derivative map is self-adjoint w.r.t. the Euclidean inner product.

    This is because it is a diagonal scaling map on ℝⁿ.

    The adjoint of reluDerivCLM equals itself (self-adjoint operator).

    The spec-level “input derivative” for a linear layer agrees with the Euclidean adjoint.

    In words: the tensor expression for ∂(W x)/∂x applied to an upstream δ is Wᵀ δ, and this is exactly the adjoint of the CLM x ↦ W x.
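    In equation form (the standard adjoint identity, stated informally and not part of the formal statement): for all x and δ,

    ⟪W x, δ⟫ = ⟪x, Wᵀ δ⟫,

    so the adjoint of the CLM x ↦ W x is exactly δ ↦ Wᵀ δ.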

    noncomputable def Proofs.Autograd.affineMat {inDim outDim : ℕ} (W : Mat outDim inDim) (b : Vec outDim) :
    Vec inDim → Vec outDim

    Affine map using Mat parameters (same as affine, but with Mat instead of Matrix).

    Instances For
      noncomputable def Proofs.Autograd.mlpVecMat {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) :
      Vec inDim → Vec outDim

      2-layer MLP in Euclidean Vec form, parameterized by Mat weights and Vec biases.

      This is the same computation as NN.Proofs.Autograd.FDeriv.Core’s mlpVec, but set up for parameter-gradient proofs where the parameter space is a Hilbert space.

      Instances For
        noncomputable def Proofs.Autograd.mse {n : ℕ} (t : Vec n) :
        Vec n → ℝ

        Mean-squared error loss (MSE) against a fixed target t:

        mse t y = (1/n) * ‖y - t‖².

        Instances For
          noncomputable def Proofs.Autograd.mseGrad {n : ℕ} (y t : Vec n) :
          Vec n

          Gradient of MSE with respect to y:

          ∇_y mse(t)(y) = (2/n) * (y - t).

          Instances For
            theorem Proofs.Autograd.hasFDerivAt_mse {n : ℕ} (t y : Vec n) :
            HasFDerivAt (mse t) ((2 / n) • (innerSL ℝ) (y - t)) y

            Fréchet derivative of MSE, packaged as a continuous linear map Vec n →L[ℝ] ℝ.
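            For orientation (informal derivation, not part of the formal development): expanding the squared norm for a perturbation dy gives

            mse t (y + dy) = (1/n) * ‖(y - t) + dy‖²
                           = mse t y + (2/n) * ⟪y - t, dy⟫ + (1/n) * ‖dy‖²,

            so the derivative at y is the continuous linear map dy ↦ (2/n) * ⟪y - t, dy⟫, whose Riesz representative is (2/n) * (y - t) = mseGrad y t.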

            theorem Proofs.Autograd.mseGrad_eq_adjoint_fderiv {n : ℕ} (t y : Vec n) :
            (vjp (mse t) y) 1 = mseGrad y t

            The VJP of MSE at y with upstream seed 1 equals the usual gradient mseGrad y t.

            This is the scalar-loss specialization: for scalar loss ℓ, the gradient is (fderiv ℓ)† 1.
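            A minimal usage sketch (hypothetical: it assumes this module is imported and the Proofs.Autograd namespace is open, and follows the argument order of the signature above):

            import NN.Proofs.Autograd.FDeriv.MlpMse
            open Proofs.Autograd

            -- The vector-Jacobian product of MSE with upstream seed 1 is the textbook gradient.
            example {n : ℕ} (t y : Vec n) : (vjp (mse t) y) 1 = mseGrad y t :=
              mseGrad_eq_adjoint_fderiv t y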

            theorem Proofs.Autograd.adjoint_fderiv_comp_apply_one {E F : Type} [NormedAddCommGroup E] [InnerProductSpace ℝ E] [CompleteSpace E] [NormedAddCommGroup F] [InnerProductSpace ℝ F] [CompleteSpace F] {f : E → F} {g : F → ℝ} {f' : E →L[ℝ] F} {g' : F →L[ℝ] ℝ} {x : E} (hf : HasFDerivAt f f' x) (hg : HasFDerivAt g g' (f x)) :

            Convenience lemma: adjoint of the derivative of a scalar composition, applied to seed 1.

            For scalar loss g ∘ f, this is the reverse-mode “chain rule” in adjoint form.
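            Informally, this packages two standard facts: the chain rule fderiv (g ∘ f) x = g' ∘L f', and the adjoint identity (A ∘L B)† = B† ∘L A†, which together give

            (fderiv (g ∘ f) x)† 1 = f'† (g'† 1),

            i.e. the gradient of the scalar loss g ∘ f is obtained by pulling the gradient of g at f x back through the adjoint of the derivative of f.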

            We name the intermediate activations so the gradient statements read like textbook backprop: z1 (pre-activation), a1 (post-ReLU activation), and y (network output).
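            Concretely, the forward pass reads (informally, with relu applied coordinatewise):

            z1 = W1 x + b1,    a1 = relu z1,    y = W2 a1 + b2,    L = mse t y.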

            noncomputable def Proofs.Autograd.z1 {inDim hidDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (x : Vec inDim) :
            Vec hidDim

            Pre-activation (used for the ReLU differentiability hypothesis).

            Instances For
              noncomputable def Proofs.Autograd.a1 {inDim hidDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (x : Vec inDim) :
              Vec hidDim

              Hidden activation a1 = relu(z1).

              Instances For
                noncomputable def Proofs.Autograd.y {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) :
                Vec outDim

                Network output y = mlpVecMat W1 b1 W2 b2 x.

                Instances For
                  theorem Proofs.Autograd.hasFDerivAt_mlp_wrt_W2 {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) :
                  HasFDerivAt (fun (W2 : Mat outDim hidDim) => mlpVecMat W1 b1 W2 b2 x) (matApplyLin (a1 W1 b1 x)) W2

                  Fréchet derivative of the network output with respect to the second-layer weights W2.

                  Informally: ∂y/∂W2 is the linear map dW2 ↦ dW2 a1.
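                  Why this form (informal sketch): writing a1 := a1 W1 b1 x, which does not depend on W2,

                  mlpVecMat W1 b1 (W2 + dW2) b2 x = (W2 + dW2) a1 + b2 = mlpVecMat W1 b1 W2 b2 x + dW2 a1,

                  so the output is affine in W2 and its derivative is dW2 ↦ dW2 a1, i.e. matApplyLin (a1 W1 b1 x).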

                  theorem Proofs.Autograd.grad_W2_mse {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (t : Vec outDim) :
                  (ContinuousLinearMap.adjoint (fderiv (fun (W2 : Mat outDim hidDim) => mse t (mlpVecMat W1 b1 W2 b2 x)) W2)) 1 = outer (mseGrad (y W1 b1 W2 b2 x) t) (a1 W1 b1 x)

                  Closed-form gradient of scalar loss mse t (mlpVecMat …) with respect to W2.

                  Result: ∂L/∂W2 = (∂L/∂y) ⊗ a1, i.e. outer product of the output gradient and hidden activation.
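                  The outer-product form comes from the Frobenius inner product on Mat (informal sketch): for the linear slice dW2 ↦ dW2 a1 and any upstream δ,

                  ⟪dW2 a1, δ⟫ = ⟪dW2, δ ⊗ a1⟫_F,

                  so the adjoint sends δ to the outer product δ ⊗ a1; applying it to the seed ∂L/∂y = mseGrad (y W1 b1 W2 b2 x) t gives the stated right-hand side.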

                  theorem Proofs.Autograd.grad_b2_mse {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (t : Vec outDim) :
                  (ContinuousLinearMap.adjoint (fderiv (fun (b2 : Vec outDim) => mse t (mlpVecMat W1 b1 W2 b2 x)) b2)) 1 = mseGrad (mlpVecMat W1 b1 W2 b2 x) t

                  Closed-form gradient of the scalar loss with respect to the second-layer bias b2.

                  Result: ∂L/∂b2 = ∂L/∂y.
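                  Informally: the output is affine in b2 with derivative the identity map, so the adjoint is also the identity and the upstream seed ∂L/∂y = mseGrad (mlpVecMat W1 b1 W2 b2 x) t passes through unchanged.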

                  theorem Proofs.Autograd.hasFDerivAt_mlp_wrt_b1 {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (hx : ∀ (i : Fin hidDim), (z1 W1 b1 x).ofLp i ≠ 0) :
                  HasFDerivAt (fun (b1 : Vec hidDim) => mlpVecMat W1 b1 W2 b2 x) ((matCLM (toMatrix W2)).comp (reluDerivCLM (z1 W1 b1 x))) b1

                  Fréchet derivative of the network output with respect to the first-layer bias b1, under the ReLU “no kinks” hypothesis.

                  Informally: ∂y/∂b1 = W2 ∘ ReLU'(z1).

                  theorem Proofs.Autograd.grad_b1_mse {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (t : Vec outDim) (hx : ∀ (i : Fin hidDim), (z1 W1 b1 x).ofLp i ≠ 0) :
                  (ContinuousLinearMap.adjoint (fderiv (fun (b1 : Vec hidDim) => mse t (mlpVecMat W1 b1 W2 b2 x)) b1)) 1 = (reluDerivCLM (z1 W1 b1 x)) ((ContinuousLinearMap.adjoint (matCLM (toMatrix W2))) (mseGrad (y W1 b1 W2 b2 x) t))

                  Closed-form gradient of the scalar loss with respect to the first-layer bias b1 (under the ReLU “no kinks” hypothesis).
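                  In coordinates (informal reading of the right-hand side, with δ_y := mseGrad (y W1 b1 W2 b2 x) t):

                  ∂L/∂b1 = relu'(z1) ⊙ (W2ᵀ δ_y),

                  i.e. the upstream gradient is pulled back through W2ᵀ and then masked coordinatewise by relu'(z1).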

                  theorem Proofs.Autograd.hasFDerivAt_mlp_wrt_x {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (hx : ∀ (i : Fin hidDim), (z1 W1 b1 x).ofLp i ≠ 0) :
                  HasFDerivAt (mlpVecMat W1 b1 W2 b2) ((matCLM (toMatrix W2)).comp ((reluDerivCLM (z1 W1 b1 x)).comp (matCLM (toMatrix W1)))) x

                  Fréchet derivative of the network output with respect to the input x, under the ReLU “no kinks” hypothesis.

                  Informally: ∂y/∂x = W2 ∘ ReLU'(z1) ∘ W1.

                  theorem Proofs.Autograd.grad_x_mse {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (t : Vec outDim) (hx : ∀ (i : Fin hidDim), (z1 W1 b1 x).ofLp i ≠ 0) :
                  (ContinuousLinearMap.adjoint (fderiv (fun (x : Vec inDim) => mse t (mlpVecMat W1 b1 W2 b2 x)) x)) 1 = (ContinuousLinearMap.adjoint (matCLM (toMatrix W1))) ((reluDerivCLM (z1 W1 b1 x)) ((ContinuousLinearMap.adjoint (matCLM (toMatrix W2))) (mseGrad (y W1 b1 W2 b2 x) t)))

                  Closed-form gradient of the scalar loss with respect to the input x, under the ReLU “no kinks” hypothesis.
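                  Informally, writing δ_h := relu'(z1) ⊙ (W2ᵀ (∂L/∂y)) for the backpropagated hidden gradient:

                  ∂L/∂x = W1ᵀ δ_h,

                  which is the composition ∂y/∂x = W2 ∘ ReLU'(z1) ∘ W1 read backwards through its adjoint.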

                  theorem Proofs.Autograd.hasFDerivAt_mlp_wrt_W1 {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (hx : ∀ (i : Fin hidDim), (z1 W1 b1 x).ofLp i ≠ 0) :
                  HasFDerivAt (fun (W1 : Mat hidDim inDim) => mlpVecMat W1 b1 W2 b2 x) ((matCLM (toMatrix W2)).comp ((reluDerivCLM (z1 W1 b1 x)).comp (matApplyLin x))) W1

                  Fréchet derivative of the network output with respect to the first-layer weights W1, under the ReLU “no kinks” hypothesis.

                  The derivative is linear in W1 through the slice dW1 ↦ dW1 x, then propagated through ReLU' and W2.

                  theorem Proofs.Autograd.grad_W1_mse {inDim hidDim outDim : ℕ} (W1 : Mat hidDim inDim) (b1 : Vec hidDim) (W2 : Mat outDim hidDim) (b2 : Vec outDim) (x : Vec inDim) (t : Vec outDim) (hx : ∀ (i : Fin hidDim), (z1 W1 b1 x).ofLp i ≠ 0) :
                  (ContinuousLinearMap.adjoint (fderiv (fun (W1 : Mat hidDim inDim) => mse t (mlpVecMat W1 b1 W2 b2 x)) W1)) 1 = outer ((reluDerivCLM (z1 W1 b1 x)) ((ContinuousLinearMap.adjoint (matCLM (toMatrix W2))) (mseGrad (y W1 b1 W2 b2 x) t))) x

                  Closed-form gradient of the scalar loss with respect to W1 (under the ReLU “no kinks” hypothesis).

                  Result has the expected “outer product” form with backpropagated hidden gradient and input x.
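                  Informally, with the same hidden gradient δ_h = relu'(z1) ⊙ (W2ᵀ (∂L/∂y)) as above:

                  ∂L/∂W1 = δ_h ⊗ x,

                  mirroring the W2 case: the backpropagated gradient at the layer output, outer-multiplied with the layer input.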