TorchLean API

NN.Runtime.Autograd.Engine.FastKernels

FastKernels

Fast (runtime-only) kernels for the eager autograd tape.

This file is not used by the proof-linked compilation path. Instead, it provides drop-in alternative Tape.* constructors for a few hot ops that are performance bottlenecks when evaluated via the spec-layer definitions (which are written for proofs/clarity).

Current scope covers the fast variants defined below: Tape.linearFast (affine layer y = W x + b), Tape.matmulFast (2D matmul), and Tape.mseLossFast (vector MSE loss).

These are enabled from the PyTorch-style front-end (NN/Runtime/Autograd/Torch/Core.lean) via an opt-in flag Torch.Options.fastKernels.

GpuMatmulPrecision is the precision selector for GPU-backed fast matmul over Lean Float tensors.

  • .fp32 routes through Cuda.Buffer and cuBLAS SGEMM, matching the precision used by the eager CUDA tensor-buffer stack.
  • .fp64 routes through the host FloatArray DGEMM bridge and cuBLAS DGEMM, preserving Lean Float precision for matmul-only research paths.
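
For orientation, a minimal sketch of what such a selector looks like, reconstructed from the two constructor names above (the real definition lives in this module and may differ in attributes and deriving clauses):

inductive GpuMatmulPrecision where
  | fp32  -- route through Cuda.Buffer and cuBLAS SGEMM (float32)
  | fp64  -- route through the host FloatArray bridge and cuBLAS DGEMM (float64)
  deriving Repr, BEq
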
Instances For

Convert a length-n vector tensor to an Array α.

This is a small runtime helper used by the fast kernels to run tight loops (for i in [0:n]). It is safe because the tensor shape carries n, and the result is constructed via Array.ofFn.
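
A hedged sketch of the pattern, with a hypothetical accessor get' standing in for the real tensor indexer (it is not the library API):

-- Sketch only: `get'` is a stand-in for the tensor accessor.
def vecToArray {α : Type} {n : Nat} (get' : Fin n → α) : Array α :=
  Array.ofFn get'   -- size is exactly n by construction

-- Tight-loop consumption in the style the docstring describes;
-- `zero` stands in for a [Zero α] instance.
def vecSum {α : Type} [Inhabited α] [Add α] (zero : α) {n : Nat}
    (get' : Fin n → α) : α := Id.run do
  let a := vecToArray get'
  let mut acc := zero
  for i in [0:n] do
    acc := acc + a[i]!   -- in-bounds: a.size = n
  return acc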

Instances For

Convert an (m×n) matrix tensor into an array-of-rows representation.

This is purely a representation change to make runtime loops faster/easier to write.
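
A sketch under the same assumption of a hypothetical (row, col) accessor get':

-- Sketch only: `get'` stands in for the real (row, col) tensor indexer.
def toRows {α : Type} (m n : Nat) (get' : Fin m → Fin n → α) :
    Array (Array α) :=
  Array.ofFn fun (i : Fin m) => Array.ofFn fun (j : Fin n) => get' i j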

Instances For

def Runtime.Autograd.FastKernels.dot {α : Type} [Inhabited α] [Add α] [Mul α] [Zero α] (n : ℕ) (a b : Array α) :
    α

Dot product of two length-n arrays.

The fast kernels build arrays using Array.ofFn, so both inputs have the right size and get! is safe in this context.
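
A minimal sketch of the loop, with an explicit zero argument standing in for the [Zero α] instance so the snippet is dependency-free:

-- Sketch only; `get!` is safe when both arrays come from Array.ofFn with size n.
def dotSketch {α : Type} [Inhabited α] [Add α] [Mul α] (zero : α)
    (n : Nat) (a b : Array α) : α := Id.run do
  let mut acc := zero
  for i in [0:n] do
    acc := acc + a[i]! * b[i]!
  return acc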

Instances For

linearForward computes the affine layer forward pass:

y = W x + b

for a 2D weight matrix W : (outDim × inDim) and 1D vectors x : inDim, b : outDim.

This is a runtime-only kernel intended for performance; compare the spec-layer definition, which is optimized for proofs/clarity.
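
A hedged sketch of the forward pass over array-of-rows data (names are illustrative; the library kernel operates on Spec.Tensor values):

-- y[i] = dot(W.row i, x) + b[i]; `zero` stands in for [Zero α].
def linearForwardSketch {α : Type} [Inhabited α] [Add α] [Mul α] (zero : α)
    (inDim outDim : Nat) (W : Array (Array α)) (x b : Array α) : Array α :=
  Array.ofFn fun (i : Fin outDim) => Id.run do
    let row := W[i.val]!
    let mut acc := zero
    for k in [0:inDim] do
      acc := acc + row[k]! * x[k]!
    return acc + b[i.val]!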

Instances For

linearBackward returns the standard affine-layer gradients.

If y = W x + b and the upstream gradient is g = dL/dy, then:

• dL/dW = g ⊗ x (outer product), i.e. dW[i,k] = g[i] * x[k]
• dL/db = g
• dL/dx = Wᵀ g, i.e. dx[k] = Σ_i W[i,k] * g[i]

These formulas match the usual backprop rule used by PyTorch autograd for torch.nn.Linear. See PyTorch "Autograd mechanics": https://pytorch.org/docs/stable/notes/autograd.html
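
A sketch of all three gradients over array-of-rows data, under the same illustrative conventions as the forward sketch above:

-- Sketch only; `zero` stands in for [Zero α].
def linearBackwardSketch {α : Type} [Inhabited α] [Add α] [Mul α] (zero : α)
    (inDim outDim : Nat) (W : Array (Array α)) (x g : Array α) :
    Array (Array α) × Array α × Array α :=
  let dW := Array.ofFn fun (i : Fin outDim) =>
    Array.ofFn fun (k : Fin inDim) => g[i.val]! * x[k.val]!   -- dW = g ⊗ x
  let dx := Array.ofFn fun (k : Fin inDim) => Id.run do
    let mut acc := zero
    for i in [0:outDim] do
      acc := acc + W[i]![k.val]! * g[i]!                      -- dx = Wᵀ g
    return acc
  (dW, g, dx)                                                 -- db = g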

Instances For

View a length-n vector as an (n × 1) matrix (one column, row-major).

Instances For

View a length-n vector as a (1 × n) row matrix.

Instances For

Drop the trivial second dimension (n × 1) → (n).

Instances For

def Runtime.Autograd.FastKernels.mseSpecVec {α : Type} [Inhabited α] [Add α] [Sub α] [Mul α] [Div α] [Zero α] [Coe ℕ α] {n : ℕ} (yhat target : Spec.Tensor α (Spec.Shape.dim n Spec.Shape.scalar)) :
    α

Vector mean-squared error (MSE) with "mean" reduction:

mse(yhat, target) = (Σ_i (yhat_i - target_i)^2) / n.

This corresponds to torch.nn.functional.mse_loss(..., reduction="mean") for a 1D tensor. Ref: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html
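
Concretely, a Float-only sketch of the reduction (the library version is generic in α and divides by n via a coercion):

def mseMeanSketch (yhat target : Array Float) : Float := Id.run do
  let n := yhat.size
  let mut acc := 0.0
  for i in [0:n] do
    let d := yhat[i]! - target[i]!
    acc := acc + d * d
  return acc / n.toFloat   -- "mean" reduction: divide by n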

Instances For

Gradient of mseSpecVec with respect to yhat.

If mse = (Σ_i (yhat_i - target_i)^2) / n, then:

∂mse/∂yhat_i = 2 * (yhat_i - target_i) / n.
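
The matching Float-only sketch of the gradient:

def mseMeanGradSketch (yhat target : Array Float) : Array Float :=
  let n := yhat.size
  Array.ofFn fun (i : Fin n) =>
    2.0 * (yhat[i.val]! - target[i.val]!) / n.toFloat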

Instances For

Fast (runtime-only) 2D matmul kernel.

This is a tight-loop kernel (array-of-rows representation) intended to avoid the overhead of the spec-layer definitions when running eager autograd.
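
A sketch of the tight-loop kernel over array-of-rows data, specialized to Float for brevity:

-- (m×n)·(n×p) sketch; the library kernel is generic and shape-indexed.
def matmulRowsSketch (m n p : Nat)
    (a b : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun (i : Fin m) =>
    Array.ofFn fun (j : Fin p) => Id.run do
      let mut acc := 0.0
      for k in [0:n] do
        acc := acc + a[i.val]![k]! * b[k]![j.val]!
      return acc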

Instances For

Fast (runtime-only) matmul dispatch.

This is used by the eager autograd tape fast-kernel variant Tape.matmul_fast. The useGpu flag is treated as a hint: the default instance ignores it, while the Float instance can route to CUDA.
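
A hedged sketch of the dispatch shape; the class and instance below are illustrative (the real FastMatmul works over Spec.Tensor shapes and threads the precision selector) and reuse matmulRowsSketch from above:

-- Sketch only: a class whose single method takes the useGpu hint.
class FastMatmulSketch (α : Type) where
  matmul : (useGpu : Bool) → Array (Array α) → Array (Array α) → Array (Array α)

-- Default-style instance that ignores the hint and stays on the CPU path.
instance : FastMatmulSketch Float where
  matmul _useGpu a b :=
    matmulRowsSketch a.size b.size ((b[0]?.getD #[]).size) a b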

Instances

@[implicit_reducible, instance 10]

Default FastMatmul: always use the CPU fast-kernel implementation.

2D matmul forward via the float32 CUDA buffer stack.

This path uploads Lean Float values to Cuda.Buffer (rounding to float32), calls the existing Buffer.bmm SGEMM implementation with batch = 1, then downloads the float32 result back to Lean Float.

Instances For
@[reducible, inline]

Default GPU matmul alias: the FP64 cuBLAS path for Lean Float tensors.

Instances For

@[implicit_reducible]

Float instance: use an explicit cuBLAS FP32/FP64 path when useGpu = true.

Fast affine forward y = W x + b using a single 2D matmul (W is (outDim × inDim), x as (inDim × 1)), then add_spec for the bias.

When useGpu is true and α = Float, FastMatmul routes the multiply to the requested cuBLAS precision (if built with -K cuda=true); otherwise the same path uses the CPU tight-loop matmul.

Instances For

Fast affine backward using three matmul-shaped operations (two GEMMs + bias copy).

Matches the usual Linear autograd: dW = g xᵀ, db = g, dx = Wᵀ g.

Instances For

Fast (runtime-only) matmul backward rule computed via two matmuls and transposes.

If y = a*b and g = dL/dy, then:

• dL/da = g * bᵀ
• dL/db = aᵀ * g
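
A sketch of this rule built from an explicit transpose plus the matmulRowsSketch helper above (Float-only, with a : (m×n), b : (n×p), g : (m×p) as in the bullets):

def transposeSketch (rows cols : Nat) (x : Array (Array Float)) :
    Array (Array Float) :=
  Array.ofFn fun (j : Fin cols) =>
    Array.ofFn fun (i : Fin rows) => x[i.val]![j.val]!

def matmulBackwardSketch (m n p : Nat)
    (a b g : Array (Array Float)) :
    Array (Array Float) × Array (Array Float) :=
  (matmulRowsSketch m p n g (transposeSketch n p b),   -- dL/da = g · bᵀ
   matmulRowsSketch n m p (transposeSketch m n a) g)   -- dL/db = aᵀ · g
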
Instances For

def Runtime.Autograd.Tape.linearFast {α : Type} [Add α] [Mul α] [Zero α] [DecidableEq Spec.Shape] [FastKernels.FastMatmul α] {inDim outDim : ℕ} (useGpu : Bool) (gpuPrecision : FastKernels.GpuMatmulPrecision := FastKernels.GpuMatmulPrecision.fp32) (t : Tape α) (wId bId xId : ℕ) :

Fast runtime variant of Tape.linear.

This is intended as a drop-in replacement for the affine layer y = W x + b when the shapes are exactly W : (outDim×inDim), b : outDim, x : inDim. It implements the multiply via FastMatmul (one GEMM-shaped matmul + bias), so Torch.Options.useGpu can route Float matmuls to cuBLAS when enabled.

PyTorch comparison: torch.nn.Linear / torch.nn.functional.linear. Reference: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html

Instances For
def Runtime.Autograd.Tape.matmulFast {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] [FastKernels.FastMatmul α] {m n p : ℕ} (useGpu : Bool) (gpuPrecision : FastKernels.GpuMatmulPrecision := FastKernels.GpuMatmulPrecision.fp32) (t : Tape α) (aId bId : ℕ) :

Fast runtime variant of Tape.matmul for 2D tensors.

This is a runtime-only optimization: it uses the eager fast-kernel implementation FastKernels.matmulForward rather than the spec-layer definition.

Instances For
def Runtime.Autograd.Tape.mseLossFast {α : Type} [Inhabited α] [Add α] [Sub α] [Mul α] [Div α] [Zero α] [One α] [Coe ℕ α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (yhatId targetId : ℕ) :

Fast runtime variant of Tape.mse_loss for vector shapes.

PyTorch comparison: torch.nn.functional.mse_loss(..., reduction="mean").

Instances For