FastKernels
Fast (runtime-only) kernels for the eager autograd tape.
This file is not used by the proof-linked compilation path. Instead, it provides drop-in
alternative Tape.* constructors for a few hot ops that are performance bottlenecks when
evaluated via the spec-layer definitions (which are written for proofs/clarity).
Current scope:
- linear_fast for 2D weights × 1D vector + bias (implemented as one matmul + bias add, so GPU mode routes the multiply through FastMatmul)
- mse_loss_fast for vector shapes (.dim n .scalar)
These are enabled from the PyTorch-style front-end (NN/Runtime/Autograd/Torch/Core.lean) via the opt-in flag Torch.Options.fastKernels.
Precision selector for GPU-backed fast matmul over Lean Float tensors.
- .fp32 routes through Cuda.Buffer and cuBLAS SGEMM, matching the precision used by the eager CUDA tensor-buffer stack.
- .fp64 routes through the host FloatArray DGEMM bridge and cuBLAS DGEMM, preserving Lean Float precision for matmul-only research paths.
- fp32 : GpuMatmulPrecision
- fp64 : GpuMatmulPrecision
Convert a length-n vector tensor to an Array α.
This is a small runtime helper used by the fast kernels to run tight loops (for i in [0:n]).
It is safe because the tensor shape carries n, and the result is constructed via Array.ofFn.
Convert an (m×n) matrix tensor into an array-of-rows representation.
This is purely a representation change to make runtime loops faster/easier to write.
Dot product of two length-n arrays.
The fast kernels build arrays using Array.ofFn, so both inputs have the right size and get!
is safe in this context.
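As a point of reference, here is a minimal, self-contained sketch of such an array dot product in plain Lean; dotArr is an illustrative name, not the module's actual definition:

```lean
/-- Sketch: dot product of two length-`n` `Array Float`s.
    `get!`-style indexing is safe under the same assumption as in the module:
    both arrays were built with `Array.ofFn`, so each contains `n` elements. -/
def dotArr (n : Nat) (xs ys : Array Float) : Float := Id.run do
  let mut acc : Float := 0.0
  for i in [0:n] do
    acc := acc + xs[i]! * ys[i]!
  return acc

#eval dotArr 3 #[1.0, 2.0, 3.0] #[4.0, 5.0, 6.0]  -- 32.0
```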
linearForward computes the affine layer forward pass:
y = W x + b
for a 2D weight matrix W : (outDim × inDim) and 1D vectors x : inDim, b : outDim.
This is a runtime-only kernel intended for performance; compare the spec-layer definition, which is optimized for proofs/clarity.
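A self-contained sketch of the same computation over plain arrays (weights stored as an array of rows); the name linearFwdArr and the Array-based types are illustrative, not the kernel's actual signature:

```lean
/-- Sketch: y = W x + b, with W given as `outDim` rows of length `inDim`. -/
def linearFwdArr (W : Array (Array Float)) (x b : Array Float) : Array Float :=
  Array.ofFn fun i : Fin W.size =>
    let row := W[i.val]!
    let dot := Id.run do
      let mut acc : Float := 0.0
      for k in [0:x.size] do
        acc := acc + row[k]! * x[k]!
      return acc
    dot + b[i.val]!

-- 2×3 weights, length-3 input, length-2 bias.
#eval linearFwdArr #[#[1.0, 0.0, 1.0], #[0.0, 2.0, 0.0]] #[1.0, 2.0, 3.0] #[0.5, -0.5]
-- #[4.5, 3.5]
```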
linearBackward returns the standard affine-layer gradients.
If y = W x + b and the upstream gradient is g = dL/dy, then:
- dL/dW = g ⊗ x (outer product), i.e. dW[i,k] = g[i] * x[k]
- dL/db = g
- dL/dx = Wᵀ g, i.e. dx[k] = Σ_i W[i,k] * g[i]
These formulas match the usual backprop rule used by PyTorch autograd for torch.nn.Linear.
See PyTorch "Autograd mechanics": https://pytorch.org/docs/stable/notes/autograd.html
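The same three formulas as a self-contained array sketch (illustrative names, same array-of-rows layout as in the forward sketch above):

```lean
/-- Sketch of the affine backward pass:
    dW[i][k] = g[i] * x[k]          (outer product g ⊗ x)
    db       = g
    dx[k]    = Σ_i W[i][k] * g[i]   (Wᵀ g) -/
def linearBwdArr (W : Array (Array Float)) (x g : Array Float) :
    Array (Array Float) × Array Float × Array Float :=
  let dW := Array.ofFn fun i : Fin g.size =>
    Array.ofFn fun k : Fin x.size => g[i.val]! * x[k.val]!
  let db := g
  let dx := Array.ofFn fun k : Fin x.size => Id.run do
    let mut acc : Float := 0.0
    for i in [0:g.size] do
      acc := acc + (W[i]!)[k.val]! * g[i]!
    return acc
  (dW, db, dx)
```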
View a length-n vector as an (n × 1) matrix (one column, row-major).
View a length-n vector as a (1 × n) row matrix.
Drop the trivial second dimension (n × 1) → (n).
Vector mean-squared error (MSE) with "mean" reduction:
mse(yhat, target) = (Σ_i (yhat_i - target_i)^2) / n.
This corresponds to torch.nn.functional.mse_loss(..., reduction="mean") for a 1D tensor.
Ref: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html
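A minimal array-level sketch of this mean reduction (mseArr is an illustrative name):

```lean
/-- Sketch: mean-reduced MSE over two length-`n` arrays. -/
def mseArr (yhat target : Array Float) : Float := Id.run do
  let mut acc : Float := 0.0
  for i in [0:yhat.size] do
    let d := yhat[i]! - target[i]!
    acc := acc + d * d
  return acc / yhat.size.toFloat

#eval mseArr #[1.0, 2.0] #[0.0, 0.0]  -- (1 + 4) / 2 = 2.5
```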
Gradient of mseSpecVec with respect to yhat.
If mse = (Σ_i (yhat_i - target_i)^2) / n, then:
∂mse/∂yhat_i = 2 * (yhat_i - target_i) / n.
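And the matching gradient as an array sketch (mseGradArr is an illustrative name):

```lean
/-- Sketch: ∂mse/∂yhat_i = 2 * (yhat_i - target_i) / n under mean reduction. -/
def mseGradArr (yhat target : Array Float) : Array Float :=
  Array.ofFn fun i : Fin yhat.size =>
    2.0 * (yhat[i.val]! - target[i.val]!) / yhat.size.toFloat

#eval mseGradArr #[1.0, 2.0] #[0.0, 0.0]  -- #[1.0, 2.0]
```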
Fast (runtime-only) 2D matmul kernel.
This is a tight-loop kernel (array-of-rows representation) intended to avoid the overhead of the spec-layer definitions when running eager autograd.
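A self-contained sketch of such a tight-loop kernel on array-of-rows matrices (matmulArr is an illustrative name):

```lean
/-- Sketch: C[i][j] = Σ_k A[i][k] * B[k][j] for A : m×n and B : n×p,
    both stored as arrays of rows. -/
def matmulArr (A B : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun i : Fin A.size =>
    let row := A[i.val]!
    Array.ofFn fun j : Fin (B[0]!).size => Id.run do
      let mut acc : Float := 0.0
      for k in [0:B.size] do
        acc := acc + row[k]! * (B[k]!)[j.val]!
      return acc

#eval matmulArr #[#[1.0, 2.0], #[3.0, 4.0]] #[#[5.0, 6.0], #[7.0, 8.0]]
-- #[#[19.0, 22.0], #[43.0, 50.0]]
```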
Fast (runtime-only) matmul dispatch.
This is used by the eager autograd tape fast-kernel variant Tape.matmul_fast. The useGpu flag is
treated as a hint: the default instance ignores it, while the Float instance can route to CUDA.
- matmul2dFast {m n p : ℕ} (useGpu : Bool) (gpuPrecision : GpuMatmulPrecision) : Spec.Tensor α (Spec.Shape.dim m (Spec.Shape.dim n Spec.Shape.scalar)) → Spec.Tensor α (Spec.Shape.dim n (Spec.Shape.dim p Spec.Shape.scalar)) → Spec.Tensor α (Spec.Shape.dim m (Spec.Shape.dim p Spec.Shape.scalar))
Default FastMatmul: always use the CPU fast-kernel implementation.
2D matmul forward via cuBLAS DGEMM (torchlean_dgemm_cuda / Cuda.torchleanDgemmCuda).
2D matmul forward via the float32 CUDA buffer stack.
This path uploads Lean Float values to Cuda.Buffer (rounding to float32), calls the existing
Buffer.bmm SGEMM implementation with batch = 1, then downloads the float32 result back to Lean
Float.
Default GPU matmul alias: the FP64 cuBLAS path for Lean Float tensors.
Dispatch to the requested GPU matmul precision.
Float instance: use an explicit cuBLAS FP32/FP64 path when useGpu=true.
Fast affine forward y = W x + b using a single 2D matmul (W is (outDim × inDim), x as
(inDim × 1)), then add_spec for the bias.
When useGpu is true and α = Float, FastMatmul routes the multiply to the requested cuBLAS
precision (if built with -K cuda=true); otherwise the same path uses the CPU tight-loop matmul.
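A sketch of that data flow on plain arrays: view x as a one-column matrix, perform a single matmul against W, drop the trivial second dimension, and add the bias. This illustrates only the reshaping; the cuBLAS routing itself lives in FastMatmul. Names are illustrative.

```lean
/-- Sketch of the GEMM-shaped affine pass:
    x : inDim is viewed as an (inDim × 1) column matrix, one
    (outDim × inDim) · (inDim × 1) matmul produces an (outDim × 1) column,
    which is flattened before the bias is added. -/
def linearViaMatmul (W : Array (Array Float)) (x b : Array Float) : Array Float :=
  -- x as a one-column matrix (row-major): row k is #[x[k]].
  let xCol : Array (Array Float) := Array.ofFn fun k : Fin x.size => #[x[k.val]!]
  -- One matmul: (outDim × inDim) · (inDim × 1) → (outDim × 1).
  let yCol : Array (Array Float) := Array.ofFn fun i : Fin W.size =>
    let acc := Id.run do
      let mut s : Float := 0.0
      for k in [0:xCol.size] do
        s := s + (W[i.val]!)[k]! * (xCol[k]!)[0]!
      return s
    #[acc]
  -- Drop the trivial second dimension and add the bias.
  Array.ofFn fun i : Fin yCol.size => (yCol[i.val]!)[0]! + b[i.val]!
```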
Fast affine backward using three matmul-shaped operations (two GEMMs + bias copy).
Matches the usual Linear autograd: dW = g xᵀ, db = g, dx = Wᵀ g.
Fast (runtime-only) matmul backward rule computed via two matmuls and transposes.
If y = a*b and g = dL/dy, then:
- dL/da = g * bᵀ
- dL/db = aᵀ * g
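As a self-contained sketch of these two rules (illustrative names; mm and transposeArr stand in for the fast matmul and transpose helpers):

```lean
/-- Sketch: transpose of an array-of-rows matrix. -/
def transposeArr (A : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun j : Fin (A[0]!).size =>
    Array.ofFn fun i : Fin A.size => (A[i.val]!)[j.val]!

/-- Sketch: plain array-of-rows matmul, as in the forward sketch. -/
def mm (A B : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun i : Fin A.size =>
    Array.ofFn fun j : Fin (B[0]!).size => Id.run do
      let mut acc : Float := 0.0
      for k in [0:B.size] do
        acc := acc + (A[i.val]!)[k]! * (B[k]!)[j.val]!
      return acc

/-- Sketch of the matmul backward rule: for y = a * b and g = dL/dy,
    dL/da = g * bᵀ and dL/db = aᵀ * g. -/
def matmulBwdArr (a b g : Array (Array Float)) :
    Array (Array Float) × Array (Array Float) :=
  (mm g (transposeArr b), mm (transposeArr a) g)
```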
Fast runtime variant of Tape.linear.
This is intended as a drop-in replacement for the affine layer y = W x + b when the shapes are
exactly W : (outDim×inDim), b : outDim, x : inDim. It implements the multiply via
FastMatmul (one GEMM-shaped matmul + bias), so Torch.Options.useGpu can route Float matmuls to
cuBLAS when enabled.
PyTorch comparison: torch.nn.Linear / torch.nn.functional.linear.
Reference: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
Fast runtime variant of Tape.matmul for 2D tensors.
This is a runtime-only optimization: it uses the eager fast-kernel implementation
FastKernels.matmulForward rather than the spec-layer definition.
Fast runtime variant of Tape.mse_loss for vector shapes.
- If s = .dim n .scalar, we use FastKernels.mseSpecVec / FastKernels.mseDerivVec.
- If s = .scalar or a non-vector tensor, we fall back to the generic Tape.mse_loss.
PyTorch comparison: torch.nn.functional.mse_loss(..., reduction="mean").