FastKernels
Fast (runtime-only) kernels for the eager autograd tape.
This file is not used by the proof-linked compilation path. Instead, it provides drop-in
alternative Tape.* constructors for a few hot ops that are performance bottlenecks when
evaluated via the spec-layer definitions (which are written for proofs/clarity).
Current scope:
- linear_fast for 2D weights × 1D vector + bias (implemented as one matmul + bias add, so GPU mode routes the multiply through FastMatmul)
- mse_loss_fast for vector shapes (.dim n .scalar)
These are enabled from the PyTorch-style front-end (NN/Runtime/Autograd/Torch/Core.lean) via the opt-in flag Torch.Options.fastKernels.
Precision selector for GPU-backed fast matmul over Lean Float tensors.
- .fp32 routes through Cuda.Buffer and cuBLAS SGEMM, matching the precision used by the eager CUDA tensor-buffer stack.
- .fp64 routes through the host FloatArray DGEMM bridge and cuBLAS DGEMM, preserving Lean Float precision for matmul-only research paths.
- fp32 : GpuMatmulPrecision
- fp64 : GpuMatmulPrecision
Convert a length-n vector tensor to an Array α.
This is a small runtime helper used by the fast kernels to run tight loops (for i in [0:n]).
It is safe because the tensor shape carries n, and the result is constructed via Array.ofFn.
Convert an (m×n) matrix tensor into an array-of-rows representation.
This is purely a representation change to make runtime loops faster/easier to write.
Dot product of two length-n arrays.
The fast kernels build arrays using Array.ofFn, so both inputs have the right size and get!
is safe in this context.
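As a point of reference, here is a minimal, self-contained sketch of such an array dot product in plain Lean; dotArr is an illustrative name, not the module's actual definition:

```lean
/-- Sketch: dot product of two length-`n` `Array Float`s.
    `get!`-style indexing is safe under the same assumption as in the module:
    both arrays were built with `Array.ofFn`, so each contains `n` elements. -/
def dotArr (n : Nat) (xs ys : Array Float) : Float := Id.run do
  let mut acc : Float := 0.0
  for i in [0:n] do
    acc := acc + xs[i]! * ys[i]!
  return acc

#eval dotArr 3 #[1.0, 2.0, 3.0] #[4.0, 5.0, 6.0]  -- 32.0
```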
linearForward computes the affine layer forward pass:
y = W x + b
for a 2D weight matrix W : (outDim × inDim) and 1D vectors x : inDim, b : outDim.
This is a runtime-only kernel intended for performance; compare the spec-layer definition, which is optimized for proofs/clarity.
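A self-contained sketch of the same computation over plain arrays (weights stored as an array of rows); the name linearFwdArr and the Array-based types are illustrative, not the kernel's actual signature:

```lean
/-- Sketch: y = W x + b, with W given as `outDim` rows of length `inDim`. -/
def linearFwdArr (W : Array (Array Float)) (x b : Array Float) : Array Float :=
  Array.ofFn fun i : Fin W.size =>
    let row := W[i.val]!
    let dot := Id.run do
      let mut acc : Float := 0.0
      for k in [0:x.size] do
        acc := acc + row[k]! * x[k]!
      return acc
    dot + b[i.val]!

-- 2×3 weights, length-3 input, length-2 bias.
#eval linearFwdArr #[#[1.0, 0.0, 1.0], #[0.0, 2.0, 0.0]] #[1.0, 2.0, 3.0] #[0.5, -0.5]
-- #[4.5, 3.5]
```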
linearBackward returns the standard affine-layer gradients.
If y = W x + b and the upstream gradient is g = dL/dy, then:
- dL/dW = g ⊗ x (outer product), i.e. dW[i,k] = g[i] * x[k]
- dL/db = g
- dL/dx = Wᵀ g, i.e. dx[k] = Σ_i W[i,k] * g[i]
These formulas match the usual backprop rule used by PyTorch autograd for torch.nn.Linear.
See PyTorch "Autograd mechanics": https://pytorch.org/docs/stable/notes/autograd.html
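The same three formulas as a self-contained array sketch (illustrative names, same array-of-rows layout as in the forward sketch above):

```lean
/-- Sketch of the affine backward pass:
    dW[i][k] = g[i] * x[k]          (outer product g ⊗ x)
    db       = g
    dx[k]    = Σ_i W[i][k] * g[i]   (Wᵀ g) -/
def linearBwdArr (W : Array (Array Float)) (x g : Array Float) :
    Array (Array Float) × Array Float × Array Float :=
  let dW := Array.ofFn fun i : Fin g.size =>
    Array.ofFn fun k : Fin x.size => g[i.val]! * x[k.val]!
  let db := g
  let dx := Array.ofFn fun k : Fin x.size => Id.run do
    let mut acc : Float := 0.0
    for i in [0:g.size] do
      acc := acc + (W[i]!)[k.val]! * g[i]!
    return acc
  (dW, db, dx)
```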
View a length-n vector as an (n × 1) matrix (one column, row-major).
View a length-n vector as a (1 × n) row matrix.
Drop the trivial second dimension (n × 1) → (n).
Vector mean-squared error (MSE) with "mean" reduction:
mse(yhat, target) = (Σ_i (yhat_i - target_i)^2) / n.
This corresponds to torch.nn.functional.mse_loss(..., reduction="mean") for a 1D tensor.
Ref: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html
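A minimal array-level sketch of this mean reduction (mseArr is an illustrative name):

```lean
/-- Sketch: mean-reduced MSE over two length-`n` arrays. -/
def mseArr (yhat target : Array Float) : Float := Id.run do
  let mut acc : Float := 0.0
  for i in [0:yhat.size] do
    let d := yhat[i]! - target[i]!
    acc := acc + d * d
  return acc / yhat.size.toFloat

#eval mseArr #[1.0, 2.0] #[0.0, 0.0]  -- (1 + 4) / 2 = 2.5
```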
Gradient of mseSpecVec with respect to yhat.
If mse = (Σ_i (yhat_i - target_i)^2) / n, then:
∂mse/∂yhat_i = 2 * (yhat_i - target_i) / n.
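And the matching gradient as an array sketch (mseGradArr is an illustrative name):

```lean
/-- Sketch: ∂mse/∂yhat_i = 2 * (yhat_i - target_i) / n under mean reduction. -/
def mseGradArr (yhat target : Array Float) : Array Float :=
  Array.ofFn fun i : Fin yhat.size =>
    2.0 * (yhat[i.val]! - target[i.val]!) / yhat.size.toFloat

#eval mseGradArr #[1.0, 2.0] #[0.0, 0.0]  -- #[1.0, 2.0]
```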
Fast (runtime-only) 2D matmul kernel.
This is a tight-loop kernel (array-of-rows representation) intended to avoid the overhead of the spec-layer definitions when running eager autograd.
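A self-contained sketch of such a tight-loop kernel on array-of-rows matrices (matmulArr is an illustrative name):

```lean
/-- Sketch: C[i][j] = Σ_k A[i][k] * B[k][j] for A : m×n and B : n×p,
    both stored as arrays of rows. -/
def matmulArr (A B : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun i : Fin A.size =>
    let row := A[i.val]!
    Array.ofFn fun j : Fin (B[0]!).size => Id.run do
      let mut acc : Float := 0.0
      for k in [0:B.size] do
        acc := acc + row[k]! * (B[k]!)[j.val]!
      return acc

#eval matmulArr #[#[1.0, 2.0], #[3.0, 4.0]] #[#[5.0, 6.0], #[7.0, 8.0]]
-- #[#[19.0, 22.0], #[43.0, 50.0]]
```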
Fast (runtime-only) matmul dispatch.
This is used by the eager autograd tape fast-kernel variant Tape.matmul_fast. The useGpu flag is
treated as a hint: the default instance ignores it, while the Float instance can route to CUDA.
- matmul2dFast {m n p : ℕ} (useGpu : Bool) (gpuPrecision : GpuMatmulPrecision) : Spec.Tensor α (Spec.Shape.dim m (Spec.Shape.dim n Spec.Shape.scalar)) → Spec.Tensor α (Spec.Shape.dim n (Spec.Shape.dim p Spec.Shape.scalar)) → Spec.Tensor α (Spec.Shape.dim m (Spec.Shape.dim p Spec.Shape.scalar))
Default FastMatmul: always use the CPU fast-kernel implementation.
2D matmul forward via cuBLAS DGEMM (torchlean_dgemm_cuda / Cuda.torchleanDgemmCuda).
2D matmul forward via the float32 CUDA buffer stack.
This path uploads Lean Float values to Cuda.Buffer (rounding to float32), calls the existing
Buffer.bmm SGEMM implementation with batch = 1, then downloads the float32 result back to Lean
Float.
Default GPU matmul alias: the FP64 cuBLAS path for Lean Float tensors.
Dispatch to the requested GPU matmul precision.
Float instance: use an explicit cuBLAS FP32/FP64 path when useGpu=true.
Fast affine forward y = W x + b using a single 2D matmul (W is (outDim × inDim), x as
(inDim × 1)), then add_spec for the bias.
When useGpu is true and α = Float, FastMatmul routes the multiply to the requested cuBLAS
precision (if built with -K cuda=true); otherwise the same path uses the CPU tight-loop matmul.
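A sketch of that data flow on plain arrays: view x as a one-column matrix, perform a single matmul against W, drop the trivial second dimension, and add the bias. This illustrates only the reshaping; the cuBLAS routing itself lives in FastMatmul. Names are illustrative.

```lean
/-- Sketch of the GEMM-shaped affine pass:
    x : inDim is viewed as an (inDim × 1) column matrix, one
    (outDim × inDim) · (inDim × 1) matmul produces an (outDim × 1) column,
    which is flattened before the bias is added. -/
def linearViaMatmul (W : Array (Array Float)) (x b : Array Float) : Array Float :=
  -- x as a one-column matrix (row-major): row k is #[x[k]].
  let xCol : Array (Array Float) := Array.ofFn fun k : Fin x.size => #[x[k.val]!]
  -- One matmul: (outDim × inDim) · (inDim × 1) → (outDim × 1).
  let yCol : Array (Array Float) := Array.ofFn fun i : Fin W.size =>
    let acc := Id.run do
      let mut s : Float := 0.0
      for k in [0:xCol.size] do
        s := s + (W[i.val]!)[k]! * (xCol[k]!)[0]!
      return s
    #[acc]
  -- Drop the trivial second dimension and add the bias.
  Array.ofFn fun i : Fin yCol.size => (yCol[i.val]!)[0]! + b[i.val]!
```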
Fast affine backward using three matmul-shaped operations (two GEMMs + bias copy).
Matches the usual Linear autograd: dW = g xᵀ, db = g, dx = Wᵀ g.
Fast (runtime-only) matmul backward rule computed via two matmuls and transposes.
If y = a*b and g = dL/dy, then:
- dL/da = g * bᵀ
- dL/db = aᵀ * g
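As a self-contained sketch of these two rules (illustrative names; mm and transposeArr stand in for the fast matmul and transpose helpers):

```lean
/-- Sketch: transpose of an array-of-rows matrix. -/
def transposeArr (A : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun j : Fin (A[0]!).size =>
    Array.ofFn fun i : Fin A.size => (A[i.val]!)[j.val]!

/-- Sketch: plain array-of-rows matmul, as in the forward sketch. -/
def mm (A B : Array (Array Float)) : Array (Array Float) :=
  Array.ofFn fun i : Fin A.size =>
    Array.ofFn fun j : Fin (B[0]!).size => Id.run do
      let mut acc : Float := 0.0
      for k in [0:B.size] do
        acc := acc + (A[i.val]!)[k]! * (B[k]!)[j.val]!
      return acc

/-- Sketch of the matmul backward rule: for y = a * b and g = dL/dy,
    dL/da = g * bᵀ and dL/db = aᵀ * g. -/
def matmulBwdArr (a b g : Array (Array Float)) :
    Array (Array Float) × Array (Array Float) :=
  (mm g (transposeArr b), mm (transposeArr a) g)
```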
Fast runtime variant of Tape.linear.
This is intended as a drop-in replacement for the affine layer y = W x + b when the shapes are
exactly W : (outDim×inDim), b : outDim, x : inDim. It implements the multiply via
FastMatmul (one GEMM-shaped matmul + bias), so Torch.Options.useGpu can route Float matmuls to
cuBLAS when enabled.
PyTorch comparison: torch.nn.Linear / torch.nn.functional.linear.
Reference: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
Fast runtime variant of Tape.matmul for 2D tensors.
This is a runtime-only optimization: it uses the eager fast-kernel implementation
FastKernels.matmulForward rather than the spec-layer definition.
Fast runtime variant of Tape.mse_loss for vector shapes.
- If s = .dim n .scalar, we use FastKernels.mseSpecVec / FastKernels.mseDerivVec.
- If s = .scalar or a non-vector tensor, we fall back to the generic Tape.mse_loss.
PyTorch comparison: torch.nn.functional.mse_loss(..., reduction="mean").