Pure specifications for CUDA float32 kernels #
This file is the proof-facing companion to NN.Runtime.Autograd.Engine.Cuda.*.
The native CUDA backend is an FFI boundary, so Lean cannot prove facts about the compiled .cu
binary directly. What we can do well is factor the interface into three layers:
- Pure Lean kernel specs in this file: row-major indexing, elementwise maps, fixed-order reductions, gather/scatter, and batched matmul are ordinary Lean functions over finite indices.
- Scalar float32 facts from Float32Contract: if native result bits match IEEE32Exec, then the existing IEEE32Exec → FP32-on-ℝ theorems apply.
- Native validation / trust boundary: CUDA C, libdevice, cuBLAS, compiler flags, GPU hardware, and driver behavior are validated by tests and documented assumptions, not proved by Lean.
This split is deliberate. It lets us prove the algorithm/indexing contracts that TorchLean owns, without claiming that the Lean kernel can inspect NVIDIA's compiler, runtime, or device ISA.
External references for the assumptions named here:
- IEEE 754-2019 defines binary32 arithmetic and special values: https://standards.ieee.org/ieee/754/6210/
- NVIDIA CUDA C Programming Guide documents the CUDA execution/memory model: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- cuBLAS documents GEMM's column-major API contract; TorchLean's CUDA BMM uses a row-major interpretation around that API: https://docs.nvidia.com/cuda/cublas/
- PyTorch's tensor docs are a useful user-facing analogue for row-major/strided tensor operations: https://pytorch.org/docs/stable/tensors.html
Flat row-major buffers #
A pure Lean view of a contiguous float32 buffer of length n.
Native Cuda.Buffer is opaque and mutable outside Lean. FlatBuffer n is the spec counterpart:
every valid linear index has a reference scalar value.
A native-result buffer represented only by raw binary32 bits.
Reinterpret a native bit buffer as reference IEEE32Exec values.
Total lookup for flat-buffer specs that decode indices from arithmetic.
Well-formed kernel preconditions should make the in-bounds branch fire. The fallback keeps the spec total in Lean, and later layout proofs can discharge the bounds separately.
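As a non-authoritative sketch, the shape of these definitions is roughly the following. The names, the UInt32 bit representation, and the stand-in scalar type `α` (in place of IEEE32Exec values) are assumptions for illustration, not the file's actual code:

```lean
/-- Sketch: a pure spec view of a contiguous buffer, with `α` standing in
for IEEE32Exec scalar values. -/
structure FlatBuffer (α : Type) (n : Nat) where
  get : Fin n → α

/-- Sketch: a native-result buffer carried only as raw binary32 bits. -/
structure NativeBitsBuffer (n : Nat) where
  bits : Fin n → UInt32

/-- Sketch: reinterpret native bits as reference scalars, given a decoder. -/
def fromNativeBits {α : Type} {n : Nat} (decode : UInt32 → α)
    (buf : NativeBitsBuffer n) : FlatBuffer α n :=
  ⟨fun i => decode (buf.bits i)⟩

/-- Sketch: total lookup for indices decoded from arithmetic. Out-of-bounds
indices hit a default so the spec stays total; bounds are discharged
separately by layout proofs. -/
def FlatBuffer.getD {α : Type} {n : Nat} (buf : FlatBuffer α n)
    (i : Nat) (default : α) : α :=
  if h : i < n then buf.get ⟨i, h⟩ else default
```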
Extract reference bits pointwise.
Elementwise kernels #
Pure spec for a unary elementwise CUDA kernel.
Pure spec for a binary elementwise CUDA kernel.
Elementwise addition reference spec.
Elementwise multiplication reference spec.
Elementwise division reference spec.
Elementwise square-root reference spec.
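Continuing the sketch above (names still hypothetical), the unary and binary elementwise specs are ordinary pointwise maps over the `FlatBuffer` stand-in, and the four reference specs are instances of them:

```lean
/-- Sketch: unary elementwise kernel spec — apply `f` at every index. -/
def mapSpec {α : Type} {n : Nat} (f : α → α)
    (a : FlatBuffer α n) : FlatBuffer α n :=
  ⟨fun i => f (a.get i)⟩

/-- Sketch: binary elementwise kernel spec — combine same-index elements. -/
def map2Spec {α : Type} {n : Nat} (f : α → α → α)
    (a b : FlatBuffer α n) : FlatBuffer α n :=
  ⟨fun i => f (a.get i) (b.get i)⟩

-- The reference specs would then be instances of these maps, e.g.
-- addSpec a b = map2Spec IEEE32Exec.add a b, and the square-root spec an
-- instance of mapSpec (operation names assumed for illustration).
```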
If every native result bit agrees with reference addition, the whole native elementwise-add buffer
agrees extensionally with addSpec.
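Continuing the sketch, the statement pattern is a pointwise defeq transfer (hypothetical name; full buffer equality then follows by function extensionality):

```lean
/-- Sketch: pointwise bit agreement makes the decoded native buffer agree
with the reference spec at every index. -/
theorem fromNativeBits_get_eq {α : Type} {n : Nat}
    (decode : UInt32 → α) (native : NativeBitsBuffer n)
    (spec : FlatBuffer α n)
    (hbits : ∀ i, decode (native.bits i) = spec.get i) :
    ∀ i, (fromNativeBits decode native).get i = spec.get i :=
  -- `(fromNativeBits decode native).get i` reduces definitionally to
  -- `decode (native.bits i)`, so the hypothesis closes the goal directly
  hbits
```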
Elementwise multiplication version of fromNativeBitsBuffer_eq_addSpec_of_bits.
Elementwise division version of fromNativeBitsBuffer_eq_addSpec_of_bits.
Elementwise square-root version of fromNativeBitsBuffer_eq_addSpec_of_bits.
Pointwise real-error bound inherited by a native elementwise-add buffer after bit agreement.
Pointwise real-error bound inherited by a native elementwise-multiply buffer after bit agreement.
Pointwise real-error bound inherited by a native elementwise-division buffer after bit agreement.
Pointwise real-error bound inherited by a native elementwise-square-root buffer after bit agreement.
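All four inheritance statements share one transfer step. In sketch form (names hypothetical), once bits agree, any pointwise property proved for the spec, such as a real absolute-error bound, transfers by rewriting:

```lean
/-- Sketch: error-bound inheritance. A property `P` proved pointwise for the
spec holds for the native buffer once bits agree; a real-valued
absolute-error bound is one instance of `P`. -/
theorem native_inherits_pointwise {α : Type} {n : Nat}
    (P : Fin n → α → Prop)
    (spec native : Fin n → α)
    (hspec : ∀ i, P i (spec i))
    (hbits : ∀ i, native i = spec i) :
    ∀ i, P i (native i) := by
  intro i
  rw [hbits i]   -- replace the native value by the spec value
  exact hspec i  -- then the spec-level bound applies verbatim
```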
Fixed-order reductions #
Sequential left-fold reduction over a flat buffer.
This is a deterministic algorithmic spec, not a claim about CUDA atomics. Native atomic reductions only refine this spec under an additional ordering/agreement assumption. TorchLean's deterministic reduction mode is intended to make that assumption true for tested reduction paths.
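A minimal sketch of the fold, over a plain list rather than the file's indexed buffers (the name and the list representation are stand-ins for illustration):

```lean
/-- Sketch: deterministic sequential left-fold reduction. The accumulation
order is fixed: ((zero + x₀) + x₁) + x₂ … -/
def reduceSumLeftSpecList {α : Type} (add : α → α → α) (zero : α)
    (xs : List α) : α :=
  xs.foldl add zero

-- Example, with Float (binary64) standing in for float32 scalars:
#eval reduceSumLeftSpecList (· + ·) (0.0 : Float) [0.1, 0.2, 0.3]
```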
Explicit assumption package for a native reduction implementation.
Use this when a native CUDA reduction has been configured or validated to use the same fixed order as
reduceSumLeftSpec. Non-deterministic atomicAdd reductions should not claim this contract unless
the runtime mode or kernel implementation fixes the accumulation order.
Native fixed-order reduction inherits the reduceSumLeftSpec reference value.
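In sketch form, the contract and the inheritance theorem reduce to a single bitwise-equality field (hypothetical names):

```lean
/-- Sketch: an explicit assumption package — the native reduction result is
hypothesized to equal the fixed-order reference value bitwise. -/
structure FixedOrderReductionAgrees {α : Type} (reference native : α) : Prop where
  bits_eq : native = reference

/-- Sketch: under that contract, the native result inherits the reference
value directly. -/
theorem native_reduce_eq_ref {α : Type} {reference native : α}
    (h : FixedOrderReductionAgrees reference native) :
    native = reference :=
  h.bits_eq
```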
Gather/scatter indexing #
Gather k elements from a length-n vector using proof-carrying indices.
Scatter-add a length-k value buffer into a length-n input buffer.
Repeated indices are accumulated in increasing source-index order. This mirrors the mathematical contract of scatter-add; a native parallel implementation must separately justify or validate its accumulation order when bitwise reproducibility matters.
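As a rough sketch of both indexing specs (hypothetical names, with an abstract `add` in place of IEEE32Exec.add, and functions over `Fin` in place of the file's buffer types):

```lean
/-- Sketch: gather `k` elements through proof-carrying indices. -/
def gatherSpec {α : Type} {n k : Nat} (src : Fin n → α)
    (idx : Fin k → Fin n) : Fin k → α :=
  fun j => src (idx j)

/-- Sketch: scatter-add values into a base buffer, accumulating repeated
target indices in increasing source-index order. -/
def scatterAddSpec {α : Type} (add : α → α → α) {n k : Nat}
    (base : Fin n → α) (idx : Fin k → Fin n) (vals : Fin k → α) :
    Fin n → α :=
  fun i =>
    -- fold over source positions j = 0..k-1 in order, accumulating every
    -- value whose target index hits i
    (List.range k).foldl
      (fun acc j =>
        if h : j < k then
          if idx ⟨j, h⟩ = i then add acc (vals ⟨j, h⟩) else acc
        else acc)
      (base i)
```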
A gather followed by a scatter-add into a zero buffer accumulates each selected source position.
Batched row-major matrix multiplication #
Linear row-major index for A[b, i, k] with shape (batch, m, n).
Linear row-major index for B[b, k, j] with shape (batch, n, p).
Linear row-major index for C[b, i, j] with shape (batch, m, p).
Pure row-major batched matrix multiplication spec.
For each output element C[b,i,j], this folds over k = 0..n-1 using IEEE32Exec.mul followed by
IEEE32Exec.add. This is intentionally a specific accumulation order. cuBLAS may use a different
tree/FMA strategy, so bit-for-bit agreement with this spec is an explicit native contract, not a
free theorem.
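For concreteness, a sketch of the row-major index and the fixed accumulation order, with Float (binary64) standing in for IEEE32Exec binary32 scalars and all names assumed for illustration:

```lean
/-- Sketch: row-major linear index for element `[b, i, j]` of a tensor with
per-batch trailing shape `(rows, cols)`. -/
def rowMajorIndex (rows cols b i j : Nat) : Nat :=
  (b * rows + i) * cols + j

/-- Sketch: for `C[b,i,j]`, fold over `k = 0..n-1`, multiplying then adding
in this exact sequential order. -/
def bmmSpecSketch (m n p : Nat) (A B : Nat → Float)
    (b i j : Nat) : Float :=
  (List.range n).foldl
    (fun acc k =>
      -- A[b,i,k] at (b*m + i)*n + k, B[b,k,j] at (b*n + k)*p + j
      acc + A (rowMajorIndex m n b i k) * B (rowMajorIndex n p b k j))
    0.0
```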
Agreement assumption for a native BMM implementation.
The scalar result bits must match bmmSpec at every output element. This is intentionally stronger
than "numerically close": it is the bitwise contract needed to reuse exact IEEE32Exec proofs.
For cuBLAS-backed kernels this assumption includes:
- row-major TorchLean buffers are interpreted consistently around cuBLAS's column-major GEMM API;
- the accumulation tree/FMA behavior is compatible with the selected reference spec, or the spec is adjusted to the documented cuBLAS/toolchain behavior;
- input and output strides match (batch, m, n), (batch, n, p), and (batch, m, p) row-major layout.
Native BMM inherits the pure row-major bmmSpec when the bitwise agreement contract holds.
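The closing theorem has the same one-line shape as the elementwise case, in sketch form (hypothetical names, with outputs abstracted as functions from linear indices):

```lean
/-- Sketch: bitwise agreement at every output element lifts to equality of
the whole native output with the pure spec, by function extensionality. -/
theorem nativeBMM_eq_bmmSpec {α : Type} {out ref : Nat → α}
    (hbits : ∀ i, out i = ref i) : out = ref :=
  funext hbits
```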