Pure specifications for CUDA float32 kernels #
This file is the proof-facing companion to NN.Runtime.Autograd.Engine.Cuda.*.
The native CUDA backend is an FFI boundary, so Lean cannot prove facts about the compiled .cu
binary directly. What we can do well is factor the interface into three layers:
- Pure Lean kernel specs in this file: row-major indexing, elementwise maps, fixed-order reductions, gather/scatter, and batched matmul are ordinary Lean functions over finite indices.
- Scalar float32 facts from Float32Contract: if native result bits match IEEE32Exec, then the existing IEEE32Exec → FP32-on-ℝ theorems apply.
- Native validation / trust boundary: CUDA C, libdevice, cuBLAS, compiler flags, GPU hardware, and driver behavior are validated by tests and documented assumptions, not proved by Lean.
This split is deliberate. It lets us prove the algorithm/indexing contracts that TorchLean owns, without claiming that the Lean kernel can inspect NVIDIA's compiler, runtime, or device ISA.
External references for the assumptions named here:
- IEEE 754-2019 defines binary32 arithmetic and special values: https://standards.ieee.org/ieee/754/6210/
- NVIDIA CUDA C Programming Guide documents the CUDA execution/memory model: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- cuBLAS documents GEMM's column-major API contract; TorchLean's CUDA BMM uses a row-major interpretation around that API: https://docs.nvidia.com/cuda/cublas/
- PyTorch's tensor docs are a useful user-facing analogue for row-major/strided tensor operations: https://pytorch.org/docs/stable/tensors.html
Flat row-major buffers #
A pure Lean view of a contiguous float32 buffer of length n.
Native Cuda.Buffer is opaque and mutable outside Lean. FlatBuffer n is the spec counterpart:
every valid linear index has a reference scalar value.
A native-result buffer represented only by raw binary32 bits.
Reinterpret a native bit buffer as reference IEEE32Exec values.
Total lookup for flat-buffer specs that decode indices from arithmetic.
Well-formed kernel preconditions should make the in-bounds branch fire. The fallback keeps the spec total in Lean, and later layout proofs can discharge the bounds separately.
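As a non-authoritative sketch, the shape of these definitions is roughly the following. The names, the UInt32 bit representation, and the stand-in scalar type `α` (in place of IEEE32Exec values) are assumptions for illustration, not the file's actual code:

```lean
/-- Sketch: a pure spec view of a contiguous buffer, with `α` standing in
for IEEE32Exec scalar values. -/
structure FlatBuffer (α : Type) (n : Nat) where
  get : Fin n → α

/-- Sketch: a native-result buffer carried only as raw binary32 bits. -/
structure NativeBitsBuffer (n : Nat) where
  bits : Fin n → UInt32

/-- Sketch: reinterpret native bits as reference scalars, given a decoder. -/
def fromNativeBits {α : Type} {n : Nat} (decode : UInt32 → α)
    (buf : NativeBitsBuffer n) : FlatBuffer α n :=
  ⟨fun i => decode (buf.bits i)⟩

/-- Sketch: total lookup for indices decoded from arithmetic. Out-of-bounds
indices hit a default so the spec stays total; bounds are discharged
separately by layout proofs. -/
def FlatBuffer.getD {α : Type} {n : Nat} (buf : FlatBuffer α n)
    (i : Nat) (default : α) : α :=
  if h : i < n then buf.get ⟨i, h⟩ else default
```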
Extract reference bits pointwise.
Elementwise kernels #
Pure spec for a unary elementwise CUDA kernel.
Pure spec for a binary elementwise CUDA kernel.
Elementwise addition reference spec.
Elementwise multiplication reference spec.
Elementwise division reference spec.
Elementwise square-root reference spec.
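Continuing the sketch above (names still hypothetical), the unary and binary elementwise specs are ordinary pointwise maps over the `FlatBuffer` stand-in, and the four reference specs are instances of them:

```lean
/-- Sketch: unary elementwise kernel spec — apply `f` at every index. -/
def mapSpec {α : Type} {n : Nat} (f : α → α)
    (a : FlatBuffer α n) : FlatBuffer α n :=
  ⟨fun i => f (a.get i)⟩

/-- Sketch: binary elementwise kernel spec — combine same-index elements. -/
def map2Spec {α : Type} {n : Nat} (f : α → α → α)
    (a b : FlatBuffer α n) : FlatBuffer α n :=
  ⟨fun i => f (a.get i) (b.get i)⟩

-- The reference specs would then be instances of these maps, e.g.
-- addSpec a b = map2Spec IEEE32Exec.add a b, and the square-root spec an
-- instance of mapSpec (operation names assumed for illustration).
```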
If every native result bit agrees with reference addition, the whole native elementwise-add buffer
agrees extensionally with addSpec.
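Continuing the sketch, the statement pattern is a pointwise defeq transfer (hypothetical name; full buffer equality then follows by function extensionality):

```lean
/-- Sketch: pointwise bit agreement makes the decoded native buffer agree
with the reference spec at every index. -/
theorem fromNativeBits_get_eq {α : Type} {n : Nat}
    (decode : UInt32 → α) (native : NativeBitsBuffer n)
    (spec : FlatBuffer α n)
    (hbits : ∀ i, decode (native.bits i) = spec.get i) :
    ∀ i, (fromNativeBits decode native).get i = spec.get i :=
  -- `(fromNativeBits decode native).get i` reduces definitionally to
  -- `decode (native.bits i)`, so the hypothesis closes the goal directly
  hbits
```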
Elementwise multiplication version of fromNativeBitsBuffer_eq_addSpec_of_bits.
Elementwise division version of fromNativeBitsBuffer_eq_addSpec_of_bits.
Elementwise square-root version of fromNativeBitsBuffer_eq_addSpec_of_bits.
Pointwise real-error bound inherited by a native elementwise-add buffer after bit agreement.
Pointwise real-error bound inherited by a native elementwise-multiply buffer after bit agreement.
Pointwise real-error bound inherited by a native elementwise-division buffer after bit agreement.
Pointwise real-error bound inherited by a native elementwise-square-root buffer after bit agreement.
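All four inheritance statements share one transfer step. In sketch form (names hypothetical), once bits agree, any pointwise property proved for the spec, such as a real absolute-error bound, transfers by rewriting:

```lean
/-- Sketch: error-bound inheritance. A property `P` proved pointwise for the
spec holds for the native buffer once bits agree; a real-valued
absolute-error bound is one instance of `P`. -/
theorem native_inherits_pointwise {α : Type} {n : Nat}
    (P : Fin n → α → Prop)
    (spec native : Fin n → α)
    (hspec : ∀ i, P i (spec i))
    (hbits : ∀ i, native i = spec i) :
    ∀ i, P i (native i) := by
  intro i
  rw [hbits i]   -- replace the native value by the spec value
  exact hspec i  -- then the spec-level bound applies verbatim
```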
Fixed-order reductions #
Sequential left-fold reduction over a flat buffer.
This is a deterministic algorithmic spec, not a claim about CUDA atomics. Native atomic reductions only refine this spec under an additional ordering/agreement assumption. TorchLean's deterministic reduction mode is intended to make that assumption true for tested reduction paths.
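A minimal sketch of the fold, over a plain list rather than the file's indexed buffers (the name and the list representation are stand-ins for illustration):

```lean
/-- Sketch: deterministic sequential left-fold reduction. The accumulation
order is fixed: ((zero + x₀) + x₁) + x₂ … -/
def reduceSumLeftSpecList {α : Type} (add : α → α → α) (zero : α)
    (xs : List α) : α :=
  xs.foldl add zero

-- Example, with Float (binary64) standing in for float32 scalars:
#eval reduceSumLeftSpecList (· + ·) (0.0 : Float) [0.1, 0.2, 0.3]
```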
Explicit assumption package for a native reduction implementation.
Use this when a native CUDA reduction has been configured or validated to use the same fixed order as
reduceSumLeftSpec. Non-deterministic atomicAdd reductions should not claim this contract unless
the runtime mode or kernel implementation fixes the accumulation order.
Native fixed-order reduction inherits the reduceSumLeftSpec reference value.
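In sketch form, the contract and the inheritance theorem reduce to a single bitwise-equality field (hypothetical names):

```lean
/-- Sketch: an explicit assumption package — the native reduction result is
hypothesized to equal the fixed-order reference value bitwise. -/
structure FixedOrderReductionAgrees {α : Type} (reference native : α) : Prop where
  bits_eq : native = reference

/-- Sketch: under that contract, the native result inherits the reference
value directly. -/
theorem native_reduce_eq_ref {α : Type} {reference native : α}
    (h : FixedOrderReductionAgrees reference native) :
    native = reference :=
  h.bits_eq
```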
Gather/scatter indexing #
Gather k elements from a length-n vector using proof-carrying indices.
Scatter-add a length-k value buffer into a length-n input buffer.
Repeated indices are accumulated in increasing source-index order. This mirrors the mathematical contract of scatter-add; a native parallel implementation must separately justify or validate its accumulation order when bitwise reproducibility matters.
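As a rough sketch of both indexing specs (hypothetical names, with an abstract `add` in place of IEEE32Exec.add, and functions over `Fin` in place of the file's buffer types):

```lean
/-- Sketch: gather `k` elements through proof-carrying indices. -/
def gatherSpec {α : Type} {n k : Nat} (src : Fin n → α)
    (idx : Fin k → Fin n) : Fin k → α :=
  fun j => src (idx j)

/-- Sketch: scatter-add values into a base buffer, accumulating repeated
target indices in increasing source-index order. -/
def scatterAddSpec {α : Type} (add : α → α → α) {n k : Nat}
    (base : Fin n → α) (idx : Fin k → Fin n) (vals : Fin k → α) :
    Fin n → α :=
  fun i =>
    -- fold over source positions j = 0..k-1 in order, accumulating every
    -- value whose target index hits i
    (List.range k).foldl
      (fun acc j =>
        if h : j < k then
          if idx ⟨j, h⟩ = i then add acc (vals ⟨j, h⟩) else acc
        else acc)
      (base i)
```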
A gather followed by a scatter-add into a zero buffer accumulates each selected source position.
Batched row-major matrix multiplication #
Linear row-major index for A[b, i, k] with shape (batch, m, n).
Linear row-major index for B[b, k, j] with shape (batch, n, p).
Linear row-major index for C[b, i, j] with shape (batch, m, p).
Pure row-major batched matrix multiplication spec.
For each output element C[b,i,j], this folds over k = 0..n-1 using IEEE32Exec.mul followed by
IEEE32Exec.add. This is intentionally a specific accumulation order. cuBLAS may use a different
tree/FMA strategy, so bit-for-bit agreement with this spec is an explicit native contract, not a
free theorem.
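For concreteness, a sketch of the row-major index and the fixed accumulation order, with Float (binary64) standing in for IEEE32Exec binary32 scalars and all names assumed for illustration:

```lean
/-- Sketch: row-major linear index for element `[b, i, j]` of a tensor with
per-batch trailing shape `(rows, cols)`. -/
def rowMajorIndex (rows cols b i j : Nat) : Nat :=
  (b * rows + i) * cols + j

/-- Sketch: for `C[b,i,j]`, fold over `k = 0..n-1`, multiplying then adding
in this exact sequential order. -/
def bmmSpecSketch (m n p : Nat) (A B : Nat → Float)
    (b i j : Nat) : Float :=
  (List.range n).foldl
    (fun acc k =>
      -- A[b,i,k] at (b*m + i)*n + k, B[b,k,j] at (b*n + k)*p + j
      acc + A (rowMajorIndex m n b i k) * B (rowMajorIndex n p b k j))
    0.0
```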
Agreement assumption for a native BMM implementation.
The scalar result bits must match bmmSpec at every output element. This is intentionally stronger
than "numerically close": it is the bitwise contract needed to reuse exact IEEE32Exec proofs.
For cuBLAS-backed kernels this assumption includes:
- row-major TorchLean buffers are interpreted consistently around cuBLAS's column-major GEMM API;
- the accumulation tree/FMA behavior is compatible with the selected reference spec, or the spec is adjusted to the documented cuBLAS/toolchain behavior;
- input and output strides match (batch, m, n), (batch, n, p), and (batch, m, p) row-major layout.
Native BMM inherits the pure row-major bmmSpec when the bitwise agreement contract holds.
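The closing theorem has the same one-line shape as the elementwise case, in sketch form (hypothetical names, with outputs abstracted as functions from linear indices):

```lean
/-- Sketch: bitwise agreement at every output element lifts to equality of
the whole native output with the pure spec, by function extensionality. -/
theorem nativeBMM_eq_bmmSpec {α : Type} {out ref : Nat → α}
    (hbits : ∀ i, out i = ref i) : out = ref :=
  funext hbits
```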