TorchLean API

NN.Runtime.Autograd.Engine.Cuda.KernelSpec

Pure specifications for CUDA float32 kernels #

This file is the proof-facing companion to NN.Runtime.Autograd.Engine.Cuda.*.

The native CUDA backend is an FFI boundary, so Lean cannot prove facts about the compiled .cu binary directly. What we can do well is factor the interface into three layers:

  1. Pure Lean kernel specs in this file: row-major indexing, elementwise maps, fixed-order reductions, gather/scatter, and batched matmul are ordinary Lean functions over finite indices.
  2. Scalar float32 facts from Float32Contract: if native result bits match IEEE32Exec, then the existing IEEE32Exec → FP32-on-ℝ theorems apply.
  3. Native validation / trust boundary: CUDA C, libdevice, cuBLAS, compiler flags, GPU hardware, and driver behavior are validated by tests and documented assumptions, not proved by Lean.

This split is deliberate. It lets us prove the algorithm/indexing contracts that TorchLean owns, without claiming that the Lean kernel can inspect NVIDIA's compiler, runtime, or device ISA.
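
The shape of the interface, in miniature (a sketch with hypothetical names, using Lean's Float where the real file uses IEEE32Exec scalars):

-- Layer 1: a pure spec that TorchLean owns and can reason about.
def negSpec {n : Nat} (x : Fin n → Float) : Fin n → Float :=
  fun i => -(x i)

-- Layer 3: what the native kernel is assumed to produce. Lean never
-- inspects the compiled .cu binary, only this stated contract.
structure NativeNegAgreement {n : Nat} (native x : Fin n → Float) : Prop where
  agrees : ∀ i, native i = negSpec x i

-- Layer 2 glue: once the contract is assumed, every spec-level fact
-- about negSpec transfers to the native buffer extensionally.
theorem native_neg_eq_spec {n : Nat} {native x : Fin n → Float}
    (h : NativeNegAgreement native x) : native = negSpec x :=
  funext h.agrees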

Flat row-major buffers #

@[reducible, inline]
def Runtime.Autograd.Cuda.KernelSpec.FlatBuffer (n : ℕ) : Type

A pure Lean view of a contiguous float32 buffer of length n.

Native Cuda.Buffer is opaque and mutable outside Lean. FlatBuffer n is the spec counterpart: every valid linear index has a reference scalar value.
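
A minimal sketch of the intended shape, with Float standing in for IEEE32Exec:

-- One reference scalar per valid linear index.
abbrev FlatBuffer' (n : Nat) : Type := Fin n → Float

-- Lookup is total by construction: the index carries its bound proof.
example {n : Nat} (buf : FlatBuffer' n) (i : Fin n) : Float := buf i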

@[reducible, inline]
def Runtime.Autograd.Cuda.KernelSpec.NativeBitsBuffer (n : ℕ) : Type

A native-result buffer represented only by raw binary32 bits.

def Runtime.Autograd.Cuda.KernelSpec.fromNativeBitsBuffer {n : ℕ} (nativeBits : NativeBitsBuffer n) : FlatBuffer n

Reinterpret a native bit buffer as reference IEEE32Exec values.
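
A sketch of the bit-level view; the concrete bits → IEEE32Exec decoder is left abstract here:

-- Native results carried as raw binary32 bit patterns.
abbrev NativeBitsBuffer' (n : Nat) : Type := Fin n → UInt32

-- Pointwise reinterpretation; decodeBits stands in for the real decoder.
def fromBits {n : Nat} (decodeBits : UInt32 → Float)
    (bits : NativeBitsBuffer' n) : Fin n → Float :=
  fun i => decodeBits (bits i)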

Total lookup for flat-buffer specs that decode indices from arithmetic.

Well-formed kernel preconditions should make the in-bounds branch fire. The fallback keeps the spec total in Lean, and later layout proofs can discharge the bounds separately.

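The pattern, sketched with illustrative names: branch on a decidable bounds check and return a default when it fails, so the spec is total whether or not the precondition holds.

def getTotal {n : Nat} (x : Fin n → Float) (i : Nat) : Float :=
  if h : i < n then x ⟨i, h⟩ else 0.0

-- Under the well-formedness precondition the fallback is dead code:
theorem getTotal_of_lt {n : Nat} (x : Fin n → Float) (i : Nat)
    (h : i < n) : getTotal x i = x ⟨i, h⟩ := by
  unfold getTotal
  exact dif_pos h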

Extract reference bits pointwise.

Elementwise kernels #

Pure spec for a unary elementwise CUDA kernel.

Pure spec for a binary elementwise CUDA kernel.

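Both shapes are index-pointwise maps. A sketch with illustrative names, Float in place of IEEE32Exec:

def unaryMapSpec {n : Nat} (f : Float → Float)
    (x : Fin n → Float) : Fin n → Float :=
  fun i => f (x i)

def binaryMapSpec {n : Nat} (f : Float → Float → Float)
    (x y : Fin n → Float) : Fin n → Float :=
  fun i => f (x i) (y i)

-- An addition spec in this style:
example {n : Nat} (x y : Fin n → Float) : Fin n → Float :=
  binaryMapSpec (· + ·) x y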

Elementwise addition reference spec.

Elementwise multiplication reference spec.

Elementwise division reference spec.

Elementwise square-root reference spec.

If every native result bit agrees with reference addition, the whole native elementwise-add buffer agrees extensionally with addSpec.

Fixed-order reductions #

Sequential left-fold reduction over a flat buffer.

This is a deterministic algorithmic spec, not a claim about CUDA atomics. Native atomic reductions only refine this spec under an additional ordering/agreement assumption. TorchLean's deterministic reduction mode is intended to make that assumption true for tested reduction paths.
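
A sketch of the fixed order (illustrative names, Float in place of IEEE32Exec.add): indices are visited 0, 1, ..., n-1, so the accumulation order is fully determined.

def reduceSumLeft {n : Nat} (x : Fin n → Float) : Float :=
  (List.range n).foldl
    (fun acc i =>
      -- the fallback branch is unreachable for i ∈ List.range n;
      -- it only keeps the fold total, as with the lookup above
      if h : i < n then acc + x ⟨i, h⟩ else acc)
    0.0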

Explicit assumption package for a native reduction implementation.

Use this when a native CUDA reduction has been configured or validated to use the same fixed order as reduceSumLeftSpec. Non-deterministic atomicAdd reductions should not claim this contract unless the runtime mode or kernel implementation fixes the accumulation order.

Native fixed-order reduction inherits the reduceSumLeftSpec reference value.

Gather/scatter indexing #

Gather k elements from a length-n vector using proof-carrying indices.

def Runtime.Autograd.Cuda.KernelSpec.scatterAddSpec {n k : ℕ} (x : FlatBuffer n) (values : FlatBuffer k) (idx : Fin k → Fin n) : FlatBuffer n

Scatter-add a length-k value buffer into a length-n input buffer.

Repeated indices are accumulated in increasing source-index order. This mirrors the mathematical contract of scatter-add; a native parallel implementation must separately justify or validate its accumulation order when bitwise reproducibility matters.

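A sketch of both operations (illustrative names, Float in place of IEEE32Exec): gather is pure indirection, and scatter-add folds the sources in increasing index order.

def gatherSpec' {n k : Nat} (x : Fin n → Float)
    (idx : Fin k → Fin n) : Fin k → Float :=
  fun j => x (idx j)

def scatterAddSpec' {n k : Nat} (x : Fin n → Float)
    (values : Fin k → Float) (idx : Fin k → Fin n) : Fin n → Float :=
  fun i =>
    (List.range k).foldl
      (fun acc j =>
        if h : j < k then
          -- sources are visited j = 0, 1, ..., k-1, so repeated
          -- indices accumulate in increasing source-index order
          if idx ⟨j, h⟩ = i then acc + values ⟨j, h⟩ else acc
        else acc)
      (x i)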

A gather followed by scatter-add to zeros accumulates each selected source position.
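
Concretely, with the sketch above: gather position 0 of x = [10, 20] twice, then scatter-add into zeros; both selected copies accumulate at position 0.

def x2 : Fin 2 → Float := fun i => if i = 0 then 10.0 else 20.0
def idx2 : Fin 2 → Fin 2 := fun _ => 0

-- Position 0 receives 10.0 + 10.0; position 1 stays 0.0.
#eval (scatterAddSpec' (fun _ => 0.0) (gatherSpec' x2 idx2) idx2) 0
-- 20.000000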

Batched row-major matrix multiplication #

Linear row-major index for A[b, i, k] with shape (batch, m, n).

Linear row-major index for B[b, k, j] with shape (batch, n, p).

Linear row-major index for C[b, i, j] with shape (batch, m, p).

Decode a flat row-major output index for shape (batch, m, p).
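
The flattening and decoding arithmetic, sketched with illustrative helper names:

--   A[b, i, k] ↦ (b * m + i) * n + k   for shape (batch, m, n)
--   B[b, k, j] ↦ (b * n + k) * p + j   for shape (batch, n, p)
--   C[b, i, j] ↦ (b * m + i) * p + j   for shape (batch, m, p)
def aIndex (m n b i k : Nat) : Nat := (b * m + i) * n + k
def bIndex (n p b k j : Nat) : Nat := (b * n + k) * p + j
def cIndex (m p b i j : Nat) : Nat := (b * m + i) * p + j

-- Decode a flat output index t back to (b, i, j) for shape (batch, m, p):
-- b = t / (m * p), i = t % (m * p) / p, j = t % p.
def decodeC (m p t : Nat) : Nat × Nat × Nat :=
  (t / (m * p), t % (m * p) / p, t % p)
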
def Runtime.Autograd.Cuda.KernelSpec.bmmSpec (batch m n p : ℕ) (A : FlatBuffer (batch * m * n)) (B : FlatBuffer (batch * n * p)) : FlatBuffer (batch * m * p)

Pure row-major batched matrix multiplication spec.

For each output element C[b,i,j], this folds over k = 0..n-1 using IEEE32Exec.mul followed by IEEE32Exec.add. This is intentionally a specific accumulation order. cuBLAS may use a different tree/FMA strategy, so bit-for-bit agreement with this spec is an explicit native contract, not a free theorem.
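
The inner loop of that spec, sketched with Float in place of IEEE32Exec.mul and IEEE32Exec.add:

-- Left fold over k = 0, ..., n-1 of multiply-then-add. Read a k as
-- A[b,i,k] and b k as B[b,k,j] through the row-major helpers above.
def bmmEntry (n : Nat) (a b : Nat → Float) : Float :=
  (List.range n).foldl (fun acc k => acc + a k * b k) 0.0

A tree reduction or FMA contraction of the same products would generally produce different bits, which is exactly why agreement with cuBLAS is stated below as an explicit contract.
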
structure Runtime.Autograd.Cuda.KernelSpec.NativeBmmAgreement {batch m n p : ℕ} (nativeBits : NativeBitsBuffer (batch * m * p)) (A : FlatBuffer (batch * m * n)) (B : FlatBuffer (batch * n * p)) : Prop

Agreement assumption for a native BMM implementation.

The scalar result bits must match bmmSpec at every output element. This is intentionally stronger than "numerically close": it is the bitwise contract needed to reuse exact IEEE32Exec proofs.

For cuBLAS-backed kernels this assumption includes:

• row-major TorchLean buffers are interpreted consistently around cuBLAS's column-major GEMM API (see the note after this list);
• the accumulation tree/FMA behavior is compatible with the selected reference spec, or the spec is adjusted to the documented cuBLAS/toolchain behavior;
• input and output strides match (batch, m, n), (batch, n, p), and (batch, m, p) row-major layout.
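
On the first bullet: the usual way to run row-major data through a column-major GEMM without moving anything is the transpose identity C = A · B ⟺ Cᵀ = Bᵀ · Aᵀ. A row-major (m, p) buffer reads, column-major, as the (p, m) transpose, so passing B as the left operand and A as the right with swapped dimension arguments yields row-major C directly. Whether TorchLean's binding uses exactly this convention is an implementation detail; the agreement assumption here is what the proofs actually consume.
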
theorem Runtime.Autograd.Cuda.KernelSpec.native_bmm_eq_spec {batch m n p : ℕ} {nativeBits : NativeBitsBuffer (batch * m * p)} {A : FlatBuffer (batch * m * n)} {B : FlatBuffer (batch * n * p)} (h : NativeBmmAgreement nativeBits A B) : fromNativeBitsBuffer nativeBits = bmmSpec batch m n p A B

Native BMM inherits the pure row-major bmmSpec when the bitwise agreement contract holds.