CUDA float32 contract #
TorchLean's CUDA eager backend stores native float values in an opaque FFI buffer. Lean cannot
look inside CUDA kernels, C casts, libdevice calls, or cuBLAS, so the native backend is necessarily a
trusted/validated implementation boundary.
This module keeps that boundary precise:
- `IEEE32Exec` is the executable, bit-level reference model for scalar binary32;
- host `Float` inputs enter the float32 world through `IEEE32Exec.ofFloat`, matching the intended "round binary64 host literals to binary32" contract;
- external/native CUDA scalar results are represented only by their raw 32-bit result bits;
- if those native bits agree with the `IEEE32Exec` reference op, then the existing proved `IEEE32Exec` → FP32-on-ℝ theorems apply immediately.
In other words, the proof stack is:
native CUDA bits --(explicit agreement assumption / tests / toolchain contract)-->
IEEE32Exec --(proved in Lean)--> FP32 rounding-on-ℝ error bounds.
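The shape of this layering can be sketched in Lean with hypothetical stand-in names (`transfer`, `P`, `nativeBits`, `refBits` are illustrative, not TorchLean declarations): an equality hypothesis supplied by the toolchain contract composes with a Lean-proved fact by rewriting.

```lean
-- Hypothetical sketch of the layering: `agreement` is the validated
-- toolchain-level hypothesis (native bits equal reference bits), `refBound`
-- is the Lean-proved fact about the reference bits, and any property `P`
-- transfers by rewriting along the equality.
variable {α : Type} (P : α → Prop)

theorem transfer (nativeBits refBits : α)
    (agreement : nativeBits = refBits)   -- checked by tests / toolchain contract
    (refBound : P refBits)               -- proved in Lean
    : P nativeBits := by
  rw [agreement]; exact refBound
```

The point of the factoring is that only `agreement` sits in the trusted/validated zone; everything downstream of it is an ordinary Lean proof.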
What is not proved here:
- that a particular compiled CUDA kernel, C compiler, device, libdevice implementation, or cuBLAS version produces the reference bits;
- deterministic ordering for atomic reductions unless the backend uses a fixed reduction tree;
- correct-rounding for transcendental functions that IEEE-754 itself does not specify.
Those are runtime/toolchain assumptions, and the CUDA stress tests are intended to validate them against this reference contract.
Reference scalar and host conversion #
The scalar reference for CUDA float32 reasoning.
CUDA buffers are opaque to Lean; this is the scalar model we compare their 32-bit elements against.
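A minimal sketch of what such a bit-level reference scalar looks like (stand-in definitions, not the actual TorchLean declarations): the scalar is identified with its 32-bit pattern, so "compare a CUDA buffer element against the reference" is a `UInt32` equality test.

```lean
-- Hypothetical stand-in: a binary32 reference scalar is just its raw bits.
structure RefScalar where
  bits : UInt32
deriving DecidableEq

-- Extract the binary32 bit pattern used for native/reference comparisons.
def toNativeBits (x : RefScalar) : UInt32 := x.bits

-- Interpret raw native binary32 bits as the reference scalar.
def ofNativeBits (b : UInt32) : RefScalar := ⟨b⟩

-- Round-trip: interpretation and extraction are definitional inverses.
example (b : UInt32) : toNativeBits (ofNativeBits b) = b := rfl
```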
Interpret raw native binary32 bits as the IEEE32Exec reference scalar.
Extract the binary32 bit pattern used for native/reference comparisons.
Host Float upload is exactly the IEEE32Exec.ofFloat conversion.
Host Float upload bits are exactly the reference binary32 conversion bits.
Runtime Lean Float32 values can also be reinterpreted as the same bit-level reference scalar.
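The upload contract can be illustrated as follows, assuming a Lean toolchain whose core `Float32` type provides `Float.toFloat32` and `Float32.toBits` (hedged: exact names depend on the toolchain version; `hostUploadBits` is a hypothetical helper, not a TorchLean declaration).

```lean
-- Hypothetical illustration: uploading a host binary64 `Float` must produce
-- exactly the bits of its binary64 → binary32 rounding, nothing else.
def hostUploadBits (x : Float) : UInt32 :=
  x.toFloat32.toBits  -- round host binary64 to binary32, expose raw bits
```

The lemmas above then say that this is bit-for-bit the same conversion as `IEEE32Exec.ofFloat`.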
Abstract native scalar semantics #
Abstract result bits for native CUDA scalar primitives.
This deliberately does not claim that CUDA has been proved correct in Lean. It provides an explicit comparison point where
the FFI/runtime implementation can be compared against the IEEE32Exec reference, one result bit
pattern at a time. Vector/tensor kernels lift this elementwise, except reductions whose order must
also be specified.
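A sketch of this abstraction, using stand-in types (the field names follow the agreement fields documented below; `RefScalar` is a hypothetical placeholder for the reference type): each primitive op is modeled only by the 32-bit pattern it returns, with no claim about how the CUDA runtime computed it.

```lean
-- Hypothetical stand-in for the reference scalar type.
structure RefScalar where
  bits : UInt32

-- Abstract native semantics: every primitive is a black box returning bits.
structure NativeScalarSemantics where
  addBits  : RefScalar → RefScalar → UInt32
  mulBits  : RefScalar → RefScalar → UInt32
  divBits  : RefScalar → RefScalar → UInt32
  fmaBits  : RefScalar → RefScalar → RefScalar → UInt32
  sqrtBits : RefScalar → UInt32
```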
The trusted/validated CUDA scalar agreement assumption.
For a concrete CUDA build, these fields are what parity tests, compiler flags, and backend policy are
checking: native result bits match the executable IEEE32Exec reference for primitive float32 ops.
- `add_bits (x y : RefScalar) : native.addBits x y = toNativeBits (TorchLean.Floats.IEEE754.IEEE32Exec.add x y)`
- `mul_bits (x y : RefScalar) : native.mulBits x y = toNativeBits (TorchLean.Floats.IEEE754.IEEE32Exec.mul x y)`
- `div_bits (x y : RefScalar) : native.divBits x y = toNativeBits (TorchLean.Floats.IEEE754.IEEE32Exec.div x y)`
- `fma_bits (x y z : RefScalar) : native.fmaBits x y z = toNativeBits (TorchLean.Floats.IEEE754.IEEE32Exec.fma x y z)`
- `sqrt_bits (x : RefScalar) : native.sqrtBits x = toNativeBits (TorchLean.Floats.IEEE754.IEEE32Exec.sqrt x)`
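The overall shape of such an agreement record can be sketched with stand-in types (hypothetical `RefScalar`, `NativeScalarSemantics`, `refAdd`/`refMul` in place of the `IEEE32Exec` ops; shown for add and mul, with div, fma, and sqrt analogous): a `Prop`-valued structure whose fields pin each native result bit pattern to the bits of the executable reference op.

```lean
-- Hypothetical stand-ins, not the TorchLean declarations.
structure RefScalar where
  bits : UInt32

structure NativeScalarSemantics where
  addBits : RefScalar → RefScalar → UInt32
  mulBits : RefScalar → RefScalar → UInt32

/-- `refAdd`/`refMul` stand in for the executable reference ops. -/
structure Agrees (refAdd refMul : RefScalar → RefScalar → RefScalar)
    (native : NativeScalarSemantics) : Prop where
  add_bits : ∀ x y, native.addBits x y = (refAdd x y).bits
  mul_bits : ∀ x y, native.mulBits x y = (refMul x y).bits
```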
Native addition agrees with the reference value when its result bits satisfy the contract.
Native multiplication agrees with the reference value when its result bits satisfy the contract.
Native division agrees with the reference value when its result bits satisfy the contract.
Native fused multiply-add agrees with the reference value when its result bits satisfy the contract.
Native square root agrees with the reference value when its result bits satisfy the contract.
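The common pattern behind these five lemmas can be sketched with stand-in definitions (`RefScalar`, `ofNativeBits`, `native_agrees` are illustrative names): the reference scalar is determined by its 32 bits, so bit-level agreement gives value-level agreement once the native bits are reinterpreted.

```lean
-- Hypothetical stand-ins, not the TorchLean declarations.
structure RefScalar where
  bits : UInt32

def ofNativeBits (b : UInt32) : RefScalar := ⟨b⟩

-- Equal bits means the reinterpreted native result IS the reference scalar.
theorem native_agrees (nativeBits : UInt32) (ref : RefScalar)
    (h : nativeBits = ref.bits) : ofNativeBits nativeBits = ref := by
  cases ref
  subst h
  rfl
```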
Inheriting the proved IEEE32Exec → FP32 bounds #
If native CUDA addition matches the IEEE32Exec result bits and the result is finite, then it has
the standard binary32 half-ULP absolute error bound against real addition.
If native CUDA multiplication matches the IEEE32Exec result bits and the result is finite, then it
has the standard binary32 half-ULP absolute error bound against real multiplication.
If native CUDA division matches the IEEE32Exec result bits and the result is finite, then it has
the standard binary32 half-ULP absolute error bound against real division.
If native CUDA FMA matches the IEEE32Exec result bits and the result is finite, then it has the
standard binary32 half-ULP absolute error bound against the real computation `x*y + z`.
If native CUDA square root matches the IEEE32Exec result bits and the result is finite, then it
has the standard binary32 half-ULP absolute error bound against real square root.
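Concretely, the "standard binary32 half-ULP absolute error bound" referred to in these theorems has, for round-to-nearest and a finite rounded result, the shape

$$\bigl|\,\mathrm{fl}_{32}(x \circ y) - (x \circ y)\,\bigr| \;\le\; \tfrac{1}{2}\,\mathrm{ulp}(x \circ y), \qquad \circ \in \{+,\ \times,\ /\},$$

with the analogous single-rounding statements for fma (the real $x \cdot y + z$ rounded once) and for square root; in the normal range this is equivalent to a relative bound with unit roundoff $u = 2^{-24}$.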