Deterministic Reductions Mode #
TorchLean's CUDA runtime uses atomicAdd in a few kernels to accumulate float32 results. This is
fast, but floating-point addition is non-associative, and CUDA does not fix a global order for the
interleaving of atomic updates. As a result, some kernels can be bit-nondeterministic across runs.
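The non-associativity that makes atomic accumulation order matter can be seen with plain doubles; this is a minimal illustrative demo (Python, not TorchLean code):

```python
# Floating-point addition is non-associative: grouping changes the result,
# so an unordered sequence of atomicAdds can change the final bits.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)   # False: 0.6000000000000001 vs 0.6
```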
TorchLean therefore exposes an opt-in deterministic mode that replaces those atomic accumulation paths with fixed-order reductions. This trades performance for reproducibility.
This flag is a runtime setting affecting only the CUDA/stub backends; it has no effect on the pure Lean Spec.
Enable/disable deterministic reductions mode and return the observed flag value.
Why this helper exists: the raw setter returns Unit, so if you write let _ := set... in
Lean, the compiler is free (under pure semantics) to reorder or eliminate that call. The runtime
therefore provides a *_checked wrapper that both sets the flag and returns the observed value,
giving us a single call with an explicit return value dependency.
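Python has no analogue of Lean's pure-semantics dead-code elimination, but the set-then-observe shape of the checked wrapper can be sketched; the function and flag names below are hypothetical stand-ins, not TorchLean's API:

```python
# Hypothetical sketch of the *_checked pattern: perform the set, then
# return the value actually observed afterwards, so the caller depends
# on a concrete result instead of a Unit-returning side effect.
_flags = {"deterministic_reductions": False}

def set_deterministic_reductions_checked(enable: bool) -> bool:
    _flags["deterministic_reductions"] = enable   # the raw setter
    return _flags["deterministic_reductions"]     # observed flag value

observed = set_deterministic_reductions_checked(True)
print(observed)  # True
```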
Enable/disable deterministic reductions mode (see module docstring).
Query whether deterministic reductions mode is enabled.
Create a device buffer by copying from a host FloatArray (casts each element to float32).
Copy a buffer back to a host FloatArray (casts float32 elements to Float).
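The Float-to-float32 cast on upload is lossy, so a host→device→host round trip is not always bit-exact. A small sketch of that rounding, using the standard library's `struct` to pass a value through IEEE-754 binary32:

```python
import struct

def to_f32(x: float) -> float:
    # Round a Python float (binary64) through IEEE-754 binary32,
    # mimicking the float32 cast done on buffer upload.
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1
y = to_f32(x)
print(x == y)  # False: 0.1 is not exactly representable in float32
```

Values exactly representable in float32 (e.g. 0.5, small integers) survive the round trip unchanged; others pick up an error on the order of the float32 epsilon.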
Number of float32 elements in the buffer.
Release the device allocation held by a buffer, returning 1 when a live allocation was released.
This is a runtime pressure valve for eager training loops that create many short-lived CUDA buffers. The C finalizer is still safe after an explicit release because the pointer is nulled out.
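The release contract described above, returning 1 for the first release of a live allocation and staying safe afterwards because the pointer is nulled, can be sketched as an idempotent release (hypothetical class, not the actual FFI object):

```python
class Buffer:
    """Sketch of explicit-release semantics: release() returns 1 on the
    first release of a live allocation and 0 thereafter; nulling the
    pointer keeps a later finalizer call harmless."""
    def __init__(self):
        self._ptr = object()   # stands in for a device pointer

    def release(self) -> int:
        if self._ptr is None:  # already released: finalizer-safe no-op
            return 0
        self._ptr = None       # null out so double release is benign
        return 1

buf = Buffer()
print(buf.release())  # 1: live allocation released
print(buf.release())  # 0: already released
```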
Release the scratch buffer and return keep unchanged.
This exists for pure CUDA tape code: because the returned buffer is used downstream, Lean cannot erase the native release call as dead code.
Ask the Lean runtime allocator (mimalloc) to collect abandoned/free pages.
This does not change any TorchLean value. It is a pressure valve for long native eager loops where many short-lived tape closures and external-buffer wrappers are created every step.
Collect the native allocator's free pages.
Allocate a length-n buffer filled with zeros.
Deterministic RNG (device-side) #
These are low-level building blocks used by TorchLean's seeded RNG ops (rand_uniform,
bernoulli_mask) when running on the eager CUDA backend.
They intentionally use the same SplitMix64-style mixing as TorchLean.Random so results are
deterministic given (seed, counter) and a row-major linear index.
Deterministic U[0,1) generator: returns a length-n buffer (float32) keyed by key.
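The exact mixing in TorchLean.Random is not reproduced here; the sketch below uses the standard SplitMix64 finalizer with a hypothetical (seed, counter, index) keying to show the property that matters: the output is a pure function of the key and the linear index, so it is reproducible across runs and devices.

```python
MASK = (1 << 64) - 1

def splitmix64(state: int) -> int:
    # Standard SplitMix64 mix: additive constant, then two xor-shift-multiply
    # rounds, then a final xor-shift.
    z = (state + 0x9E3779B97F4A7C15) & MASK
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

def uniform01(seed: int, counter: int, idx: int) -> float:
    # Hypothetical keying: fold (seed, counter) and the row-major linear
    # index into one 64-bit state, then map the top 53 bits to [0, 1).
    z = splitmix64(((seed ^ splitmix64(counter)) + idx) & MASK)
    return (z >> 11) * (1.0 / (1 << 53))

print(uniform01(42, 0, 7) == uniform01(42, 0, 7))  # True: fully deterministic
```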
Unary ops.
Backward for abs: dx = sign(x) * dLdy (with sign(0)=0).
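A scalar sketch of this formula (illustrative Python, not the CUDA kernel), including the sign(0) = 0 convention:

```python
def sign(x: float) -> float:
    # sign with the convention sign(0) = 0
    return 0.0 if x == 0.0 else (1.0 if x > 0.0 else -1.0)

def abs_backward(x: float, dLdy: float) -> float:
    # dx = sign(x) * dLdy
    return sign(x) * dLdy

print(abs_backward(-3.0, 2.0))  # -2.0
print(abs_backward(0.0, 2.0))   # 0.0: no gradient at the kink
```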
Backward for sqrt.
Uses the TorchLean convention: dx = dLdy * (1 / (2*sqrt(x))) for x > 0, else 0.
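The stated convention can be checked against a finite difference at a point where sqrt is differentiable (illustrative scalar Python):

```python
import math

def sqrt_backward(x: float, dLdy: float) -> float:
    # dx = dLdy * (1 / (2*sqrt(x))) for x > 0, else 0
    return dLdy / (2.0 * math.sqrt(x)) if x > 0.0 else 0.0

# Finite-difference check at x = 4: d(sqrt)/dx = 1/(2*2) = 0.25
x, h = 4.0, 1e-6
fd = (math.sqrt(x + h) - math.sqrt(x)) / h
print(abs(sqrt_backward(x, 1.0) - fd) < 1e-5)  # True
print(sqrt_backward(0.0, 1.0))                 # 0.0 by convention
```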
Reciprocal: 1/x.
Binary elementwise ops (sizes must match).
Elementwise division.
Elementwise ReLU.
Backward for relu: dx = dLdy where x > 0, else 0.
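A scalar sketch of the relu backward rule (illustrative Python; the gradient at x = 0 is taken as 0, matching the strict x > 0 condition):

```python
def relu_backward(x: float, dLdy: float) -> float:
    # dx = dLdy where x > 0, else 0
    return dLdy if x > 0.0 else 0.0

print([relu_backward(x, 1.5) for x in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 1.5]
```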
Elementwise addition (sizes must match).
Elementwise subtraction (sizes must match).
Elementwise multiplication (sizes must match).
Device-to-device copy, implemented as a scale-by-one kernel.
Reductions (return a length-1 buffer).
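Reductions are where accumulation order is visible, and why deterministic mode pins it. A sketch in Python (emulating float32 accumulation via a binary32 round step; not the actual kernel) showing that summing the same data in two orders gives two different float32 results:

```python
import struct

def f32(x: float) -> float:
    # Round through IEEE-754 binary32 to mimic float32 accumulation.
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(xs, order):
    acc = 0.0
    for i in order:
        acc = f32(acc + xs[i])
    return acc

xs = [f32(v) for v in (1e8, 1.0, -1e8, 1.0)]
fwd = f32_sum(xs, range(len(xs)))            # 1e8 + 1 loses the 1 in float32
rev = f32_sum(xs, reversed(range(len(xs))))
print(fwd, rev)  # 1.0 0.0 — same data, different order, different result
```

A fixed-order reduction always picks one such ordering, making the length-1 result bit-stable across runs at the cost of the throughput that unordered atomics provide.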