Deterministic Reductions Mode #
TorchLean's CUDA runtime uses atomicAdd in a few kernels to accumulate float32 results. This is
fast, but floating-point addition is non-associative, and CUDA does not fix a global order for the
interleaving of atomic updates. As a result, some kernels can be bit-nondeterministic across runs.
TorchLean therefore exposes an opt-in deterministic mode that replaces those atomic accumulation paths with fixed-order reductions. This trades performance for reproducibility.
This flag is a runtime setting affecting only the CUDA/stub backends; it has no effect on the pure Lean Spec.
Enable/disable deterministic reductions mode and return the observed flag value.
Why this helper exists: the raw setter returns Unit, so if you write let _ := set... in
Lean, the compiler is free (under pure semantics) to reorder or eliminate that call. The runtime
therefore provides a *_checked wrapper that both sets the flag and returns the observed value,
giving us a single call with an explicit return value dependency.
Instances For
Enable/disable deterministic reductions mode (see module docstring).
Instances For
Query whether deterministic reductions mode is enabled.
Instances For
Allocator Telemetry #
Snapshot of the CUDA buffer allocator.
liveBytes/peakBytes count TorchLean buffers created by this runtime layer. deviceFreeBytes
and deviceTotalBytes come from cudaMemGetInfo in the CUDA build and are 0 in the CPU stub.
Together they let long-running examples distinguish a TorchLean lifetime leak from broader CUDA
memory pressure or fragmentation.
- liveBytes : UInt64
- peakBytes : UInt64
- allocCount : UInt64
- freeCount : UInt64
- deviceFreeBytes : UInt64
- deviceTotalBytes : UInt64
Instances For
Instances For
Read the current CUDA allocator counters.
token is ignored by the native implementation. It exists so call sites that sample repeatedly
can pass a changing value (for example, the training step), preventing Lean from treating repeated
FFI reads as identical pure expressions.
Instances For
Read the current CUDA allocator counters. Prefer allocatorStatsWithToken in repeated loops.
Instances For
Format a byte count as MiB for allocator progress messages.
Instances For
One-line allocator report for progress logs.
Instances For
Create a device buffer by copying from a host FloatArray (casts each element to float32).
This primitive has a pure Lean type, but the native implementation allocates a fresh device buffer.
Runtime code that repeatedly uploads the same host value should prefer ofFloatArrayIO, which adds
an IO token so two uploads cannot be collapsed into the same external object after one is released.
Effectful host-to-device upload.
The token is ignored by C/CUDA. Its purpose is semantic: repeated uploads of the same FloatArray
must still allocate distinct device buffers. Without a changing token, Lean can treat the extern as
a pure expression, which is not the ownership model we want for long eager CUDA training loops.
Instances For
Copy a buffer back to a host FloatArray (casts float32 elements to Float).
Number of float32 elements in the buffer.
Release the device allocation held by a buffer, returning 1 when a live allocation was released.
This is a runtime pressure valve for eager training loops that create many short-lived CUDA buffers. The C finalizer is still safe after an explicit release because the pointer is nulled out.
Release workspace and return keep.
This exists for pure CUDA tape code: because the returned buffer is used downstream, Lean cannot erase the native release call as dead code.
Release a collection of workspace buffers and return keep.
Many CUDA tape formulas create a group of intermediate buffers, then continue with one final result buffer. Threading cleanup through the result keeps ownership local to the formula and avoids waiting for external-object finalizers in long training loops.
Instances For
A CUDA result together with workspace buffers that were needed to compute it.
This is the common ownership shape for eager CUDA formulas. Some forward computations need intermediate buffers again during the backward pass, so the tape keeps those buffers on the node and releases them when the node is retired. Backward formulas use the same shape when they recompute a value only to differentiate through it.
- value : Buffer
Instances For
Return keep after releasing all workspace buffers owned by this result.
Instances For
Return keep after releasing both the result buffer and its workspace buffers.
Instances For
Ask the Lean runtime allocator (mimalloc) to collect abandoned/free pages.
This does not change any TorchLean value. It is a pressure valve for long native eager loops where many short-lived tape closures and external-buffer wrappers are created every step.
In the CUDA build, force = true also releases cached device blocks held by the native buffer pool.
That gives training code a way to trade reuse for returning memory to the CUDA driver at clear
phase boundaries.
Collect the native allocator's free pages.
Instances For
Allocate a length-n buffer filled with zeros.
Deterministic RNG (device-side) #
These are low-level building blocks used by TorchLean's seeded RNG ops (rand_uniform,
bernoulli_mask) when running on the eager CUDA backend.
They use the same SplitMix64-style mixing as TorchLean.Random so results are
deterministic given (seed, counter) and a row-major linear index.
Deterministic U[0,1) generator: returns a length-n buffer (float32) keyed by key.
Absolute value applied pointwise to a CUDA buffer.
Backward for abs: dx = sign(x) * dLdy (with sign(0)=0).
Backward for sqrt.
Uses the TorchLean convention: dx = dLdy * (1 / (2*sqrt(x))) for x > 0, else 0.
Reciprocal: 1/x.
Pointwise maximum of two equal-length CUDA buffers.
Pointwise division of two equal-length CUDA buffers.
Pointwise ReLU activation on a CUDA buffer.
Backward for relu: dx = dLdy where x > 0, else 0.
Elementwise addition (sizes must match).
Elementwise subtraction (sizes must match).
Elementwise multiplication (sizes must match).
Device-to-device copy, implemented as a scale-by-one kernel.
Instances For
Reductions (return a length-1 buffer).