TorchLean API

Docs Home Guide Examples Graphs

NN.Runtime.Autograd.Engine.Cuda.Buffer

Deterministic Reductions Mode #

TorchLean's CUDA runtime uses atomicAdd in a few kernels to accumulate float32 results. This is fast, but floating-point addition is non-associative, and CUDA does not fix a global order for the interleaving of atomic updates. As a result, some kernels can be bit-nondeterministic across runs.

TorchLean therefore exposes an opt-in deterministic mode that replaces those atomic accumulation paths with fixed-order reductions. This trades performance for reproducibility.

This flag is a runtime setting affecting only the CUDA/stub backends; it has no effect on the pure Lean Spec.

@[extern torchlean_cuda_set_deterministic_reductions]

opaque Runtime.Autograd.Cuda.Buffer.setDeterministicReductionsRaw (on : UInt32) :

@[extern torchlean_cuda_get_deterministic_reductions_u]

opaque Runtime.Autograd.Cuda.Buffer.getDeterministicReductionsRaw (u : UInt32) :

@[extern torchlean_cuda_set_deterministic_reductions_checked]

opaque Runtime.Autograd.Cuda.Buffer.setDeterministicReductionsCheckedRaw (on : UInt32) :

def Runtime.Autograd.Cuda.Buffer.setDeterministicReductionsChecked (on : Bool) :

Enable/disable deterministic reductions mode and return the observed flag value.

Why this helper exists: the raw setter returns Unit, so if you write let _ := set... in Lean, the compiler is free (under pure semantics) to reorder or eliminate that call. The runtime therefore provides a *_checked wrapper that both sets the flag and returns the observed value, giving us a single call with an explicit return value dependency.

Instances For

def Runtime.Autograd.Cuda.Buffer.setDeterministicReductions (on : Bool) :

Enable/disable deterministic reductions mode (see module docstring).

Instances For

def Runtime.Autograd.Cuda.Buffer.getDeterministicReductions :

Query whether deterministic reductions mode is enabled.

Instances For

@[extern torchlean_cuda_buffer_of_float_array]

opaque Runtime.Autograd.Cuda.Buffer.ofFloatArray (a : FloatArray) :

Create a device buffer by copying from a host FloatArray (casts each element to float32).

@[extern torchlean_cuda_buffer_to_float_array]

opaque Runtime.Autograd.Cuda.Buffer.toFloatArray (b : Buffer) :

Copy a buffer back to a host FloatArray (casts float32 elements to Float).

@[extern torchlean_cuda_buffer_size]

opaque Runtime.Autograd.Cuda.Buffer.size (b : Buffer) :

Number of float32 elements in the buffer.

@[extern torchlean_cuda_buffer_release]

opaque Runtime.Autograd.Cuda.Buffer.release (b : Buffer) :

Release the device allocation held by a buffer, returning 1 when a live allocation was released.

This is a runtime pressure valve for eager training loops that create many short-lived CUDA buffers. The C finalizer is still safe after an explicit release because the pointer is nulled out.

@[extern torchlean_cuda_buffer_release_then]

opaque Runtime.Autograd.Cuda.Buffer.releaseThen (scratch keep : Buffer) :

Release scratch and return keep.

This exists for pure CUDA tape code: because the returned buffer is used downstream, Lean cannot erase the native release call as dead code.

@[extern torchlean_runtime_collect_allocator]

opaque Runtime.Autograd.Cuda.Buffer.collectAllocatorRaw (force : UInt32) :

Ask the Lean runtime allocator (mimalloc) to collect abandoned/free pages.

This does not change any TorchLean value. It is a pressure valve for long native eager loops where many short-lived tape closures and external-buffer wrappers are created every step.

def Runtime.Autograd.Cuda.Buffer.collectAllocator (force : Bool := true) :

Collect the native allocator's free pages.

Instances For

@[extern torchlean_cuda_buffer_zeros]

opaque Runtime.Autograd.Cuda.Buffer.zeros (n : UInt32) :

Allocate a length-n buffer filled with zeros.

@[extern torchlean_cuda_buffer_full]

opaque Runtime.Autograd.Cuda.Buffer.full (n : UInt32) (v : Float) :

Allocate a length-n buffer filled with v (host Float, cast to float32).

Deterministic RNG (device-side) #

These are low-level building blocks used by TorchLean's seeded RNG ops (rand_uniform, bernoulli_mask) when running on the eager CUDA backend.

They intentionally use the same SplitMix64-style mixing as TorchLean.Random so results are deterministic given (seed, counter) and a row-major linear index.

@[extern torchlean_cuda_buffer_rand_uniform]

opaque Runtime.Autograd.Cuda.Buffer.randUniform (n : UInt32) (key : UInt64) :

Deterministic U[0,1) generator: returns a length-n buffer (float32) keyed by key.

@[extern torchlean_cuda_buffer_bernoulli_mask]

opaque Runtime.Autograd.Cuda.Buffer.bernoulliMask (n : UInt32) (keepProb : Float) (key : UInt64) :

Deterministic {0,1} mask generator: returns a length-n buffer keyed by key.

@[extern torchlean_cuda_buffer_abs]

opaque Runtime.Autograd.Cuda.Buffer.abs (b : Buffer) :

Unary ops.

@[extern torchlean_cuda_buffer_abs_bwd]

opaque Runtime.Autograd.Cuda.Buffer.absBwd (x dLdy : Buffer) :

Backward for abs: dx = sign(x) * dLdy (with sign(0)=0).

@[extern torchlean_cuda_buffer_sqrt]

opaque Runtime.Autograd.Cuda.Buffer.sqrt (b : Buffer) :

@[extern torchlean_cuda_buffer_sqrt_bwd]

opaque Runtime.Autograd.Cuda.Buffer.sqrtBwd (x dLdy : Buffer) :

Backward for sqrt.

Uses the TorchLean convention: dx = dLdy * (1 / (2*sqrt(x))) for x > 0, else 0.

@[extern torchlean_cuda_buffer_exp]

opaque Runtime.Autograd.Cuda.Buffer.exp (b : Buffer) :

@[extern torchlean_cuda_buffer_log]

opaque Runtime.Autograd.Cuda.Buffer.log (b : Buffer) :

@[extern torchlean_cuda_buffer_inv]

opaque Runtime.Autograd.Cuda.Buffer.inv (b : Buffer) :

Reciprocal: 1/x.

@[extern torchlean_cuda_buffer_clamp]

opaque Runtime.Autograd.Cuda.Buffer.clamp (b : Buffer) (lo hi : Float) :

Clamp each element to [lo, hi] (bounds are host Floats).

@[extern torchlean_cuda_buffer_clamp_bwd]

opaque Runtime.Autograd.Cuda.Buffer.clampBwd (x dLdy : Buffer) (lo hi : Float) :

Backward for clamp.

Uses the TorchLean convention: derivative is 1 strictly inside (lo, hi), else 0.

@[extern torchlean_cuda_buffer_max]

opaque Runtime.Autograd.Cuda.Buffer.max (a b : Buffer) :

Binary elementwise ops (sizes must match).

@[extern torchlean_cuda_buffer_max_bwd]

opaque Runtime.Autograd.Cuda.Buffer.maxBwd (a b dLdy : Buffer) :

Buffer × Buffer

Backward for max, returning (dA, dB).

Tie-breaking follows the spec: when a = b, split upstream gradient evenly (0.5) between both.

@[extern torchlean_cuda_buffer_min]

opaque Runtime.Autograd.Cuda.Buffer.min (a b : Buffer) :

@[extern torchlean_cuda_buffer_min_bwd]

opaque Runtime.Autograd.Cuda.Buffer.minBwd (a b dLdy : Buffer) :

Buffer × Buffer

Backward for min, returning (dA, dB).

Tie-breaking follows the spec: when a = b, split upstream gradient evenly (0.5) between both.

@[extern torchlean_cuda_buffer_div]

opaque Runtime.Autograd.Cuda.Buffer.div (a b : Buffer) :

Elementwise division.

@[extern torchlean_cuda_buffer_relu]

opaque Runtime.Autograd.Cuda.Buffer.relu (b : Buffer) :

Elementwise ReLU.

@[extern torchlean_cuda_buffer_relu_bwd]

opaque Runtime.Autograd.Cuda.Buffer.reluBwd (x dLdy : Buffer) :

Backward for relu: dx = dLdy where x > 0, else 0.

@[extern torchlean_cuda_buffer_add]

opaque Runtime.Autograd.Cuda.Buffer.add (a b : Buffer) :

Elementwise addition (sizes must match).

@[extern torchlean_cuda_buffer_sub]

opaque Runtime.Autograd.Cuda.Buffer.sub (a b : Buffer) :

Elementwise subtraction (sizes must match).

@[extern torchlean_cuda_buffer_mul]

opaque Runtime.Autograd.Cuda.Buffer.mul (a b : Buffer) :

Elementwise multiplication (sizes must match).

@[extern torchlean_cuda_buffer_scale]

opaque Runtime.Autograd.Cuda.Buffer.scale (b : Buffer) (c : Float) :

Multiply each element by a scalar c (host Float, cast to float32).

This is a primitive building block for many ops (e.g. scaling gradients).

def Runtime.Autograd.Cuda.Buffer.copy (b : Buffer) :

Device-to-device copy, implemented as a scale-by-one kernel.

Instances For

@[extern torchlean_cuda_buffer_axpy]

opaque Runtime.Autograd.Cuda.Buffer.axpy (a b : Buffer) (c : Float) :

Fused multiply-add: a + c * b (sizes must match; c is a host Float, cast to float32).

This is the classic BLAS-style axpy primitive and is useful for optimizers and bias-like updates.

@[extern torchlean_cuda_buffer_reduce_sum]

opaque Runtime.Autograd.Cuda.Buffer.reduceSum (b : Buffer) :

Reductions (return a length-1 buffer).

@[extern torchlean_cuda_buffer_reduce_mean]

opaque Runtime.Autograd.Cuda.Buffer.reduceMean (b : Buffer) :