TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Buffer

Deterministic Reductions Mode #

TorchLean's CUDA runtime uses atomicAdd in a few kernels to accumulate float32 results. This is fast, but floating-point addition is non-associative, and CUDA does not fix a global order for the interleaving of atomic updates. As a result, some kernels can be bit-nondeterministic across runs.

TorchLean therefore exposes an opt-in deterministic mode that replaces those atomic accumulation paths with fixed-order reductions. This trades performance for reproducibility.

This flag is a runtime setting affecting only the CUDA/stub backends; it has no effect on the pure Lean Spec.

@[extern torchlean_cuda_set_deterministic_reductions]
@[extern torchlean_cuda_get_deterministic_reductions_u]
@[extern torchlean_cuda_set_deterministic_reductions_checked]

Enable/disable deterministic reductions mode and return the observed flag value.

Why this helper exists: the raw setter returns Unit, so if you write let _ := set... in Lean, the compiler is free (under pure semantics) to reorder or eliminate that call. The runtime therefore provides a *_checked wrapper that both sets the flag and returns the observed value, giving us a single call with an explicit return value dependency.

Instances For

    Enable/disable deterministic reductions mode (see module docstring).

    Instances For

      Query whether deterministic reductions mode is enabled.

      Instances For

        Allocator Telemetry #

        @[extern torchlean_cuda_allocator_live_bytes]
        @[extern torchlean_cuda_allocator_peak_bytes]
        @[extern torchlean_cuda_allocator_alloc_count]
        @[extern torchlean_cuda_allocator_free_count]
        @[extern torchlean_cuda_allocator_device_free_bytes]
        @[extern torchlean_cuda_allocator_device_total_bytes]

        Snapshot of the CUDA buffer allocator.

        liveBytes/peakBytes count TorchLean buffers created by this runtime layer. deviceFreeBytes and deviceTotalBytes come from cudaMemGetInfo in the CUDA build and are 0 in the CPU stub. Together they let long-running examples distinguish a TorchLean lifetime leak from broader CUDA memory pressure or fragmentation.

        Instances For

          Read the current CUDA allocator counters.

          token is ignored by the native implementation. It exists so call sites that sample repeatedly can pass a changing value (for example, the training step), preventing Lean from treating repeated FFI reads as identical pure expressions.

          Instances For

            Read the current CUDA allocator counters. Prefer allocatorStatsWithToken in repeated loops.

            Instances For

              Format a byte count as MiB for allocator progress messages.

              Instances For

                One-line allocator report for progress logs.

                Instances For
                  @[extern torchlean_cuda_buffer_of_float_array]

                  Create a device buffer by copying from a host FloatArray (casts each element to float32).

                  This primitive has a pure Lean type, but the native implementation allocates a fresh device buffer. Runtime code that repeatedly uploads the same host value should prefer ofFloatArrayIO, which adds an IO token so two uploads cannot be collapsed into the same external object after one is released.

                  @[extern torchlean_cuda_buffer_of_float_array_with_token]

                  Effectful host-to-device upload.

                  The token is ignored by C/CUDA. Its purpose is semantic: repeated uploads of the same FloatArray must still allocate distinct device buffers. Without a changing token, Lean can treat the extern as a pure expression, which is not the ownership model we want for long eager CUDA training loops.

                  Instances For
                    @[extern torchlean_cuda_buffer_to_float_array]

                    Copy a buffer back to a host FloatArray (casts float32 elements to Float).

                    @[extern torchlean_cuda_buffer_size]

                    Number of float32 elements in the buffer.

                    @[extern torchlean_cuda_buffer_release]

                    Release the device allocation held by a buffer, returning 1 when a live allocation was released.

                    This is a runtime pressure valve for eager training loops that create many short-lived CUDA buffers. The C finalizer is still safe after an explicit release because the pointer is nulled out.

                    @[extern torchlean_cuda_buffer_release_then]

                    Release workspace and return keep.

                    This exists for pure CUDA tape code: because the returned buffer is used downstream, Lean cannot erase the native release call as dead code.

                    Release a collection of workspace buffers and return keep.

                    Many CUDA tape formulas create a group of intermediate buffers, then continue with one final result buffer. Threading cleanup through the result keeps ownership local to the formula and avoids waiting for external-object finalizers in long training loops.

                    Instances For

                      A CUDA result together with workspace buffers that were needed to compute it.

                      This is the common ownership shape for eager CUDA formulas. Some forward computations need intermediate buffers again during the backward pass, so the tape keeps those buffers on the node and releases them when the node is retired. Backward formulas use the same shape when they recompute a value only to differentiate through it.

                      Instances For

                        Return keep after releasing all workspace buffers owned by this result.

                        Instances For

                          Return keep after releasing both the result buffer and its workspace buffers.

                          Instances For
                            @[extern torchlean_runtime_collect_allocator]

                            Ask the Lean runtime allocator (mimalloc) to collect abandoned/free pages.

                            This does not change any TorchLean value. It is a pressure valve for long native eager loops where many short-lived tape closures and external-buffer wrappers are created every step.

                            In the CUDA build, force = true also releases cached device blocks held by the native buffer pool. That gives training code a way to trade reuse for returning memory to the CUDA driver at clear phase boundaries.

                            Collect the native allocator's free pages.

                            Instances For
                              @[extern torchlean_cuda_buffer_zeros]

                              Allocate a length-n buffer filled with zeros.

                              @[extern torchlean_cuda_buffer_full]

                              Allocate a length-n buffer filled with v (host Float, cast to float32).

                              Deterministic RNG (device-side) #

                              These are low-level building blocks used by TorchLean's seeded RNG ops (rand_uniform, bernoulli_mask) when running on the eager CUDA backend.

                              They use the same SplitMix64-style mixing as TorchLean.Random so results are deterministic given (seed, counter) and a row-major linear index.

                              @[extern torchlean_cuda_buffer_rand_uniform]

                              Deterministic U[0,1) generator: returns a length-n buffer (float32) keyed by key.

                              @[extern torchlean_cuda_buffer_bernoulli_mask]

                              Deterministic {0,1} mask generator: returns a length-n buffer keyed by key.

                              @[extern torchlean_cuda_buffer_abs]

                              Absolute value applied pointwise to a CUDA buffer.

                              @[extern torchlean_cuda_buffer_abs_bwd]

                              Backward for abs: dx = sign(x) * dLdy (with sign(0)=0).

                              @[extern torchlean_cuda_buffer_sqrt]
                              @[extern torchlean_cuda_buffer_sqrt_bwd]

                              Backward for sqrt.

                              Uses the TorchLean convention: dx = dLdy * (1 / (2*sqrt(x))) for x > 0, else 0.

                              @[extern torchlean_cuda_buffer_exp]
                              @[extern torchlean_cuda_buffer_log]
                              @[extern torchlean_cuda_buffer_inv]

                              Reciprocal: 1/x.

                              @[extern torchlean_cuda_buffer_clamp]

                              Clamp each element to [lo, hi] (bounds are host Floats).

                              @[extern torchlean_cuda_buffer_clamp_bwd]

                              Backward for clamp.

                              Uses the TorchLean convention: derivative is 1 strictly inside (lo, hi), else 0.

                              @[extern torchlean_cuda_buffer_max]

                              Pointwise maximum of two equal-length CUDA buffers.

                              @[extern torchlean_cuda_buffer_max_bwd]

                              Backward for max, returning (dA, dB).

                              Tie-breaking follows the spec: when a = b, split upstream gradient evenly (0.5) between both.

                              @[extern torchlean_cuda_buffer_min]
                              @[extern torchlean_cuda_buffer_min_bwd]

                              Backward for min, returning (dA, dB).

                              Tie-breaking follows the spec: when a = b, split upstream gradient evenly (0.5) between both.

                              @[extern torchlean_cuda_buffer_div]

                              Pointwise division of two equal-length CUDA buffers.

                              @[extern torchlean_cuda_buffer_relu]

                              Pointwise ReLU activation on a CUDA buffer.

                              @[extern torchlean_cuda_buffer_relu_bwd]

                              Backward for relu: dx = dLdy where x > 0, else 0.

                              @[extern torchlean_cuda_buffer_add]

                              Elementwise addition (sizes must match).

                              @[extern torchlean_cuda_buffer_sub]

                              Elementwise subtraction (sizes must match).

                              @[extern torchlean_cuda_buffer_mul]

                              Elementwise multiplication (sizes must match).

                              @[extern torchlean_cuda_buffer_scale]

                              Multiply each element by a scalar c (host Float, cast to float32).

                              This is a primitive building block for many ops (e.g. scaling gradients).

                              Device-to-device copy, implemented as a scale-by-one kernel.

                              Instances For
                                @[extern torchlean_cuda_buffer_axpy]

                                Fused multiply-add: a + c * b (sizes must match; c is a host Float, cast to float32).

                                This is the classic BLAS-style axpy primitive and is useful for optimizers and bias-like updates.

                                @[extern torchlean_cuda_buffer_reduce_sum]

                                Reductions (return a length-1 buffer).

                                @[extern torchlean_cuda_buffer_reduce_mean]