TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Buffer

Deterministic Reductions Mode #

TorchLean's CUDA runtime uses atomicAdd in a few kernels to accumulate float32 results. This is fast, but floating-point addition is non-associative, and CUDA does not fix a global order for the interleaving of atomic updates. As a result, some kernels can be bit-nondeterministic across runs.

TorchLean therefore exposes an opt-in deterministic mode that replaces those atomic accumulation paths with fixed-order reductions. This trades performance for reproducibility.

This flag is a runtime setting affecting only the CUDA/stub backends; it has no effect on the pure Lean Spec.
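
Floating-point non-associativity is easy to observe from plain Lean, independently of the CUDA bindings; the snippet below only illustrates why the accumulation order matters and is not part of the TorchLean API.

    -- Two groupings of the same three addends round differently, so any change
    -- in the order in which atomicAdd lands partial sums can change the
    -- low-order bits of an accumulated result.
    #eval ((0.1 + 0.2) + 0.3 : Float) == (0.1 + (0.2 + 0.3) : Float)  -- false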

@[extern torchlean_cuda_set_deterministic_reductions]

Enable/disable deterministic reductions mode (see module docstring).

@[extern torchlean_cuda_get_deterministic_reductions_u]

Query whether deterministic reductions mode is enabled.

@[extern torchlean_cuda_set_deterministic_reductions_checked]

Enable/disable deterministic reductions mode and return the observed flag value.

Why this helper exists: the raw setter returns Unit, so if you write let _ := set... in Lean, the compiler is free (under pure semantics) to reorder or eliminate that call. The runtime therefore provides a *_checked wrapper that both sets the flag and returns the observed value, giving us a single call with an explicit return value dependency.
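
As a usage sketch: the extern symbol below is the checked setter documented above, but the Lean-side name and the Bool → Bool signature are assumptions made for illustration, not the library's actual binding.

    -- Hypothetical binding (name and signature are assumptions); only the
    -- extern symbol comes from this page.
    @[extern "torchlean_cuda_set_deterministic_reductions_checked"]
    opaque setDeterministicReductionsChecked : Bool → Bool

    def runDeterministic (body : Unit → Float) : Float :=
      let enabled := setDeterministicReductionsChecked true
      -- Branching on the observed flag gives the call a data dependency,
      -- so it cannot be reordered or erased under pure semantics.
      if enabled then body () else panic! "deterministic reductions not enabled"
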
@[extern torchlean_cuda_buffer_of_float_array]

Create a device buffer by copying from a host FloatArray (casts each element to float32).

@[extern torchlean_cuda_buffer_to_float_array]

Copy a buffer back to a host FloatArray (casts float32 elements to Float).
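
A minimal sketch of what the Lean-side bindings plausibly look like and how a host round trip behaves. The extern symbols are the ones documented above; the Buffer type declaration, the Lean names, and the exact signatures are assumptions for illustration.

    -- Illustrative opaque handle type in the usual Lean FFI style (assumption).
    opaque BufferPointed : NonemptyType
    def Buffer : Type := BufferPointed.type
    instance : Nonempty Buffer := BufferPointed.property

    @[extern "torchlean_cuda_buffer_of_float_array"]
    opaque Buffer.ofFloatArray (xs : FloatArray) : Buffer

    @[extern "torchlean_cuda_buffer_to_float_array"]
    opaque Buffer.toFloatArray (b : Buffer) : FloatArray

    -- The round trip goes through float32 on the device, so precision beyond
    -- single precision is lost on the way back.
    def roundTrip (xs : FloatArray) : FloatArray :=
      Buffer.toFloatArray (Buffer.ofFloatArray xs)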

@[extern torchlean_cuda_buffer_size]

Number of float32 elements in the buffer.

@[extern torchlean_cuda_buffer_release]

Release the device allocation held by a buffer, returning 1 when a live allocation was released.

This is a runtime pressure valve for eager training loops that create many short-lived CUDA buffers. The C finalizer is still safe after an explicit release because the pointer is nulled out.

@[extern torchlean_cuda_buffer_release_then]

Release the scratch buffer, then return the keep buffer.

This exists for pure CUDA tape code: because the returned buffer is used downstream, Lean cannot erase the native release call as dead code.
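
A sketch of the idiom, continuing the illustrative Buffer stubs from the FloatArray sketch above; the Lean names and signatures (including argument order) are assumptions, while the extern symbols are the ones documented on this page.

    -- Hypothetical bindings over the documented extern symbols (assumptions).
    @[extern "torchlean_cuda_buffer_mul"]
    opaque Buffer.mul (a b : Buffer) : Buffer
    @[extern "torchlean_cuda_buffer_add"]
    opaque Buffer.add (a b : Buffer) : Buffer
    @[extern "torchlean_cuda_buffer_release_then"]
    opaque Buffer.releaseThen (scratch keep : Buffer) : Buffer

    -- Compute a*b + a and free the a*b intermediate as soon as the sum exists.
    -- The result of releaseThen is what the caller consumes, so the release
    -- call has a downstream use and survives dead-code elimination.
    def mulAddFree (a b : Buffer) : Buffer :=
      let scratch := Buffer.mul a b
      Buffer.releaseThen scratch (Buffer.add scratch a)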

@[extern torchlean_runtime_collect_allocator]

Ask the Lean runtime allocator (mimalloc) to collect abandoned/free pages.

This does not change any TorchLean value. It is a pressure valve for long native eager loops where many short-lived tape closures and external-buffer wrappers are created every step.

Collect the native allocator's free pages.

@[extern torchlean_cuda_buffer_zeros]

Allocate a length-n buffer filled with zeros.

@[extern torchlean_cuda_buffer_full]

Allocate a length-n buffer filled with v (host Float, cast to float32).

Deterministic RNG (device-side) #

These are low-level building blocks used by TorchLean's seeded RNG ops (rand_uniform, bernoulli_mask) when running on the eager CUDA backend.

They intentionally use the same SplitMix64-style mixing as TorchLean.Random so results are deterministic given (seed, counter) and a row-major linear index.
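
For reference, a SplitMix64-style mixing step is the textbook advance-and-finalize shown below; this only illustrates the style of mixing, and the exact way the kernels key (seed, counter) and the row-major linear index into the state is not reproduced here.

    -- Textbook SplitMix64 advance-and-mix on a 64-bit value (UInt64 arithmetic wraps).
    def splitMix64Mix (z : UInt64) : UInt64 :=
      let z := z + 0x9E3779B97F4A7C15
      let z := (z ^^^ (z >>> 30)) * 0xBF58476D1CE4E5B9
      let z := (z ^^^ (z >>> 27)) * 0x94D049BB133111EB
      z ^^^ (z >>> 31)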

@[extern torchlean_cuda_buffer_rand_uniform]

Deterministic U[0,1) generator: returns a length-n buffer (float32) keyed by key.

@[extern torchlean_cuda_buffer_bernoulli_mask]

Deterministic {0,1} mask generator: returns a length-n buffer keyed by key.

@[extern torchlean_cuda_buffer_abs]

Unary ops. This one is elementwise absolute value.

@[extern torchlean_cuda_buffer_abs_bwd]

Backward for abs: dx = sign(x) * dLdy (with sign(0)=0).
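
Expressed as a per-element host-side reference, the stated convention is:

    -- Reference semantics for the abs backward pass described above.
    def absBwdRef (x dLdy : Float) : Float :=
      if x > 0 then dLdy
      else if x < 0 then -dLdy
      else 0  -- sign(0) = 0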

@[extern torchlean_cuda_buffer_sqrt]
@[extern torchlean_cuda_buffer_sqrt_bwd]

Backward for sqrt.

Uses the TorchLean convention: dx = dLdy * (1 / (2*sqrt(x))) for x > 0, else 0.
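
As a per-element reference, this reads:

    -- Reference semantics for the sqrt backward pass described above.
    def sqrtBwdRef (x dLdy : Float) : Float :=
      if x > 0 then dLdy * (1 / (2 * Float.sqrt x)) else 0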

@[extern torchlean_cuda_buffer_exp]
@[extern torchlean_cuda_buffer_log]
@[extern torchlean_cuda_buffer_inv]

Reciprocal: 1/x.

@[extern torchlean_cuda_buffer_clamp]

Clamp each element to [lo, hi] (bounds are host Floats).

@[extern torchlean_cuda_buffer_clamp_bwd]

Backward for clamp.

Uses the TorchLean convention: derivative is 1 strictly inside (lo, hi), else 0.
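
The per-element reference for this convention:

    -- Reference semantics for the clamp backward pass described above:
    -- the gradient passes only strictly inside (lo, hi).
    def clampBwdRef (lo hi x dLdy : Float) : Float :=
      if lo < x ∧ x < hi then dLdy else 0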

@[extern torchlean_cuda_buffer_max]

Binary elementwise ops (sizes must match). This one is elementwise maximum.

@[extern torchlean_cuda_buffer_max_bwd]

Backward for max, returning (dA, dB).

Tie-breaking follows the spec: when a = b, split upstream gradient evenly (0.5) between both.
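
As a per-element reference (the min backward below follows the same convention):

    -- Reference semantics for the max backward pass: ties split the upstream
    -- gradient evenly between both inputs.
    def maxBwdRef (a b dLdy : Float) : Float × Float :=
      if a > b then (dLdy, 0)
      else if a < b then (0, dLdy)
      else (0.5 * dLdy, 0.5 * dLdy)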

@[extern torchlean_cuda_buffer_min]
@[extern torchlean_cuda_buffer_min_bwd]

Backward for min, returning (dA, dB).

Tie-breaking follows the spec: when a = b, split upstream gradient evenly (0.5) between both.

@[extern torchlean_cuda_buffer_div]

Elementwise division.

@[extern torchlean_cuda_buffer_relu]

Elementwise ReLU.

@[extern torchlean_cuda_buffer_relu_bwd]

Backward for relu: dx = dLdy where x > 0, else 0.

@[extern torchlean_cuda_buffer_add]

Elementwise addition (sizes must match).

@[extern torchlean_cuda_buffer_sub]

Elementwise subtraction (sizes must match).

@[extern torchlean_cuda_buffer_mul]

Elementwise multiplication (sizes must match).

@[extern torchlean_cuda_buffer_scale]

Multiply each element by a scalar c (host Float, cast to float32).

This is a primitive building block for many ops (e.g. scaling gradients).

Device-to-device copy, implemented as a scale-by-one kernel.

@[extern torchlean_cuda_buffer_axpy]

Fused multiply-add: a + c * b (sizes must match; c is a host Float, cast to float32).

This is the classic BLAS-style axpy primitive and is useful for optimizers and bias-like updates.
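
A sketch of the optimizer use case, again with an assumed Lean-side name and argument order over the documented extern symbol, and reusing the illustrative Buffer type from the earlier sketches: a plain SGD update w + (-lr) * g is a single axpy.

    -- Hypothetical binding; computes a + c * b with the argument order assumed here.
    @[extern "torchlean_cuda_buffer_axpy"]
    opaque Buffer.axpy (a : Buffer) (c : Float) (b : Buffer) : Buffer

    -- One SGD step: w ← w + (-lr) * g.
    def sgdStep (lr : Float) (w g : Buffer) : Buffer :=
      Buffer.axpy w (-lr) g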

@[extern torchlean_cuda_buffer_reduce_sum]

Reductions (return a length-1 buffer). This one is the sum of all elements.

@[extern torchlean_cuda_buffer_reduce_mean]