TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Ops.Core

CUDA Tape Operations: Shared Helpers

Small helpers

Checked Nat → UInt32 conversion for CUDA boundaries. Errors if n ≥ 2^32.
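The check can be sketched in plain Python as a reference model. This is not TorchLean's Lean API: the name `nat_to_uint32_checked` and the choice of exception types are assumptions made for illustration.

```python
UINT32_LIMIT = 2**32  # smallest natural number that does NOT fit in UInt32


def nat_to_uint32_checked(n: int) -> int:
    """Return n if it is representable as an unsigned 32-bit value, else raise."""
    if n < 0:
        raise ValueError(f"not a natural number: {n}")
    if n >= UINT32_LIMIT:
        # Silently wrapping here would hand the kernel a wrong size.
        raise OverflowError(f"{n} does not fit in UInt32")
    return n
```

The largest accepted value is 2^32 - 1; 2^32 itself is rejected rather than wrapped to 0.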

Number of elements in a runtime shape, checked for the UInt32 CUDA ABI.
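A rough Python model of a checked element count, assuming the count is the product of the dimensions and that the 32-bit check applies to that product. The name `checked_numel` is hypothetical.

```python
from math import prod


def checked_numel(shape: tuple) -> int:
    """Product of all dimensions, rejected if it needs more than 32 bits."""
    n = prod(shape)  # exact, since Python integers do not overflow
    if n >= 2**32:
        raise OverflowError(f"shape {shape} has {n} elements, too many for UInt32")
    return n
```

Checking the product matters because every dimension can fit in 32 bits while the total does not: (65536, 65536) has exactly 2^32 elements.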

Fold all leading axes into a row count and keep the last axis as the column count.

The CUDA softmax and log-softmax kernels are 2D row kernels. This helper provides the shared convention for vectors, matrices, and higher-rank tensors: softmax is always taken along the last axis.
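The folding convention can be illustrated with a small Python sketch. The function name `fold_to_rows_cols` is hypothetical, and rank-0 shapes are rejected here only because the text above does not say how they are handled.

```python
from math import prod


def fold_to_rows_cols(shape: tuple) -> tuple:
    """Collapse every axis except the last into a single row count."""
    if len(shape) == 0:
        raise ValueError("rank-0 shapes are not covered by this sketch")
    *leading, cols = shape
    rows = prod(leading)  # prod([]) == 1, so a vector becomes a single row
    return rows, cols
```

Under this convention a vector of length 5 is treated as 1 row by 5 columns, and a (2, 3, 4) tensor as 6 rows by 4 columns.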

Broadcast a scalar CUDA buffer to outShape. Used by scalar reductions during backprop.
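To see why a scalar reduction needs this during backprop, consider y = sum(x): every input element contributes with derivative 1, so each one receives the upstream scalar gradient unchanged. The sketch below models buffers as flat Python lists; `broadcast_scalar` and `sum_backward` are hypothetical names, not the library's.

```python
from math import prod


def broadcast_scalar(value: float, out_shape: tuple) -> list:
    """Flat row-major buffer of the given shape with every entry equal to value."""
    return [value] * prod(out_shape)


def sum_backward(d_out: float, in_shape: tuple) -> list:
    """Gradient of y = sum(x): each input element receives the upstream scalar."""
    return broadcast_scalar(d_out, in_shape)
```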

Logistic sigmoid implemented from primitive CUDA elementwise ops.
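One way to compose a sigmoid from elementwise primitives is 1 / (1 + exp(-x)). The sketch below assumes four primitives (negate, exp, add-scalar, reciprocal) over flat buffers; which primitives the library actually chains is not stated above, and all `ew_*` names are hypothetical.

```python
import math


# Hypothetical elementwise primitives over flat buffers.
def ew_neg(xs):
    return [-v for v in xs]


def ew_exp(xs):
    return [math.exp(v) for v in xs]


def ew_add_scalar(xs, c):
    return [v + c for v in xs]


def ew_recip(xs):
    return [1.0 / v for v in xs]


def sigmoid(xs):
    """sigmoid(x) = 1 / (1 + exp(-x)), built only from the primitives above."""
    return ew_recip(ew_add_scalar(ew_exp(ew_neg(xs)), 1.0))
```

Python's `math.exp` raises `OverflowError` for very large arguments, so this sketch is meant for moderate inputs only.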

Hyperbolic tangent implemented as (exp(2x)-1)/(exp(2x)+1).
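The identity can be checked numerically against a library tanh. This is a scalar Python sketch with a hypothetical name, `tanh_via_exp`; it evaluates the formula literally and is only exercised on moderate inputs, since exp(2x) overflows in floating point for large positive x.

```python
import math


def tanh_via_exp(x: float) -> float:
    """tanh(x) written as (exp(2x) - 1) / (exp(2x) + 1)."""
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)
```

For inputs such as -3, -0.5, 0, 0.5, and 3 the result agrees with `math.tanh` to within rounding error.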

Numerically stable softplus: max(x,0) + log(1 + exp(-abs(x))).
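The stable form works because exp is only ever applied to a non-positive argument, so it cannot overflow. A scalar Python sketch (the name `softplus_stable` is hypothetical) makes this concrete:

```python
import math


def softplus_stable(x: float) -> float:
    """softplus(x) = log(1 + exp(x)), rewritten so exp never sees a positive input."""
    return max(x, 0.0) + math.log(1.0 + math.exp(-abs(x)))
```

For x = 1000 the naive log(1 + exp(x)) overflows, while the stable form evaluates exp(-1000), which underflows harmlessly to 0, and returns 1000.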

Row-wise stable softmax.

The returned WithWorkspace records the buffers used to compute the stable formula. The tape keeps those buffers only as long as the node may need them for backprop, then releases them explicitly.
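A reference model of the stable formula in plain Python, operating on a flat row-major buffer. The function name `softmax_rows` and the workspace contents (per-row max and per-row sum) are assumptions of this sketch; the text above does not list which buffers WithWorkspace actually holds.

```python
import math


def softmax_rows(x: list, rows: int, cols: int):
    """Stable softmax over each row of a flat row-major buffer.

    Returns (y, workspace). The workspace layout is an assumption of this sketch.
    """
    y = []
    workspace = {"row_max": [], "row_sum": []}
    for r in range(rows):
        row = x[r * cols:(r + 1) * cols]
        m = max(row)                           # subtracting the max keeps exp() <= 1
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)                          # s >= 1 because exp(m - m) == 1
        y.extend(e / s for e in exps)
        workspace["row_max"].append(m)
        workspace["row_sum"].append(s)
    return y, workspace
```

Python reclaims the intermediate lists automatically; the tape described above instead releases its device buffers explicitly once backprop no longer needs them.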

Row-wise softmax VJP: dX = y * (dY - sum(dY*y, axis=1)), where y is the softmax output and the per-row sum is broadcast back across the columns.
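The formula can be written out row by row in plain Python. This is a sketch over flat row-major lists with a hypothetical name, `softmax_vjp_rows`, not the CUDA kernel itself.

```python
def softmax_vjp_rows(y: list, dy: list, rows: int, cols: int) -> list:
    """dX = y * (dY - sum(dY * y, axis=1)), evaluated one row at a time."""
    dx = []
    for r in range(rows):
        y_row = y[r * cols:(r + 1) * cols]
        dy_row = dy[r * cols:(r + 1) * cols]
        dot = sum(g * p for g, p in zip(dy_row, y_row))   # sum(dY * y, axis=1)
        dx.extend(p * (g - dot) for g, p in zip(dy_row, y_row))
    return dx
```

A useful sanity check: if dY is constant along a row, the result is zero, because the softmax outputs sum to 1 and a uniform shift of the logits does not change them.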

Row-wise stable log-softmax.

This computes x - rowMax - log(sum(exp(x-rowMax))) directly, avoiding the less stable log(softmax(x)) route. As with softmax, the returned workspace buffers belong to the tape node until the backward pass has finished.
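The direct formula can be modeled in a few lines of Python. The function name `log_softmax_rows` is hypothetical, and the sketch returns only the output, leaving out the workspace.

```python
import math


def log_softmax_rows(x: list, rows: int, cols: int) -> list:
    """x - rowMax - log(sum(exp(x - rowMax))) over each row of a flat buffer."""
    y = []
    for r in range(rows):
        row = x[r * cols:(r + 1) * cols]
        m = max(row)
        log_sum = math.log(sum(math.exp(v - m) for v in row))
        y.extend((v - m) - log_sum for v in row)
    return y
```

For the row [0, -1000] this returns [0, -1000]. The log(softmax(x)) route would first underflow the second probability to 0 and then take log(0).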

Row-wise log-softmax VJP: dX = dY - exp(y) * sum(dY, axis=1), where y is the log-softmax output and the per-row sum is broadcast back across the columns.
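A row-by-row Python sketch of this formula, again over flat row-major lists. The name `log_softmax_vjp_rows` is hypothetical.

```python
import math


def log_softmax_vjp_rows(y: list, dy: list, rows: int, cols: int) -> list:
    """dX = dY - exp(y) * sum(dY, axis=1), where y holds log-softmax outputs."""
    dx = []
    for r in range(rows):
        y_row = y[r * cols:(r + 1) * cols]
        dy_row = dy[r * cols:(r + 1) * cols]
        total = sum(dy_row)                    # sum(dY, axis=1)
        dx.extend(g - math.exp(lp) * total for g, lp in zip(dy_row, y_row))
    return dx
```

Since exp(y) sums to 1 along each row, the entries of dX in a row sum to zero, which gives a quick check on any implementation.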
