CUDA Tape Operations: Shared Helpers

Small helpers

def Runtime.Autograd.Cuda.Tape.u32 (n : ℕ) : Result UInt32

Checked Nat → UInt32 conversion for CUDA boundaries. Errors if n ≥ 2^32.
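A minimal sketch of a checked conversion of this kind, written against Lean 4 core only. `Except String` stands in for the library's `Result`, and the name `u32Sketch` and the error message are hypothetical; the actual definition may differ.

```lean
/-- Convert `n` to `UInt32`, failing instead of wrapping when `n ≥ 2^32`.
    `Except String` is an assumed stand-in for the library's `Result`. -/
def u32Sketch (n : Nat) : Except String UInt32 :=
  if n < 2 ^ 32 then
    Except.ok n.toUInt32
  else
    Except.error s!"u32: {n} does not fit in UInt32"

#eval u32Sketch 1024        -- in range, so the conversion succeeds
#eval u32Sketch (2 ^ 32)    -- out of range, so an error is reported
```

The point of the check is that `Nat.toUInt32` on its own wraps modulo 2^32, which would silently pass a wrong size across the CUDA boundary.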

def Runtime.Autograd.Cuda.Tape.numelU32 (s : Spec.Shape) : Result UInt32

Number of elements in a runtime shape, checked for the UInt32 CUDA ABI.
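As an illustration, the element count can be computed in `Nat` and range-checked once at the end. This sketch assumes a shape can be modeled as a `List Nat` of dimensions and uses `Except String` in place of `Result`; neither is confirmed by this page.

```lean
/-- Product of all dimensions; the empty shape (a scalar) has one element.
    Modeling a shape as `List Nat` is an assumption made for this sketch. -/
def numelSketch (dims : List Nat) : Nat :=
  dims.foldl (fun acc d => acc * d) 1

/-- Element count checked against the 32-bit limit. -/
def numelU32Sketch (dims : List Nat) : Except String UInt32 :=
  let n := numelSketch dims
  if n < 2 ^ 32 then Except.ok n.toUInt32
  else Except.error s!"numel {n} exceeds the UInt32 range"

#eval numelU32Sketch [2, 3, 4]        -- 24 elements, accepted
#eval numelU32Sketch [65536, 65536]   -- 2^32 elements, rejected
```

Multiplying in arbitrary-precision `Nat` before checking avoids an intermediate 32-bit overflow that could otherwise hide an oversized tensor.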

def Runtime.Autograd.Cuda.Tape.foldRowsColsLastAxis (s : Spec.Shape) : Result (UInt32 × UInt32)

Fold all leading axes into a row count and keep the last axis as the column count.

CUDA softmax/log-softmax kernels are 2D row kernels. This helper gives the shared convention used for vectors, matrices, and higher-rank tensors: softmax is always along the last axis.
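The folding convention can be sketched as follows, again modeling a shape as a `List Nat` (an assumption) and leaving out the UInt32 range check for brevity. Treating a rank-0 shape as 1×1 is also an assumption of this sketch, not something this page states.

```lean
/-- Split a shape into (rows, cols): the product of the leading axes
    and the size of the last axis. -/
def foldRowsColsSketch (dims : List Nat) : Nat × Nat :=
  match dims.reverse with
  | [] => (1, 1)   -- scalar case: assumed to behave like a 1×1 matrix
  | c :: leading => (leading.foldl (fun acc d => acc * d) 1, c)

#eval foldRowsColsSketch [5]         -- (1, 5): a vector is one row
#eval foldRowsColsSketch [4, 5]      -- (4, 5): a matrix is unchanged
#eval foldRowsColsSketch [2, 3, 5]   -- (6, 5): leading axes are folded
```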

def Runtime.Autograd.Cuda.Tape.broadcastScalarToShape (g : Buffer) (outShape : Spec.Shape) : Buffer

Broadcast a scalar CUDA buffer to outShape. Used in the backward pass of scalar reductions, where the scalar upstream gradient must be expanded back to the shape of the reduced input.

def Runtime.Autograd.Cuda.Tape.sigmoidBuf (x : Buffer) (n : UInt32) : Buffer

Logistic sigmoid implemented from primitive CUDA elementwise ops.
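One way to build the sigmoid 1 / (1 + exp(-x)) from elementwise steps is sketched below on a host-side `Array Float`. The particular sequence of steps (negate, exp, add one, reciprocal) is an assumption; this page does not say which primitive ops the CUDA helper chains together.

```lean
/-- Sigmoid as a chain of elementwise passes over the whole array.
    `Array Float` is a host-side stand-in for a CUDA `Buffer`. -/
def sigmoidSketch (xs : Array Float) : Array Float :=
  let neg   := xs.map (fun x => -x)            -- elementwise negation
  let e     := neg.map Float.exp               -- elementwise exp
  let denom := e.map (fun v => 1.0 + v)        -- add the scalar 1
  denom.map (fun v => 1.0 / v)                 -- elementwise reciprocal

#eval sigmoidSketch #[-2.0, 0.0, 2.0]   -- the middle entry is sigmoid(0) = 0.5
```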

def Runtime.Autograd.Cuda.Tape.tanhBuf (x : Buffer) (n : UInt32) : Buffer

Hyperbolic tangent implemented as (exp(2x)-1)/(exp(2x)+1).
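The formula can be checked on the host against Lean's built-in `Float.tanh`. This is a scalar sketch of the identity only; the name `tanhSketch` is hypothetical and nothing here reflects how the CUDA buffer version is scheduled.

```lean
/-- tanh written as (exp(2x) - 1) / (exp(2x) + 1). -/
def tanhSketch (x : Float) : Float :=
  let e := Float.exp (2.0 * x)
  (e - 1.0) / (e + 1.0)

#eval tanhSketch 0.5    -- should agree closely with the next line
#eval Float.tanh 0.5
#eval tanhSketch 0.0    -- exp(0) = 1, so the numerator is 0
```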

def Runtime.Autograd.Cuda.Tape.softplusBuf (x : Buffer) (n : UInt32) : Buffer

Numerically stable softplus: max(x,0) + log(1 + exp(-abs(x))).
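To see why the stable form matters, the scalar sketch below compares it with the direct definition log(1 + exp(x)). The function names are hypothetical and the code runs on host `Float` values rather than CUDA buffers.

```lean
/-- Direct definition: overflows once `exp x` exceeds the Float range. -/
def softplusNaive (x : Float) : Float :=
  Float.log (1.0 + Float.exp x)

/-- Stable form: max(x, 0) + log(1 + exp(-|x|)).
    The argument of `exp` is never positive, so it cannot overflow. -/
def softplusStable (x : Float) : Float :=
  let relu := if 0.0 < x then x else 0.0
  relu + Float.log (1.0 + Float.exp (-(Float.abs x)))

#eval softplusNaive 1000.0    -- exp overflows, so the result is infinite
#eval softplusStable 1000.0   -- stays finite, approximately 1000
#eval softplusStable 0.0      -- log 2
```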

def Runtime.Autograd.Cuda.Tape.rowSoftmaxForward (x : Buffer) (rows cols : UInt32) : Buffer.WithWorkspace

Row-wise stable softmax.

The returned WithWorkspace records the buffers used to compute the stable formula. The tape keeps those buffers only as long as the node may need them for backprop, then releases them explicitly.
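A host-side sketch of the stable row-wise formula: subtract each row's maximum before exponentiating, then normalize. It assumes a row-major `Array Float` in place of a CUDA `Buffer` and does not model the workspace buffers at all; all names are hypothetical.

```lean
/-- Stable softmax of one row: exp(x - max) / sum(exp(x - max)). -/
def softmaxRow (row : Array Float) : Array Float :=
  let m : Float :=
    row.foldl (fun acc v => if acc < v then v else acc) (row.getD 0 0.0)
  let e := row.map (fun v => Float.exp (v - m))
  let s : Float := e.foldl (fun acc v => acc + v) 0.0
  e.map (fun v => v / s)

/-- Apply `softmaxRow` to each row of a row-major `rows × cols` array. -/
def rowSoftmaxSketch (x : Array Float) (rows cols : Nat) : Array Float :=
  (List.range rows).foldl
    (fun out r => out ++ softmaxRow (x.extract (r * cols) ((r + 1) * cols)))
    #[]

-- Two rows of two columns; the first row would overflow without the shift.
#eval rowSoftmaxSketch #[1000.0, 1000.0, 0.0, 0.0] 2 2   -- each row is [0.5, 0.5]
```

The intermediate values `m`, `e`, and `s` are the kind of temporaries a workspace might hold, though which buffers the real `WithWorkspace` records is not stated here.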

def Runtime.Autograd.Cuda.Tape.rowSoftmaxBwd (y dLdy : Buffer) (rows cols : UInt32) : Buffer

Row-wise softmax VJP: dX = y * (dY - sum(dY*y, axis=1)).
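For a single row, the VJP formula can be sketched as below, where `y` is the softmax output and `dY` the upstream gradient. `Array Float` again stands in for a CUDA `Buffer`, and the function name is hypothetical.

```lean
/-- Softmax VJP for one row: dX = y * (dY - sum(dY * y)). -/
def softmaxRowVjp (y dY : Array Float) : Array Float :=
  let pairs := y.zip dY
  -- Row-wise dot product of the upstream gradient with the output.
  let dot : Float := pairs.foldl (fun acc (yi, gi) => acc + yi * gi) 0.0
  pairs.map (fun (yi, gi) => yi * (gi - dot))

-- Softmax outputs sum to 1, so a constant upstream gradient gives zeros.
#eval softmaxRowVjp #[0.5, 0.5] #[1.0, 1.0]
#eval softmaxRowVjp #[0.5, 0.5] #[1.0, 0.0]   -- [0.25, -0.25]
```

Only the output `y` is needed, not the input `x`, which is why the backward helper takes `y` and `dLdy` alone.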

def Runtime.Autograd.Cuda.Tape.rowLogSoftmaxForward (x : Buffer) (rows cols : UInt32) : Buffer.WithWorkspace

Row-wise stable log-softmax.

This computes x - rowMax - log(sum(exp(x-rowMax))) directly, avoiding the less stable log(softmax(x)) route. As with softmax, the returned workspace buffers belong to the tape node until the backward pass has finished.
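The difference between the two routes shows up as soon as one probability underflows to zero. The single-row sketch below, on host `Array Float` values with hypothetical names, contrasts the direct formula with log(softmax(x)).

```lean
/-- Direct route: x - rowMax - log(sum(exp(x - rowMax))). -/
def logSoftmaxRow (row : Array Float) : Array Float :=
  let m : Float :=
    row.foldl (fun acc v => if acc < v then v else acc) (row.getD 0 0.0)
  let s : Float := row.foldl (fun acc v => acc + Float.exp (v - m)) 0.0
  let logZ := Float.log s
  row.map (fun v => v - m - logZ)

/-- Less stable route: take the log of the normalized exponentials. -/
def naiveLogSoftmaxRow (row : Array Float) : Array Float :=
  let e := row.map Float.exp
  let s : Float := e.foldl (fun acc v => acc + v) 0.0
  e.map (fun v => Float.log (v / s))

#eval logSoftmaxRow #[0.0, -1000.0]        -- second entry stays finite (-1000)
#eval naiveLogSoftmaxRow #[0.0, -1000.0]   -- second entry is log 0, i.e. -infinity
```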

def Runtime.Autograd.Cuda.Tape.rowLogSoftmaxBwd (y dLdy : Buffer) (rows cols : UInt32) : Buffer

Row-wise log-softmax VJP: dX = dY - exp(y) * sum(dY, axis=1).
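A single-row sketch of this VJP, where `y` is the log-softmax output (so `exp y` recovers the softmax probabilities) and `dY` is the upstream gradient. As in the other sketches, `Array Float` is a host-side stand-in for a CUDA `Buffer` and the name is hypothetical.

```lean
/-- Log-softmax VJP for one row: dX = dY - exp(y) * sum(dY). -/
def logSoftmaxRowVjp (y dY : Array Float) : Array Float :=
  let total : Float := dY.foldl (fun acc v => acc + v) 0.0
  (y.zip dY).map (fun (yi, gi) => gi - Float.exp yi * total)

-- y = log [0.5, 0.5]; picking out the first entry as the loss gives
-- a gradient of approximately [0.5, -0.5].
#eval logSoftmaxRowVjp #[Float.log 0.5, Float.log 0.5] #[1.0, 0.0]
```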

TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Ops.Core