Small helpers #
Number of elements in a runtime shape, checked to fit the UInt32 CUDA ABI.
Fold all leading axes into a row count and keep the last axis as the column count.
CUDA softmax/log-softmax kernels are 2D row kernels. This helper gives the shared convention used for vectors, matrices, and higher-rank tensors: softmax is always along the last axis.
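For intuition, here is a minimal NumPy sketch of the same folding convention (the helper name is illustrative, not this library's API):

```python
import numpy as np

def fold_to_rows(shape):
    # Collapse all leading axes into one row count; keep the last axis as columns.
    *lead, cols = shape
    rows = int(np.prod(lead)) if lead else 1
    return rows, cols

fold_to_rows((4, 3, 8))  # (12, 8): softmax runs over the trailing axis of length 8
fold_to_rows((8,))       # (1, 8): a vector is a single row
```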
Broadcast a scalar CUDA buffer to outShape. Used by scalar reductions during backprop.
Logistic sigmoid implemented from primitive CUDA elementwise ops.
Hyperbolic tangent implemented as (exp(2x)-1)/(exp(2x)+1).
Numerically stable softplus: max(x,0) + log(1 + exp(-abs(x))).
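As a numerical reference (plain NumPy, not the CUDA kernels), the three formulas above are:

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid built from primitive elementwise ops
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_exp(x):
    # tanh as (exp(2x) - 1) / (exp(2x) + 1); note exp(2x) can overflow
    # for large x, which is inherent to this formulation
    e2x = np.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

def softplus_stable(x):
    # max(x, 0) + log(1 + exp(-|x|)) never exponentiates a large positive value
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
```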
Row-wise stable softmax plus scratch buffers that should be released after the backward pass.
The output is the first component. The scratch list records the temporary buffers created while evaluating the stable formula, so that long training runs do not have to wait on native finalizers to free them.
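A NumPy sketch of the stable formula (scratch-buffer bookkeeping omitted; this shows the math, not the CUDA implementation):

```python
import numpy as np

def softmax_rows(x):
    # shift each row by its max before exponentiating, so exp never overflows
    m = x.max(axis=1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=1, keepdims=True)
```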
Row-wise softmax VJP: dX = y * (dY - sum(dY*y, axis=1)).
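The same VJP in NumPy, for reference:

```python
import numpy as np

def softmax_vjp(y, dy):
    # dX = y * (dY - sum(dY * y, axis=1)); the row sum is broadcast back
    s = (dy * y).sum(axis=1, keepdims=True)
    return y * (dy - s)
```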
Row-wise stable log-softmax plus scratch buffers that should be released after backprop.
This computes x - rowMax - log(sum(exp(x - rowMax))) directly, avoiding the less stable log(softmax(x)) route.
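A NumPy sketch of the direct formula (again, a reference for the math only):

```python
import numpy as np

def log_softmax_rows(x):
    # x - rowMax - log(sum(exp(x - rowMax))), computed directly per row
    m = x.max(axis=1, keepdims=True)
    z = x - m
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))
```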
Row-wise log-softmax VJP: dX = dY - exp(y) * sum(dY, axis=1).
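And its VJP in NumPy:

```python
import numpy as np

def log_softmax_vjp(y, dy):
    # dX = dY - exp(y) * sum(dY, axis=1); exp(y) recovers the softmax
    return dy - np.exp(y) * dy.sum(axis=1, keepdims=True)
```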
Elementwise ops #
Elementwise addition.
Elementwise subtraction.
Elementwise multiplication.
Multiply by a scalar constant.
Elementwise abs.
Elementwise sqrt.
Clamp each element to [lo, hi].
Elementwise max.
Elementwise min.
Elementwise division.
Elementwise ReLU.
Elementwise exp.
Elementwise log.
Elementwise reciprocal 1/x.
Elementwise "safe log" that protects against log(0) by adding a small ε internally.
Spec semantics: log(softplus(x) + ε).
Elementwise sigmoid (logistic).
Elementwise tanh.
Elementwise softplus.
Reductions / views #
Reduce-sum of all entries, producing a scalar.
Flatten s into a 1D vector of length Shape.size s.
Swap adjacent axes at a given depth in an N-D buffer.
If depth is out of range, this is treated as the identity (matches the spec-layer helper).
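The intended semantics, sketched in NumPy (the identity fallback for an out-of-range depth mirrors the spec-layer helper):

```python
import numpy as np

def swap_adjacent_axes(x, depth):
    # swap axes (depth, depth + 1); an out-of-range depth acts as the identity
    if depth + 1 >= x.ndim:
        return x
    return np.swapaxes(x, depth, depth + 1)
```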
Broadcast x : s₁ to s₂.
Forward: broadcastTo.
Backward: sum-reduce broadcasted axes (reduceFromBroadcastTo).
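A NumPy sketch of the forward/backward pair (the function names here are illustrative; reduceFromBroadcastTo is the library's actual backward primitive):

```python
import numpy as np

def broadcast_forward(x, s2):
    return np.broadcast_to(x, s2)

def broadcast_backward(dy, s1):
    # sum-reduce every axis the forward broadcast expanded,
    # so the gradient lands back on shape s1
    dx = dy
    while dx.ndim > len(s1):       # collapse extra leading axes
        dx = dx.sum(axis=0)
    for ax, n in enumerate(s1):    # collapse axes that were size 1 in s1
        if n == 1 and dx.shape[ax] != 1:
            dx = dx.sum(axis=ax, keepdims=True)
    return dx
```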
Reduce-sum along axis.
Reduce-mean along axis.
Linear algebra #
Fused real-FFT spectral convolution used by the CUDA FNO1D path.
Shapes:
- x : (grid, width)
- wRe, wIm : (modes, width, width)
- output y : (grid, width)
The low-level buffer primitive owns the numerical contract and VJP:
rfft(x) is unnormalized, the inverse is normalized, and the backward kernels include the
half-spectrum adjoint factors for real FFTs. This tape node simply records those three parent
dependencies and checks the runtime shapes before calling the native kernels.
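For the forward contract only, a NumPy sketch (np.fft matches the stated normalization: unnormalized rfft, normalized inverse; the VJP machinery is not reproduced here):

```python
import numpy as np

def spectral_conv1d(x, w_re, w_im):
    # x: (grid, width); w_re, w_im: (modes, width, width); y: (grid, width)
    grid, _ = x.shape
    modes = w_re.shape[0]                    # requires modes <= grid // 2 + 1
    xf = np.fft.rfft(x, axis=0)              # (grid//2 + 1, width), complex
    yf = np.zeros_like(xf)
    # mix channels per retained frequency; higher modes stay zero (truncation)
    yf[:modes] = np.einsum('mi,mio->mo', xf[:modes], w_re + 1j * w_im)
    return np.fft.irfft(yf, n=grid, axis=0)  # back to (grid, width), real
```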
Linear layer / losses #
Mean-squared-error loss with "mean" reduction (single scalar output).
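The loss and its gradient in NumPy, for reference:

```python
import numpy as np

def mse_mean(pred, target):
    # "mean" reduction: one scalar over all elements
    return ((pred - target) ** 2).mean()

def mse_mean_grad(pred, target):
    # d(loss)/d(pred) = 2 * (pred - target) / N
    return 2.0 * (pred - target) / pred.size
```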
Concat / slice (1D) #
Concat / slice along dim 0 #
Concatenate along dim 0 for tensors with a leading dimension (same name as the CPU tape).
Gather / scatter (host Nat indices) #
Indices are non-differentiable and remain on the host. Kernels totalize out-of-bounds indices as
documented in NN.Runtime.Autograd.Engine.Cuda.Kernels.
Gather k scalars from a length-n vector.
Gather k rows from a (rows, cols) matrix (row-major).
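Shape-wise, the two gathers look like this in NumPy. The clamp below is only a stand-in: the kernels' actual out-of-bounds totalization is the one documented in NN.Runtime.Autograd.Engine.Cuda.Kernels.

```python
import numpy as np

def gather_scalars(x, idx):
    # k scalars from a length-n vector; idx stays on the host
    idx = np.clip(np.asarray(idx), 0, x.shape[0] - 1)  # stand-in OOB policy
    return x[idx]

def gather_rows(x, idx):
    # k rows from a (rows, cols) row-major matrix
    idx = np.clip(np.asarray(idx), 0, x.shape[0] - 1)  # stand-in OOB policy
    return x[idx]
```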
Conv2D + pooling (ConvPool FFI) #
ConvTranspose2D (ConvPool FFI) #
Generic naming wrappers #
The CUDA tape exposes conv/max_pool/avg_pool/smooth_max_pool using the same names as the
CPU tape. These dispatch to the ConvPool CUDA FFI entrypoints that take per-axis parameters as
Array Nat (rank ≤ 8).
The *2d* wrappers remain as concise convenience names for the common rank-2 case.
Normalization #
LayerNorm over the last dimension for (seqLen, embedDim) buffers.
This implementation uses the standard stable formulas and is expressed in terms of existing CUDA kernels (axis reductions + broadcasts + pointwise ops).
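The standard formulas, as a NumPy reference (the eps value and affine parameters here are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (seqLen, embedDim); normalize each row over the last dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```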
BatchNorm for a single channel-first image (C,H,W) (no batch axis).
We normalize per-channel across the spatial dimension H*W, reusing the same math as layer-norm
by treating the buffer as a (channels, height*width) matrix.
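The reshape trick, sketched in NumPy (per-channel affine parameters assumed; eps is illustrative):

```python
import numpy as np

def batch_norm_chw(x, gamma, beta, eps=1e-5):
    # x: (C, H, W); normalize each channel over its H*W spatial positions
    c, h, w = x.shape
    flat = x.reshape(c, h * w)               # treat as a (channels, H*W) matrix
    mu = flat.mean(axis=1, keepdims=True)
    var = flat.var(axis=1, keepdims=True)
    out = (flat - mu) / np.sqrt(var + eps)
    return (gamma[:, None] * out + beta[:, None]).reshape(c, h, w)
```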
Softmax (last axis, row folding) #
We implement softmax along the last axis by folding all leading dimensions into one rows axis.
This covers:
- 2D softmax ((rows, cols)),
- 3D batched softmax ((batch, rows, cols)) by folding batch*rows into rows.
Stable log-softmax along the last axis, implemented directly on CUDA buffers.
Multi-head self-attention #
Forward structure matches Spec.MultiHeadAttention.forward:
- Q = x @ Wq, K = x @ Wk, V = x @ Wv
- reshape to heads (numHeads, n, headDim)
- attention per head (batched): softmax(Q Kᵀ / sqrt(headDim)) @ V
- combine heads, then output projection @ Wo
Masking:
- If mask is provided, we upload it as a float32 {0,1} matrix and apply the same semantics as the spec: masked logits are replaced by -1000.0 before softmax, and gradients through masked entries are zeroed.
- This incurs a host-to-device copy for the mask (since the mask is a host Tensor Bool).
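Putting the pieces together, a NumPy sketch of the forward pass (unbatched, single sequence; mask here is a boolean (n, n) array with True meaning "keep", which is an assumption about the {0,1} encoding):

```python
import numpy as np

def mha_forward(x, wq, wk, wv, wo, num_heads, mask=None):
    # x: (n, d); wq, wk, wv, wo: (d, d); requires d % num_heads == 0
    n, d = x.shape
    hd = d // num_heads
    def split(m):                                    # (n, d) -> (heads, n, hd)
        return m.reshape(n, num_heads, hd).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(hd)  # (heads, n, n)
    if mask is not None:
        # spec semantics: masked logits become -1000.0 before the softmax
        logits = np.where(mask, logits, -1000.0)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    heads = attn @ v                                 # (heads, n, hd)
    return heads.transpose(1, 0, 2).reshape(n, d) @ wo
```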