Small helpers #
Number of elements in a runtime shape, checked to fit the UInt32 CUDA ABI.
Fold all leading axes into a row count and keep the last axis as the column count.
CUDA softmax/log-softmax kernels are 2D row kernels. This helper gives the shared convention used for vectors, matrices, and higher-rank tensors: softmax is always along the last axis.
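For intuition, here is a minimal NumPy sketch of the same folding convention (the helper name is illustrative, not this library's API):

```python
import numpy as np

def fold_to_rows(shape):
    # Collapse all leading axes into one row count; keep the last axis as columns.
    *lead, cols = shape
    rows = int(np.prod(lead)) if lead else 1
    return rows, cols

fold_to_rows((4, 3, 8))  # (12, 8): softmax runs over the trailing axis of length 8
fold_to_rows((8,))       # (1, 8): a vector is a single row
```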
Broadcast a scalar CUDA buffer to outShape. Used by scalar reductions during backprop.
Logistic sigmoid implemented from primitive CUDA elementwise ops.
Hyperbolic tangent implemented as (exp(2x)-1)/(exp(2x)+1).
Numerically stable softplus: max(x,0) + log(1 + exp(-abs(x))).
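As a numerical reference (plain NumPy, not the CUDA kernels), the three formulas above are:

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid built from primitive elementwise ops
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_exp(x):
    # tanh as (exp(2x) - 1) / (exp(2x) + 1); note exp(2x) can overflow
    # for large x, which is inherent to this formulation
    e2x = np.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

def softplus_stable(x):
    # max(x, 0) + log(1 + exp(-|x|)) never exponentiates a large positive value
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
```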
Row-wise stable softmax plus scratch buffers that should be released after the backward pass.
The output is the first component. The scratch list records the temporary buffers created while evaluating the stable formula, so that long training runs do not have to wait on native finalizers to free them.
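A NumPy sketch of the stable formula (scratch-buffer bookkeeping omitted; this shows the math, not the CUDA implementation):

```python
import numpy as np

def softmax_rows(x):
    # shift each row by its max before exponentiating, so exp never overflows
    m = x.max(axis=1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=1, keepdims=True)
```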
Row-wise softmax VJP: dX = y * (dY - sum(dY*y, axis=1)).
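The same VJP in NumPy, for reference:

```python
import numpy as np

def softmax_vjp(y, dy):
    # dX = y * (dY - sum(dY * y, axis=1)); the row sum is broadcast back
    s = (dy * y).sum(axis=1, keepdims=True)
    return y * (dy - s)
```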
Row-wise stable log-softmax plus scratch buffers that should be released after backprop.
This computes x - rowMax - log(sum(exp(x - rowMax))) directly, avoiding the less stable log(softmax(x)) route.
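A NumPy sketch of the direct formula (again, a reference for the math only):

```python
import numpy as np

def log_softmax_rows(x):
    # x - rowMax - log(sum(exp(x - rowMax))), computed directly per row
    m = x.max(axis=1, keepdims=True)
    z = x - m
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))
```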
Row-wise log-softmax VJP: dX = dY - exp(y) * sum(dY, axis=1).
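And its VJP in NumPy:

```python
import numpy as np

def log_softmax_vjp(y, dy):
    # dX = dY - exp(y) * sum(dY, axis=1); exp(y) recovers the softmax
    return dy - np.exp(y) * dy.sum(axis=1, keepdims=True)
```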
Elementwise ops #
Elementwise addition.
Elementwise subtraction.
Elementwise multiplication.
Multiply by a scalar constant.
Elementwise abs.
Elementwise sqrt.
Clamp each element to [lo, hi].
Elementwise max.
Elementwise min.
Elementwise division.
Elementwise ReLU.
Elementwise exp.
Elementwise log.
Elementwise reciprocal 1/x.
Elementwise "safe log" that protects against log(0) by adding a small ε internally.
Spec semantics: log(softplus(x) + ε).
Elementwise sigmoid (logistic).
Elementwise tanh.
Elementwise softplus.
Reductions / views #
Reduce-sum of all entries, producing a scalar.
Flatten s into a 1D vector of length Shape.size s.
Swap adjacent axes at a given depth in an N-D buffer.
If depth is out of range, this is treated as the identity (matches the spec-layer helper).
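The intended semantics, sketched in NumPy (the identity fallback for an out-of-range depth mirrors the spec-layer helper):

```python
import numpy as np

def swap_adjacent_axes(x, depth):
    # swap axes (depth, depth + 1); an out-of-range depth acts as the identity
    if depth + 1 >= x.ndim:
        return x
    return np.swapaxes(x, depth, depth + 1)
```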
Broadcast x : s₁ to s₂.
Forward: broadcastTo.
Backward: sum-reduce broadcasted axes (reduceFromBroadcastTo).
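A NumPy sketch of the forward/backward pair (the function names here are illustrative; reduceFromBroadcastTo is the library's actual backward primitive):

```python
import numpy as np

def broadcast_forward(x, s2):
    return np.broadcast_to(x, s2)

def broadcast_backward(dy, s1):
    # sum-reduce every axis the forward broadcast expanded,
    # so the gradient lands back on shape s1
    dx = dy
    while dx.ndim > len(s1):       # collapse extra leading axes
        dx = dx.sum(axis=0)
    for ax, n in enumerate(s1):    # collapse axes that were size 1 in s1
        if n == 1 and dx.shape[ax] != 1:
            dx = dx.sum(axis=ax, keepdims=True)
    return dx
```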
Reduce-sum along axis.
Reduce-mean along axis.
Linear algebra #
Fused real-FFT spectral convolution used by the CUDA FNO1D path.
Shapes:
- x : (grid, width)
- wRe, wIm : (modes, width, width)
- output y : (grid, width)
The low-level buffer primitive owns the numerical contract and VJP:
rfft(x) is unnormalized, the inverse is normalized, and the backward kernels include the
half-spectrum adjoint factors for real FFTs. This tape node simply records those three parent
dependencies and checks the runtime shapes before calling the native kernels.
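For the forward contract only, a NumPy sketch (np.fft matches the stated normalization: unnormalized rfft, normalized inverse; the VJP machinery is not reproduced here):

```python
import numpy as np

def spectral_conv1d(x, w_re, w_im):
    # x: (grid, width); w_re, w_im: (modes, width, width); y: (grid, width)
    grid, _ = x.shape
    modes = w_re.shape[0]                    # requires modes <= grid // 2 + 1
    xf = np.fft.rfft(x, axis=0)              # (grid//2 + 1, width), complex
    yf = np.zeros_like(xf)
    # mix channels per retained frequency; higher modes stay zero (truncation)
    yf[:modes] = np.einsum('mi,mio->mo', xf[:modes], w_re + 1j * w_im)
    return np.fft.irfft(yf, n=grid, axis=0)  # back to (grid, width), real
```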
Linear layer / losses #
Mean-squared-error loss with "mean" reduction (single scalar output).
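The loss and its gradient in NumPy, for reference:

```python
import numpy as np

def mse_mean(pred, target):
    # "mean" reduction: one scalar over all elements
    return ((pred - target) ** 2).mean()

def mse_mean_grad(pred, target):
    # d(loss)/d(pred) = 2 * (pred - target) / N
    return 2.0 * (pred - target) / pred.size
```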
Concat / slice (1D) #
Concat / slice along dim 0 #
Concatenate along dim 0 for tensors with a leading dimension (same name as the CPU tape).
Gather / scatter (host Nat indices) #
Indices are non-differentiable and remain on the host. Kernels totalize out-of-bounds indices as
documented in NN.Runtime.Autograd.Engine.Cuda.Kernels.
Gather k scalars from a length-n vector.
Gather k rows from a (rows, cols) matrix (row-major).
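Shape-wise, the two gathers look like this in NumPy. The clamp below is only a stand-in: the kernels' actual out-of-bounds totalization is the one documented in NN.Runtime.Autograd.Engine.Cuda.Kernels.

```python
import numpy as np

def gather_scalars(x, idx):
    # k scalars from a length-n vector; idx stays on the host
    idx = np.clip(np.asarray(idx), 0, x.shape[0] - 1)  # stand-in OOB policy
    return x[idx]

def gather_rows(x, idx):
    # k rows from a (rows, cols) row-major matrix
    idx = np.clip(np.asarray(idx), 0, x.shape[0] - 1)  # stand-in OOB policy
    return x[idx]
```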
Conv2D + pooling (ConvPool FFI) #
ConvTranspose2D (ConvPool FFI) #
Generic naming wrappers #
The CUDA tape exposes conv/max_pool/avg_pool/smooth_max_pool using the same names as the
CPU tape. These dispatch to the ConvPool CUDA FFI entrypoints that take per-axis parameters as
Array Nat (rank ≤ 8).
The *2d* wrappers remain as concise convenience names for the common rank-2 case.
Normalization #
LayerNorm over the last dimension for (seqLen, embedDim) buffers.
This implementation uses the standard stable formulas and is expressed in terms of existing CUDA kernels (axis reductions + broadcasts + pointwise ops).
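The standard formulas, as a NumPy reference (the eps value and affine parameters here are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (seqLen, embedDim); normalize each row over the last dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```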
BatchNorm for a single channel-first image (C,H,W) (no batch axis).
We normalize per-channel across the spatial dimension H*W, reusing the same math as layer-norm
by treating the buffer as a (channels, height*width) matrix.
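The reshape trick, sketched in NumPy (per-channel affine parameters assumed; eps is illustrative):

```python
import numpy as np

def batch_norm_chw(x, gamma, beta, eps=1e-5):
    # x: (C, H, W); normalize each channel over its H*W spatial positions
    c, h, w = x.shape
    flat = x.reshape(c, h * w)               # treat as a (channels, H*W) matrix
    mu = flat.mean(axis=1, keepdims=True)
    var = flat.var(axis=1, keepdims=True)
    out = (flat - mu) / np.sqrt(var + eps)
    return (gamma[:, None] * out + beta[:, None]).reshape(c, h, w)
```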
Softmax (last axis, row folding) #
We implement softmax along the last axis by folding all leading dimensions into one rows axis.
This covers:
- 2D softmax ((rows, cols)),
- 3D batched softmax ((batch, rows, cols)) by folding batch*rows into rows.
Stable log-softmax along the last axis, implemented directly on CUDA buffers.
Multi-head self-attention #
Forward structure matches Spec.MultiHeadAttention.forward:
- Q = x @ Wq, K = x @ Wk, V = x @ Wv
- reshape to heads (numHeads, n, headDim)
- attention per head (batched): softmax(Q Kᵀ / sqrt(headDim)) @ V
- combine heads, then output projection @ Wo
Masking:
- If mask is provided, we upload it as a float32 {0,1} matrix and apply the same semantics as the spec: masked logits are replaced by -1000.0 before softmax, and gradients through masked entries are zeroed.
- This incurs a host-to-device copy for the mask (since the mask is a host Tensor Bool).
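Putting the pieces together, a NumPy sketch of the forward pass (unbatched, single sequence; mask here is a boolean (n, n) array with True meaning "keep", which is an assumption about the {0,1} encoding):

```python
import numpy as np

def mha_forward(x, wq, wk, wv, wo, num_heads, mask=None):
    # x: (n, d); wq, wk, wv, wo: (d, d); requires d % num_heads == 0
    n, d = x.shape
    hd = d // num_heads
    def split(m):                                    # (n, d) -> (heads, n, hd)
        return m.reshape(n, num_heads, hd).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(hd)  # (heads, n, n)
    if mask is not None:
        # spec semantics: masked logits become -1000.0 before the softmax
        logits = np.where(mask, logits, -1000.0)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    heads = attn @ v                                 # (heads, n, hd)
    return heads.transpose(1, 0, 2).reshape(n, d) @ wo
```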