TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Ops

Small helpers #

Checked Nat → UInt32 conversion for CUDA boundaries. Errors if n ≥ 2^32.

Number of elements in a runtime shape, checked for the UInt32 CUDA ABI.

Fold all leading axes into a row count and keep the last axis as the column count.

CUDA softmax/log-softmax kernels are 2D row kernels. This helper gives the shared convention used for vectors, matrices, and higher-rank tensors: softmax is always along the last axis.
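As a concrete instance of the convention, a (2, 3, 4) tensor folds to 6 rows of 4 columns, so softmax normalizes each length-4 slice independently. A minimal sketch in plain Lean (rowsColsDemo is illustrative only, not the engine's helper):

-- Fold all leading axes into the row count; the last axis is the column count.
-- Assumes a nonempty shape (getLast! panics on []).
def rowsColsDemo (shape : List Nat) : Nat × Nat :=
  (shape.dropLast.foldl (· * ·) 1, shape.getLast!)

#eval rowsColsDemo [2, 3, 4]  -- (6, 4)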

Broadcast a scalar CUDA buffer to outShape. Used by scalar reductions during backprop.

Logistic sigmoid implemented from primitive CUDA elementwise ops.

Hyperbolic tangent implemented as (exp(2x)-1)/(exp(2x)+1).

Numerically stable softplus: max(x,0) + log(1 + exp(-abs(x))).
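Why the stable form agrees with the textbook softplus(x) = log(1 + exp(x)): for x ≥ 0, log(1 + exp(x)) = x + log(1 + exp(-x)) = max(x,0) + log(1 + exp(-abs(x))); for x < 0, max(x,0) = 0 and abs(x) = -x, so the expression reduces to log(1 + exp(x)) directly. Both branches only ever exponentiate a non-positive argument, so exp cannot overflow.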

Row-wise stable softmax plus scratch buffers that should be released after the backward pass.

The output is the first component. The scratch list records temporary buffers created while computing the stable formula so long training runs do not wait for native finalizers.

Row-wise softmax VJP: dX = y * (dY - sum(dY*y, axis=1)).
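Derivation sketch: within a row, y = softmax(x) has Jacobian ∂yⱼ/∂xᵢ = yⱼ·(δᵢⱼ - yᵢ), so dXᵢ = Σⱼ dYⱼ·yⱼ·(δᵢⱼ - yᵢ) = yᵢ·(dYᵢ - Σⱼ dYⱼ·yⱼ); the row sum sum(dY*y, axis=1) is computed once per row and broadcast back over the columns.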

Row-wise stable log-softmax plus scratch buffers that should be released after backprop.

This computes x - rowMax - log(sum(exp(x - rowMax))) directly, avoiding the less stable log(softmax(x)) route.

Row-wise log-softmax VJP: dX = dY - exp(y) * sum(dY, axis=1).
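Derivation sketch: within a row, y = log-softmax(x) satisfies exp(y) = softmax(x) and ∂yⱼ/∂xᵢ = δᵢⱼ - exp(yᵢ), so dXᵢ = dYᵢ - exp(yᵢ)·Σⱼ dYⱼ; as above, the row sum sum(dY, axis=1) is broadcast back over the columns.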

Elementwise ops #

Elementwise addition.

Elementwise subtraction.

Elementwise multiplication.

Multiply by a scalar constant.

Elementwise abs.

Elementwise sqrt.

Clamp each element to [lo, hi].

Elementwise max.

Elementwise min.

Elementwise division.

Elementwise ReLU.

Elementwise exp.

Elementwise log.

Elementwise reciprocal 1/x.

Elementwise "safe log" that protects against log(0) by adding a small ε internally.

Spec semantics: log(softplus(x) + ε).
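Note on the guard: in exact arithmetic softplus(x) = log(1 + exp(x)) is strictly positive, but in floating point it underflows to 0 for very negative x; the ε keeps the outer log's argument bounded away from zero in exactly that case.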

Elementwise sigmoid (logistic).

Elementwise tanh.

Elementwise softplus.

Reductions / views #

Reduce-sum of all entries, producing a scalar.

Flatten s into a 1D vector of length Shape.size s.

def Runtime.Autograd.Cuda.Tape.reshape {s₁ s₂ : Spec.Shape} (t : Tape) (xId : ℕ) (_h : s₁.size = s₂.size) :

Reshape a buffer while preserving the number of elements.

This is a no-copy view operation: it reuses the same contiguous buffer.
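A usage sketch (the call and result plumbing are hypothetical; the point is the proof obligation). Both shapes below contain 6 elements, so s₁.size = s₂.size should close with a small tactic such as decide:

def s23 : Spec.Shape := Spec.Shape.dim 2 (Spec.Shape.dim 3 Spec.Shape.scalar)
def s32 : Spec.Shape := Spec.Shape.dim 3 (Spec.Shape.dim 2 Spec.Shape.scalar)
-- hypothetical call: t.reshape (s₁ := s23) (s₂ := s32) xId (by decide)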

Transpose a 2D buffer.

Swap adjacent axes at a given depth in an N-D buffer.

If depth is out of range, this is treated as the identity (matches the spec-layer helper).

Permute a 3D tensor (a,b,c) → (b,c,a).

Permute a 3D tensor (a,b,c) → (c,a,b).

Swap the last two axes of a 3D tensor (a,b,c) → (a,c,b).

def Runtime.Autograd.Cuda.Tape.broadcastTo {s₁ s₂ : Spec.Shape} (t : Tape) (cb : s₁.CanBroadcastTo s₂) (xId : ℕ) :

Broadcast x : s₁ to s₂.

Forward: broadcastTo. Backward: sum-reduce broadcasted axes (reduceFromBroadcastTo).
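Why the backward pass sum-reduces: broadcasting copies each element of x to every output position along the broadcast axes, so the adjoint must accumulate, and each input gradient entry is the sum of the output gradient over all positions that element was copied to.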
def Runtime.Autograd.Cuda.Tape.reduceSum {s : Spec.Shape} (axis : ℕ) [valid : Spec.Shape.valid_axis_inst axis s] [wf : s.WellFormed] (t : Tape) (xId : ℕ) :

Reduce-sum along axis.

def Runtime.Autograd.Cuda.Tape.reduceMean {s : Spec.Shape} (axis : ℕ) [valid : Spec.Shape.valid_axis_inst axis s] [wf : s.WellFormed] (t : Tape) (xId : ℕ) :

Reduce-mean along axis.

Linear algebra #

def Runtime.Autograd.Cuda.Tape.matmul {m n p : ℕ} (t : Tape) (aId bId : ℕ) :

2D matrix multiply.

def Runtime.Autograd.Cuda.Tape.bmm {batch m n p : ℕ} (t : Tape) (aId bId : ℕ) :

Batched matrix multiply.

def Runtime.Autograd.Cuda.Tape.spectralConv1dRfft {grid width modes : ℕ} (t : Tape) (xId wReId wImId : ℕ) :

Fused real-FFT spectral convolution used by the CUDA FNO1D path.

Shapes:

• x : (grid, width),
• wRe, wIm : (modes, width, width),
• output y : (grid, width).

The low-level buffer primitive owns the numerical contract and VJP: rfft(x) is unnormalized, the inverse is normalized, and the backward kernels include the half-spectrum adjoint factors for real FFTs. This tape node simply records those three parent dependencies and checks the runtime shapes before calling the native kernels.
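As a sketch of the usual FNO-style spectral product (the low-level primitive owns the exact contract, so treat this as orientation only): with X̂ = rfft(x) along the grid axis, Ŷ[k, o] = Σ over c of X̂[k, c]·(wRe[k, c, o] + i·wIm[k, c, o]) for k < modes, zero for the remaining frequencies, followed by the normalized inverse real FFT back to (grid, width).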

Linear layer / losses #

def Runtime.Autograd.Cuda.Tape.linear {outDim inDim : ℕ} (t : Tape) (wId bId xId : ℕ) :

Linear layer: y = W·x + b with W : (outDim,inDim), x : inDim, b : outDim.

def Runtime.Autograd.Cuda.Tape.mseLoss {s : Spec.Shape} (t : Tape) (yhatId targetId : ℕ) :

Mean-squared-error loss with "mean" reduction (single scalar output).
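In the usual "mean" convention this is loss = (1/N)·Σᵢ (ŷᵢ - targetᵢ)² with N = Shape.size s, so the gradient with respect to ŷ is (2/N)·(ŷ - target): the mean reduction scales the backward signal by 1/N.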

Concat / slice (1D) #

def Runtime.Autograd.Cuda.Tape.concat1d {n m : ℕ} (t : Tape) (aId bId : ℕ) :

Concatenate two 1D buffers.

Concatenate two 1D tensors (CPU tape name).

def Runtime.Autograd.Cuda.Tape.slice1d {n start len : ℕ} (t : Tape) (xId : ℕ) :

Slice a 1D buffer.

Concat / slice along dim 0 #

Concatenate along dim 0 for tensors with a leading dimension (CPU tape name).

def Runtime.Autograd.Cuda.Tape.sliceRange0 {n : ℕ} {s : Spec.Shape} (t : Tape) (xId start len : ℕ) (_h : len + start ≤ n) :

Slice along dim 0: x[start:start+len] (CPU tape name).

Gather / scatter (host Nat indices) #

Indices are non-differentiable and remain on the host. Kernels totalize out-of-bounds indices as documented in NN.Runtime.Autograd.Engine.Cuda.Kernels.

Gather a scalar from a 1D vector using a compile-time index.

def Runtime.Autograd.Cuda.Tape.gatherRow {rows cols : ℕ} (t : Tape) (xId : ℕ) (i : Fin rows) :

Gather a row from a 2D matrix using a compile-time index.

Gather a scalar from a 1D vector using a runtime Nat index (totalized by the kernel).

Gather k scalars from a length-n vector.

Gather k rows from a (rows, cols) matrix (row-major).

def Runtime.Autograd.Cuda.Tape.scatterAddVec {n : ℕ} (t : Tape) (xId vId : ℕ) (i : Fin n) :

Scatter-add into a vector: out = x with out[i] += v.

def Runtime.Autograd.Cuda.Tape.scatterAddRow {rows cols : ℕ} (t : Tape) (xId vId : ℕ) (i : Fin rows) :

Scatter-add into a matrix row: out = x with out[i,:] += v.
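Gather and scatter-add pair up under differentiation: the VJP of gathering at index i scatter-adds the incoming gradient at i (duplicate indices must accumulate, hence add rather than write), while the VJP of a scatter-add gathers the output gradient at i for v and passes it through unchanged for x.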

Conv2D + pooling (ConvPool FFI) #

def Runtime.Autograd.Cuda.Tape.conv2d {inC outC kH kW stride padding inH inW : ℕ} {h1 : inC ≠ 0} {h2 : kH ≠ 0} {h3 : kW ≠ 0} (t : Tape) (kernelId biasId inputId : ℕ) :

Conv2D forward/backward via ConvPool FFI (single image, channels-first).

ConvTranspose2D (ConvPool FFI) #

def Runtime.Autograd.Cuda.Tape.convTranspose2d {inC outC kH kW stride padding inH inW : ℕ} {h1 : inC ≠ 0} {h2 : kH ≠ 0} {h3 : kW ≠ 0} (t : Tape) (kernelId biasId inputId : ℕ) :

ConvTranspose2D forward/backward via ConvPool FFI (single image, channels-first).

Generic naming wrappers #

The CUDA tape exposes conv/max_pool/avg_pool/smooth_max_pool using the same names as the CPU tape. These dispatch to the ConvPool CUDA FFI entrypoints that take per-axis parameters as Array Nat (rank ≤ 8).

The 2d wrappers remain as concise convenience names for the common rank-2 case.

def Runtime.Autograd.Cuda.Tape.conv {d inC outC : ℕ} {kernel stride padding inSpatial : Vector ℕ d} (t : Tape) (kernelId biasId inputId : ℕ) (hInC : inC ≠ 0) (hKernel : ∀ (i : Fin d), kernel.get i ≠ 0) :

N-D convolution (CUDA) via ConvPool FFI (rank ≤ 8).

def Runtime.Autograd.Cuda.Tape.convTranspose {d inC outC : ℕ} {kernel stride padding inSpatial : Vector ℕ d} (t : Tape) (kernelId biasId inputId : ℕ) (hInC : inC ≠ 0) (hKernel : ∀ (i : Fin d), kernel.get i ≠ 0) :

N-D transposed convolution (CUDA) via ConvPool FFI (rank ≤ 8).
def Runtime.Autograd.Cuda.Tape.maxPool2d {kH kW inH inW inC stride : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape) (xId : ℕ) :

MaxPool2D via ConvPool FFI (no padding).

def Runtime.Autograd.Cuda.Tape.maxPool2dPad {kH kW inH inW inC stride padding : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape) (xId : ℕ) :

MaxPool2D via ConvPool FFI (with symmetric padding).

def Runtime.Autograd.Cuda.Tape.maxPool {d C : ℕ} {inSpatial kernel stride padding : Vector ℕ d} {hKernel : ∀ (i : Fin d), kernel.get i ≠ 0} (t : Tape) (xId : ℕ) :

N-D max pooling (CUDA) via ConvPool FFI (rank ≤ 8).

def Runtime.Autograd.Cuda.Tape.smoothMaxPool2d {kH kW inH inW inC stride : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape) (xId : ℕ) (beta : Float) :

Smooth max-pool2d (log-sum-exp surrogate) via ConvPool FFI (no padding).

def Runtime.Autograd.Cuda.Tape.smoothMaxPool2dPad {kH kW inH inW inC stride padding : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape) (xId : ℕ) (beta : Float) :

Smooth max-pool2d (log-sum-exp surrogate) via ConvPool FFI (with symmetric padding).

def Runtime.Autograd.Cuda.Tape.smoothMaxPool {d C : ℕ} {inSpatial kernel stride padding : Vector ℕ d} {hKernel : ∀ (i : Fin d), kernel.get i ≠ 0} (t : Tape) (xId : ℕ) (beta : Float) :

N-D smooth max pooling (CUDA) via ConvPool FFI (rank ≤ 8).
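The log-sum-exp surrogate replaces the hard window maximum with smoothmax(x) = (1/beta)·log(Σᵢ exp(beta·xᵢ)) over each pooling window: it tends to the true max as beta → ∞, and its gradient spreads over the window like a softmax with temperature 1/beta instead of flowing only through the argmax element.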
def Runtime.Autograd.Cuda.Tape.avgPool2d {kH kW inH inW inC stride : ℕ} (h1 : kH ≠ 0) (h2 : kW ≠ 0) (t : Tape) (xId : ℕ) :

AvgPool2D via ConvPool FFI (no padding).

def Runtime.Autograd.Cuda.Tape.avgPool2dPad {kH kW inH inW inC stride padding : ℕ} (h1 : kH ≠ 0) (h2 : kW ≠ 0) (t : Tape) (xId : ℕ) :

AvgPool2D via ConvPool FFI (with symmetric padding).

def Runtime.Autograd.Cuda.Tape.avgPool {d C : ℕ} {inSpatial kernel stride padding : Vector ℕ d} (hKernel : ∀ (i : Fin d), kernel.get i ≠ 0) (t : Tape) (xId : ℕ) :

N-D average pooling (CUDA) via ConvPool FFI (rank ≤ 8).

Normalization #

def Runtime.Autograd.Cuda.Tape.layerNorm {seqLen embedDim : ℕ} (h_seq_pos : seqLen > 0) (h_embed_pos : embedDim > 0) (t : Tape) (xId gammaId betaId : ℕ) :

LayerNorm over the last dimension for (seqLen, embedDim) buffers.

This implementation uses the standard stable formulas and is expressed in terms of existing CUDA kernels (axis reductions + broadcasts + pointwise ops).
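Concretely, in the usual convention (ε placement is the only common variant): for each length-embedDim row x, μ = mean(x), σ² = mean((x - μ)²), and y = gamma * (x - μ) / sqrt(σ² + ε) + beta, with μ and σ² produced by axis reductions and broadcast back over the row.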
def Runtime.Autograd.Cuda.Tape.batchnormChannelFirst {channels height width : ℕ} (h_c : channels > 0) (h_h : height > 0) (h_w : width > 0) (t : Tape) (xId gammaId betaId : ℕ) :

BatchNorm for a single channel-first image (C,H,W) (no batch axis).

We normalize per-channel across the H*W spatial positions, reusing the same math as layer-norm by treating the buffer as a (channels, height*width) matrix.

Softmax (last axis, row folding) #

We implement softmax along the last axis by folding all leading dimensions into one rows axis. This covers vectors, matrices, and higher-rank tensors alike, using the shared row-folding convention described under Small helpers: softmax is always along the last axis.

Stable log-softmax along the last axis, implemented directly on CUDA buffers.

Multi-head self-attention #

Forward structure matches Spec.MultiHeadAttention.forward:

1. Q = x @ Wq, K = x @ Wk, V = x @ Wv
2. reshape to heads (numHeads, n, headDim)
3. attention per head (batched): softmax(Q Kᵀ / sqrt(headDim)) @ V
4. combine heads, then output projection @ Wo

Masking: an optional boolean (n, n) mask may be supplied via the mask argument below; it defaults to none, i.e. unmasked attention.

def Runtime.Autograd.Cuda.Tape.multiHeadAttention {n numHeads dModel headDim : ℕ} (h1 : n ≠ 0) (t : Tape) (wqId wkId wvId woId xId : ℕ) (mask : Option (Spec.Tensor Bool (Spec.Shape.dim n (Spec.Shape.dim n Spec.Shape.scalar))) := none) (useFlash : Bool := true) :
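Shape bookkeeping for the steps above (assuming the usual dModel = numHeads * headDim split): x : (n, dModel); after step 2, Q, K, V : (numHeads, n, headDim); the per-head score matrix softmax(Q Kᵀ / sqrt(headDim)) : (numHeads, n, n); and after step 4 the combined heads are projected back to (n, dModel) by Wo.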