TorchLean API

NN.Runtime.Autograd.Engine.Core

Engine Core #

A small dynamic autograd engine over a DAG of tape nodes.

This is intended to be the "runtime" counterpart to Spec.OpSpec: instead of manually writing backward passes end-to-end, you build a tape during a forward pass and then call backward.

Design goals:

Scope boundaries:

References (PyTorch / background reading):

Core Types #

The eager autograd engine is built out of a few small pieces:

@[reducible, inline]

Runtime error monad for the eager autograd engine.

We use plain Except String (instead of IO exceptions) so the tape constructors remain pure and easy to test. Front-ends that prefer exceptions can use okOrThrow.

Instances For

    Convert an Autograd.Result into an IO action by throwing IO.userError on failure.

    This is mainly used by the imperative Torch/TorchLean front-ends to keep their code readable.

    Instances For
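As a concrete reference, here is a minimal self-contained sketch of this error-monad pattern (toy names; the library's actual Result and okOrThrow may differ in detail):

-- Toy sketch of the `Except String` result monad and its IO bridge.
abbrev ToyResult (β : Type) := Except String β

-- Convert a pure result into an IO action by throwing `IO.userError`.
def ToyResult.okOrThrow {β : Type} : ToyResult β → IO β
  | .ok v      => pure v
  | .error msg => throw (IO.userError msg)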

      Pack a typed tensor as a runtime AnyTensor.

      This is the primary bridge between the dependently-typed Tensor α s world and the dynamic tape, which stores heterogeneous shapes in a single array.

      Instances For
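To make the idea concrete, a toy version of such a shape-erased wrapper (with a plain Nat length standing in for Spec.Shape) could look as follows; the library's AnyTensor is presumably a similar dependent pair:

-- Toy stand-in for a shape-erased tensor: the "shape" `s` is packed
-- together with data indexed by that shape.
structure ToyAny (α : Type) where
  s : Nat          -- toy shape: just a length
  t : Fin s → α    -- toy tensor: data indexed by the packed shape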
        def Runtime.Autograd.AnyTensor.cast {α : Type} {s₂ : Spec.Shape} (t : AnyTensor α) (h : t.s = s₂) :
        Spec.Tensor α s₂

        Cast an AnyTensor to a specific shape, given an equality proof.

        This is used after dynamic shape checks (e.g. Tape.requireValue).

        Instances For

          Accumulate two AnyTensor values by elementwise addition, with a dynamic shape check.

          This is the heart of DAG support: if two different paths contribute gradients to the same parent, we sum the contributions.

          Instances For
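Toy versions of cast and add over the ToyAny wrapper above illustrate both operations (sketches, not the library definitions): cast transports along a shape-equality proof, and add performs the dynamic shape check before summing.

-- Cast to a given "shape" using an equality proof (cf. `AnyTensor.cast`).
def ToyAny.cast (a : ToyAny Float) (n : Nat) (h : a.s = n) : Fin n → Float :=
  fun i => a.t (h.symm ▸ i)

-- Accumulate with a dynamic shape check (cf. `AnyTensor.add`).
def ToyAny.add (a b : ToyAny Float) : Except String (ToyAny Float) :=
  if h : a.s = b.s then
    .ok ⟨a.s, fun i => a.t i + b.t (h ▸ i)⟩
  else
    .error "shape mismatch"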
structure Runtime.Autograd.Node (α : Type) : Type

            A tape node representing a single tensor value in the recorded computation graph.

            Fields:

            • value: the forward value (shape-erased).
            • parents: ids of parent nodes in the tape.
            • backward: a local VJP rule. Given an upstream cotangent for value, it returns a list of (parentId, parentCotangent) contributions (one per parent, usually).

            PyTorch comparison: analogous to an autograd Function instance + saved tensors, but here we store the backward closure directly.

            • name : Option String

              Optional node name used for debugging and pretty-printing.

            • value : AnyTensor α

              Forward value computed at this node (shape-erased).

            • requires_grad : Bool

              Whether reverse-mode propagation should visit this node.

              If false, reverse-mode traversal skips this node and does not accumulate gradients into it.

• parents : List ℕ

              Parent node ids (dependencies) in the tape.

• backward : AnyTensor α → Result (List (ℕ × AnyTensor α))

              Local VJP rule for this node.

              Given an upstream cotangent for value, return a list of (parentId, parentCotangent) contributions. If multiple children contribute to the same parent, the engine will sum contributions via AnyTensor.add.

            Instances For
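For orientation, a hand-built leaf-like node would use the fields documented above; this is a hedged sketch in which someValue : AnyTensor α is a placeholder, not library code:

-- { name          := some "w"
--   value         := someValue
--   requires_grad := true
--   parents       := []
--   backward      := fun _ => .ok [] }   -- a leaf contributes no parent gradients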
structure Runtime.Autograd.Tape (α : Type) : Type

              Autograd tape: a grow-only array of nodes.

              Node ids are array indices (Nat). All ops append exactly one node and return its id. This makes it easy to implement reverse-mode by traversing ids in reverse order.

              • nodes : Array (Node α)

                Tape nodes in evaluation order.

                Node ids are array indices (Nat). Each tape op appends exactly one node and returns its id.

              Instances For

                Tape Construction #

                The Tape namespace provides pure constructors for building a recorded computation graph. Each op appends exactly one node and returns the updated tape plus the new node id.

                If you prefer an implicit tape-threading style, see NN.Runtime.Autograd.Engine.TapeM.
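A hypothetical construction sketch, using op names from this page (whether checked ops return Tape α × ℕ directly or wrapped in Result is an assumption here):

-- let (t, xId) := Tape.empty.leaf x (name := some "x")
-- let (t, wId) := t.leaf w (name := some "w")
-- let (t, yId) ← t.mul wId xId   -- records y = w * x
-- let (t, lId) ← t.sum yId       -- scalar loss node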

                Empty tape (no nodes).

                Instances For

                  Number of nodes stored in the tape.

                  Instances For
def Runtime.Autograd.Tape.getNode? {α : Type} (t : Tape α) (id : ℕ) :
Option (Node α)

                    Read a node by id (returns none if out of bounds).

                    Instances For
def Runtime.Autograd.Tape.getValue? {α : Type} (t : Tape α) (id : ℕ) :
Option (AnyTensor α)

                      Read just the stored forward value for a node id.

                      Instances For
def Runtime.Autograd.Tape.addNode {α : Type} (t : Tape α) (node : Node α) :
Tape α × ℕ

                        Append a node and return its id.

                        Invariant: the returned id is t.size, the pre-append size of the tape.

                        Instances For
                          @[simp]
                          theorem Runtime.Autograd.Tape.addNode_id {α : Type} (t : Tape α) (node : Node α) :
                          (t.addNode node).2 = t.size

                          addNode returns the current tape size as the fresh node id.

                          @[simp]
                          theorem Runtime.Autograd.Tape.size_addNode {α : Type} (t : Tape α) (node : Node α) :
                          (t.addNode node).1.size = t.size + 1

                          Appending a node increases the tape size by one.
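The id-allocation invariant is easy to model in isolation; the following self-contained toy (not the library source) mirrors both simp lemmas above:

structure ToyTape where
  nodes : Array String

-- Append a node and return the pre-append size as its fresh id.
def ToyTape.addNode (t : ToyTape) (n : String) : ToyTape × Nat :=
  ({ nodes := t.nodes.push n }, t.nodes.size)

theorem ToyTape.addNode_id (t : ToyTape) (n : String) :
    (t.addNode n).2 = t.nodes.size := rfl

theorem ToyTape.size_addNode (t : ToyTape) (n : String) :
    (t.addNode n).1.nodes.size = t.nodes.size + 1 := by
  simp [ToyTape.addNode]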

                          def Runtime.Autograd.Tape.leaf {α : Type} {s : Spec.Shape} (t : Tape α) (value : Spec.Tensor α s) (name : Option String := none) (requires_grad : Bool := true) :

                          Add a leaf node (no parents).

                          PyTorch comparison: a tensor that enters the graph as a leaf (e.g. input or parameter value).

                          Instances For

                            Read a typed tensor value from a tape node id.

                            This is the main "dynamic check" boundary in the eager runtime:

                            • fails if the id is invalid, or
                            • fails if the stored runtime shape doesn't match the expected dependent shape s.
                            Instances For

                              Read a typed upstream gradient tensor from a runtime AnyTensor.

                              This is the backward analogue of Tape.requireValue: it checks that the upstream gradient has the expected shape τ and then performs the dependent cast.

                              Instances For
def Runtime.Autograd.Tape.unary {α : Type} [DecidableEq Spec.Shape] {σ τ : Spec.Shape} (t : Tape α) (opName : String) (xId : ℕ) (forward : Spec.Tensor α σ → Spec.Tensor α τ) (backward : Spec.Tensor α σ → Spec.Tensor α τ → Spec.Tensor α σ) :

                                Generic constructor for unary ops.

                                You provide:

                                • forward : Tensor α σ → Tensor α τ
                                • backward : Tensor α σ → Tensor α τ → Tensor α σ (a VJP rule; note it may depend on the input)

                                The returned node stores the forward value and a backward closure that checks the upstream gradient's shape and returns the parent contribution.

                                Instances For
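As a scalar analogue of the unary contract (an illustration, not library code), a squaring op pairs a forward map with a VJP rule that may depend on the input:

-- Forward f x = x² and its VJP rule (x, ȳ) ↦ f'(x) · ȳ = 2·x·ȳ.
def squareForward (x : Float) : Float := x * x

def squareBackward (x ybar : Float) : Float := 2 * x * ybar

#eval squareBackward 3.0 1.0  -- 6.0 = d/dx x² at x = 3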

Flatten a tensor of shape s into a 1D vector of length Shape.size s.

                                  PyTorch comparison: torch.flatten(x) with start_dim=0.

                                  Instances For
def Runtime.Autograd.Tape.reshape {α : Type} [Inhabited α] [DecidableEq Spec.Shape] {s₁ s₂ : Spec.Shape} (t : Tape α) (xId : ℕ) (h : s₁.size = s₂.size) :

                                    Reshape a tensor while preserving number of elements.

                                    The proof argument h enforces Shape.size s₁ = Shape.size s₂. PyTorch comparison: x.reshape(new_shape) / x.view(new_shape) (when valid).

                                    Instances For

                                      Transpose a 2D matrix. PyTorch: x.t() / x.transpose(0,1).

                                      Instances For

                                        Permute a 3D tensor (a,b,c) → (b,c,a). PyTorch: x.permute(1,2,0).

                                        Instances For

                                          Permute a 3D tensor (a,b,c) → (c,a,b). PyTorch: x.permute(2,0,1).

                                          Instances For

                                            Swap the last two axes of a 3D tensor (a,b,c) → (a,c,b). PyTorch: x.transpose(1,2).

                                            Instances For

                                              Swap adjacent axes at a given depth inside a general Shape.

                                              This is a more general analogue of transpose operations.

                                              Instances For
def Runtime.Autograd.Tape.broadcastTo {α : Type} [Inhabited α] [Add α] [Zero α] [DecidableEq Spec.Shape] {s₁ s₂ : Spec.Shape} (cb : s₁.CanBroadcastTo s₂) (t : Tape α) (xId : ℕ) :

                                                Broadcast x : s₁ to s₂ using a proof Shape.CanBroadcastTo s₁ s₂.

                                                PyTorch comparison: implicit broadcasting / x.expand(...).

                                                Instances For
def Runtime.Autograd.Tape.reduceSum {α : Type} [Add α] [Zero α] [Inhabited α] [DecidableEq Spec.Shape] {s : Spec.Shape} (axis : ℕ) [valid : Spec.Shape.valid_axis_inst axis s] [wf : s.WellFormed] (t : Tape α) (xId : ℕ) :

                                                  Sum-reduce along axis.

                                                  PyTorch comparison: torch.sum(x, dim=axis).

                                                  Instances For
def Runtime.Autograd.Tape.reduceMean {α : Type} [Context α] [Inhabited α] [DecidableEq Spec.Shape] {s : Spec.Shape} (axis : ℕ) [valid : Spec.Shape.valid_axis_inst axis s] [wf : s.WellFormed] (t : Tape α) (xId : ℕ) :

                                                    Mean-reduce along axis.

                                                    Backward rule: broadcast the upstream cotangent back to s and divide by the reduced dimension. PyTorch comparison: torch.mean(x, dim=axis).

                                                    Instances For
def Runtime.Autograd.Tape.gatherScalar {α : Type} [Zero α] [DecidableEq Spec.Shape] {n : ℕ} (t : Tape α) (xId : ℕ) (i : Fin n) :

                                                      Gather a scalar from a 1D vector using a compile-time index Fin n.

                                                      PyTorch comparison: x[i] (1D indexing).

                                                      Instances For
def Runtime.Autograd.Tape.gatherRow {α : Type} [Zero α] [DecidableEq Spec.Shape] {rows cols : ℕ} (t : Tape α) (xId : ℕ) (i : Fin rows) :

                                                        Gather a row from a 2D matrix using a compile-time index Fin rows.

                                                        PyTorch comparison: x[i] for 2D tensors (row indexing).

                                                        Instances For

                                                          Gather a scalar from a 1D vector using a runtime Nat index.

                                                          Out-of-bounds indices are totalized to return 0. PyTorch comparison: x[i] would raise on out-of-range; here we return 0 to keep the op total.

                                                          Instances For

                                                            Gather k scalars from a 1D vector using an explicit index tensor.

                                                            Out-of-bounds indices are totalized to 0. In the backward pass, gradients are accumulated for repeated indices (scatter-add semantics). PyTorch comparison: related to torch.gather / advanced indexing.

                                                            Instances For
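The scatter-add backward described above can be sketched on plain arrays (a toy, not the library's tensor-level rule); out-of-bounds indices are dropped and repeated indices accumulate:

def toyScatterAdd (n : Nat) (idx : Array Nat) (gy : Array Float) : Array Float :=
  Id.run do
    let mut gx := Array.mkArray n 0.0
    for k in [0:idx.size] do
      let i := idx[k]!
      if i < n then
        gx := gx.set! i (gx[i]! + gy[k]!)  -- accumulate on repeated indices
    return gx

#eval toyScatterAdd 3 #[0, 2, 0] #[1.0, 10.0, 5.0]  -- #[6.0, 0.0, 10.0]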

                                                              Gather k rows from a 2D matrix using an explicit index tensor.

                                                              Out-of-bounds indices are totalized to zero rows; backward accumulates gradients into selected rows (scatter-add), including repeated indices.

                                                              Instances For
def Runtime.Autograd.Tape.scatterAddVec {α : Type} [Add α] [Zero α] [DecidableEq Spec.Shape] {n : ℕ} (t : Tape α) (xId vId : ℕ) (i : Fin n) :

                                                                Scatter-add into a vector: return a copy of x with x[i] += v.

                                                                Backward: gradient w.r.t. x is the upstream dL/dy, and gradient w.r.t. v is the gathered scalar dL/dy[i].

                                                                Instances For
def Runtime.Autograd.Tape.scatterAddRow {α : Type} [Add α] [Zero α] [DecidableEq Spec.Shape] {rows cols : ℕ} (t : Tape α) (xId vId : ℕ) (i : Fin rows) :

                                                                  Scatter-add into a matrix row: return a copy of x with x[i,:] += v.

                                                                  Backward: gradient w.r.t. v is the gathered row dL/dy[i,:].

                                                                  Instances For
def Runtime.Autograd.Tape.add {α : Type} [Add α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (aId bId : ℕ) :

                                                                    Elementwise addition. PyTorch: torch.add / +.

                                                                    Instances For
def Runtime.Autograd.Tape.sub {α : Type} [Sub α] [Zero α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (aId bId : ℕ) :

                                                                      Elementwise subtraction. PyTorch: torch.sub / -.

                                                                      Instances For
def Runtime.Autograd.Tape.mul {α : Type} [Mul α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (aId bId : ℕ) :

                                                                        Elementwise multiplication. PyTorch: torch.mul / *.

                                                                        Instances For
def Runtime.Autograd.Tape.scale {α : Type} [Mul α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) (c : α) :

                                                                          Multiply a tensor by a scalar constant. PyTorch: x * c for Python scalar c.

                                                                          Instances For
def Runtime.Autograd.Tape.abs {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) :

                                                                            Elementwise absolute value.

                                                                            Backward uses the sign function (sign_spec) as a subgradient at 0. PyTorch comparison: torch.abs.

                                                                            Instances For
def Runtime.Autograd.Tape.sqrt {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) :

                                                                              Elementwise square root.

                                                                              Backward uses 1 / (2 * sqrt(x)) for x > 0 and 0 otherwise (totalized). PyTorch comparison: torch.sqrt.

                                                                              Instances For
def Runtime.Autograd.Tape.clamp {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) (minVal maxVal : α) :

                                                                                Elementwise clamp to [minVal, maxVal].

                                                                                Backward multiplies by an indicator of the open interval (minVal, maxVal) (zero at boundaries). PyTorch comparison: torch.clamp.

                                                                                Instances For
def Runtime.Autograd.Tape.max {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (aId bId : ℕ) :

                                                                                  Elementwise maximum.

                                                                                  Tie-breaking: when a = b, the upstream gradient is split evenly (0.5) between both inputs. PyTorch comparison: torch.maximum.

                                                                                  Instances For
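A scalar version of this tie-breaking rule (illustration only, not the library's tensor-level backward):

-- On a tie (a = b), split the upstream gradient evenly between both inputs.
def maxVJP (a b gy : Float) : Float × Float :=
  if a > b then (gy, 0)
  else if b > a then (0, gy)
  else (gy / 2, gy / 2)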
def Runtime.Autograd.Tape.min {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (aId bId : ℕ) :

                                                                                    Elementwise minimum.

                                                                                    Tie-breaking: when a = b, the upstream gradient is split evenly (0.5) between both inputs. PyTorch comparison: torch.minimum.

                                                                                    Instances For
def Runtime.Autograd.Tape.relu {α : Type} [Mul α] [Zero α] [Max α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) :

                                                                                      Elementwise ReLU.

                                                                                      PyTorch comparison: torch.relu(x) / torch.nn.functional.relu(x).

                                                                                      Instances For
def Runtime.Autograd.Tape.linear {α : Type} [Add α] [Mul α] [Zero α] [DecidableEq Spec.Shape] {inDim outDim : ℕ} (t : Tape α) (wId bId xId : ℕ) :

                                                                                        Fully-connected linear layer y = W x + b (matvec).

                                                                                        Type-level shapes enforce W : (outDim, inDim), x : (inDim,), b : (outDim,). PyTorch comparison: torch.nn.functional.linear.

                                                                                        Instances For
def Runtime.Autograd.Tape.matmul {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {m n p : ℕ} (t : Tape α) (aId bId : ℕ) :

                                                                                          2D matrix multiplication.

                                                                                          PyTorch comparison: torch.matmul(a, b) for 2D tensors.

                                                                                          Instances For
def Runtime.Autograd.Tape.bmm {α : Type} [Add α] [Mul α] [Zero α] [DecidableEq Spec.Shape] {batch m n p : ℕ} (t : Tape α) (aId bId : ℕ) :

                                                                                            Batched matrix multiplication.

                                                                                            PyTorch comparison: torch.bmm(a, b).

                                                                                            Instances For
def Runtime.Autograd.Tape.concatVectors {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {n m : ℕ} (t : Tape α) (aId bId : ℕ) :

                                                                                              Concatenate two 1D vectors along dimension 0.

                                                                                              PyTorch comparison: torch.cat([a, b], dim=0) for vectors.

                                                                                              Instances For
def Runtime.Autograd.Tape.concatDim0 {α : Type} [DecidableEq Spec.Shape] {n m : ℕ} {s : Spec.Shape} (t : Tape α) (aId bId : ℕ) :

                                                                                                Concatenate two tensors along dimension 0.

                                                                                                PyTorch comparison: torch.cat([a, b], dim=0).

                                                                                                Instances For
def Runtime.Autograd.Tape.sliceRange0 {α : Type} [Zero α] [DecidableEq Spec.Shape] {n : ℕ} {s : Spec.Shape} (t : Tape α) (xId start len : ℕ) (h : len + start ≤ n) :

                                                                                                  Slice along dimension 0: x[start : start+len].

                                                                                                  The proof argument h enforces bounds. PyTorch comparison: x[start:start+len] on tensors with a leading dimension.

                                                                                                  Instances For
def Runtime.Autograd.Tape.conv {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {d inC outC : ℕ} {kernel stride padding inSpatial : Vector ℕ d} (t : Tape α) (kernelId biasId inputId : ℕ) (name : String := "conv") :

                                                                                                    N-D convolution for channels-first tensors (inC, spatial...) (no batch axis).

                                                                                                    This is the generic counterpart to conv2d; conv2d is implemented as a specialization with d = 2, scalar stride, and scalar padding.

                                                                                                    Instances For
def Runtime.Autograd.Tape.conv2d {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {inC outC kH kW stride padding inH inW : ℕ} {h1 : inC ≠ 0} {h2 : kH ≠ 0} {h3 : kW ≠ 0} (t : Tape α) (kernelId biasId inputId : ℕ) :

                                                                                                      2D convolution for channel-first images (inC,inH,inW) (no batch axis).

                                                                                                      PyTorch comparison: torch.nn.functional.conv2d specialized to a single image.

                                                                                                      Instances For
def Runtime.Autograd.Tape.convTranspose {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {d inC outC : ℕ} {kernel stride padding inSpatial : Vector ℕ d} (t : Tape α) (kernelId biasId inputId : ℕ) (name : String := "conv_transpose") :

                                                                                                        N-D transpose convolution for channels-first tensors (inC, spatial...) (no batch axis).

                                                                                                        This is the generic counterpart to conv_transpose2d.

                                                                                                        Kernel layout matches the spec/PyTorch convention (inC, outC, kernel[0], ..., kernel[d-1]).

                                                                                                        PyTorch comparison: torch.nn.functional.conv_transpose{d}d specialized to a single sample (no batch axis).

                                                                                                        Instances For
def Runtime.Autograd.Tape.convTranspose2d {α : Type} [Context α] [DecidableEq Spec.Shape] {inC outC kH kW stride padding inH inW : ℕ} {h1 : inC ≠ 0} {h2 : kH ≠ 0} {h3 : kW ≠ 0} (t : Tape α) (kernelId biasId inputId : ℕ) :

                                                                                                          2D transpose convolution for channel-first images (inC,inH,inW) (no batch axis).

                                                                                                          This is implemented as a specialization of conv_transpose with d = 2, scalar stride, and scalar padding. Kernel layout matches the spec/PyTorch convention (inC,outC,kH,kW).

                                                                                                          PyTorch comparison: torch.nn.functional.conv_transpose2d specialized to a single image.

                                                                                                          Instances For
def Runtime.Autograd.Tape.maxPool {α : Type} [Context α] [DecidableEq Spec.Shape] {d C : ℕ} {inSpatial kernel stride padding : Vector ℕ d} {hKernel : ∀ (i : Fin d), kernel.get i ≠ 0} (t : Tape α) (xId : ℕ) :

                                                                                                            N-D max pooling for channels-first tensors (C, spatial...) (no batch axis).

                                                                                                            Padding is symmetric per-axis and uses zeros. To model unpadded pooling, pass padding := 0 on every axis.

                                                                                                            Instances For
def Runtime.Autograd.Tape.avgPool {α : Type} [Context α] [DecidableEq Spec.Shape] {d C : ℕ} {inSpatial kernel stride padding : Vector ℕ d} (hKernel : ∀ (i : Fin d), kernel.get i ≠ 0) (t : Tape α) (xId : ℕ) :

                                                                                                              N-D average pooling for channels-first tensors (C, spatial...) (no batch axis).

                                                                                                              Padding is symmetric per-axis and uses zeros; pooling uses count_include_pad=true semantics.

                                                                                                              Instances For
def Runtime.Autograd.Tape.smoothMaxPool {α : Type} [Context α] [DecidableEq Spec.Shape] {d C : ℕ} {inSpatial kernel stride padding : Vector ℕ d} {hKernel : ∀ (i : Fin d), kernel.get i ≠ 0} (t : Tape α) (xId : ℕ) (beta : α) :

                                                                                                                N-D smooth max pooling (log-sum-exp surrogate) for channels-first tensors (C, spatial...).

                                                                                                                Instances For
def Runtime.Autograd.Tape.maxPool2d {α : Type} [Context α] [DecidableEq Spec.Shape] {kH kW inH inW inC stride : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape α) (xId : ℕ) :

                                                                                                                  2D max-pooling for channel-first images (no batch axis).

                                                                                                                  PyTorch comparison: torch.nn.functional.max_pool2d.

                                                                                                                  Instances For
def Runtime.Autograd.Tape.maxPool2dPad {α : Type} [Context α] [DecidableEq Spec.Shape] {kH kW inH inW inC stride padding : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape α) (xId : ℕ) :

                                                                                                                    2D max-pooling with padding for channel-first images (no batch axis).

                                                                                                                    PyTorch comparison: max_pool2d(..., padding=...).

                                                                                                                    Instances For
def Runtime.Autograd.Tape.smoothMaxPool2d {α : Type} [Context α] [DecidableEq Spec.Shape] {kH kW inH inW inC stride : ℕ} {h1 : kH ≠ 0} {h2 : kW ≠ 0} (t : Tape α) (xId : ℕ) (beta : α) :

                                                                                                                      Smooth approximation of max-pooling (softmax pooling).

                                                                                                                      This is not a standard PyTorch primitive; it is useful for differentiable relaxations.

                                                                                                                      Instances For
def Runtime.Autograd.Tape.avgPool2d {α : Type} [Context α] [DecidableEq Spec.Shape] {kH kW inH inW inC stride : ℕ} (h1 : kH ≠ 0) (h2 : kW ≠ 0) (t : Tape α) (xId : ℕ) :

                                                                                                                        2D average-pooling for channel-first images (no batch axis).

                                                                                                                        PyTorch comparison: torch.nn.functional.avg_pool2d.

                                                                                                                        Instances For
def Runtime.Autograd.Tape.avgPool2dPad {α : Type} [Context α] [DecidableEq Spec.Shape] {kH kW inH inW inC stride padding : ℕ} (h1 : kH ≠ 0) (h2 : kW ≠ 0) (t : Tape α) (xId : ℕ) :

                                                                                                                          2D average-pooling with padding for channel-first images (no batch axis).

                                                                                                                          PyTorch comparison: avg_pool2d(..., padding=...).

                                                                                                                          Instances For
def Runtime.Autograd.Tape.layerNorm {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {seqLen embedDim : ℕ} (h_seq_pos : seqLen > 0) (h_embed_pos : embedDim > 0) (t : Tape α) (xId gammaId betaId : ℕ) :

                                                                                                                            Layer normalization for (seqLen, embedDim) tensors.

                                                                                                                            This records a single node whose backward returns gradients for x, gamma, and beta. PyTorch comparison: torch.nn.LayerNorm(embedDim) (applied per token) / functional.layer_norm.

                                                                                                                            Instances For
def Runtime.Autograd.Tape.batchnormChannelFirst {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {channels height width : ℕ} (h_c : channels > 0) (h_h : height > 0) (h_w : width > 0) (t : Tape α) (xId gammaId betaId : ℕ) :

                                                                                                                              Batch normalization for channel-first images (C,H,W) (no batch axis).

                                                                                                                              PyTorch comparison: conceptually torch.nn.BatchNorm2d(C) / functional.batch_norm on NCHW, but specialized here to a single image.

                                                                                                                              Instances For
def Runtime.Autograd.Tape.multiHeadAttention {α : Type} [Context α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [DecidableEq Spec.Shape] {n numHeads dModel headDim : ℕ} (h1 : n ≠ 0) (t : Tape α) (wqId wkId wvId woId xId : ℕ) (mask : Option (Spec.Tensor Bool (Spec.Shape.dim n (Spec.Shape.dim n Spec.Shape.scalar))) := none) :

                                                                                                                                Multi-head self-attention.

This is a shape-specialized attention primitive used by transformer-style models. It accepts an optional boolean (n,n) mask and returns the attended output of shape (n,dModel).

                                                                                                                                PyTorch comparison: similar to torch.nn.MultiheadAttention / scaled dot-product attention.

                                                                                                                                Instances For

                                                                                                                                  Elementwise logistic sigmoid activation.

                                                                                                                                  This builds a tape node whose forward pass is Activation.sigmoid_spec, and whose backward pass multiplies the upstream gradient by Activation.sigmoid_deriv_spec (i.e. σ(x) * (1 - σ(x)), pointwise).

                                                                                                                                  PyTorch comparison: torch.sigmoid / torch.nn.functional.sigmoid. Reference: https://pytorch.org/docs/stable/generated/torch.sigmoid.html

                                                                                                                                  Instances For

                                                                                                                                    Elementwise hyperbolic tangent activation.

Forward uses Activation.tanh_spec; backward uses Activation.tanh_deriv_spec (the pointwise derivative 1 - tanh(x)^2).

                                                                                                                                    PyTorch comparison: torch.tanh. Reference: https://pytorch.org/docs/stable/generated/torch.tanh.html

                                                                                                                                    Instances For

                                                                                                                                      Softmax along the last axis (recursing over outer dimensions).

                                                                                                                                      This matches Activation.softmax_spec (which applies softmax to the final dimension and recurses over earlier dimensions). The backward pass uses the standard Jacobian-vector product implemented by Activation.softmax_backward_spec, avoiding materializing an n×n Jacobian per slice.

                                                                                                                                      PyTorch comparison: torch.softmax(x, dim=-1). Reference: https://pytorch.org/docs/stable/generated/torch.softmax.html

                                                                                                                                      Instances For

                                                                                                                                        Stable log-softmax along the last axis.

                                                                                                                                        Unlike log (softmax x), this uses Activation.logSoftmaxSpec, i.e. the max-shifted x - max(x) - log(sum(exp(x - max(x)))) formulation. That matches the numerical contract of torch.nn.functional.log_softmax and is the right primitive for cross-entropy on logits.

                                                                                                                                        Instances For
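The max-shifted formulation reads as follows on a plain Float vector (a toy sketch assuming a nonempty input; not the library's tensor-level Activation.logSoftmaxSpec):

def toyLogSoftmax (xs : Array Float) : Array Float :=
  let m := xs.foldl max xs[0]!                                     -- max(x)
  let z := (xs.map fun x => Float.exp (x - m)).foldl (· + ·) 0.0   -- Σ exp(x - m)
  xs.map fun x => x - m - Float.log z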

                                                                                                                                          Elementwise softplus activation.

                                                                                                                                          Forward uses Activation.softplus_spec; backward uses Activation.softplus_deriv_spec.

                                                                                                                                          PyTorch comparison: torch.nn.functional.softplus. Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.softplus.html

                                                                                                                                          Instances For

                                                                                                                                            Elementwise exponential.

                                                                                                                                            Forward uses exp_spec; backward multiplies by exp(x) (pointwise), i.e. d/dx exp(x) = exp(x).

                                                                                                                                            PyTorch comparison: torch.exp. Reference: https://pytorch.org/docs/stable/generated/torch.exp.html

                                                                                                                                            Instances For

                                                                                                                                              Elementwise natural logarithm.

                                                                                                                                              Forward uses log_spec; backward multiplies by 1/x (pointwise), i.e. d/dx log(x) = 1/x (on its mathematical domain; this runtime does not model NaNs/Infs explicitly).

                                                                                                                                              PyTorch comparison: torch.log. Reference: https://pytorch.org/docs/stable/generated/torch.log.html

                                                                                                                                              Instances For

                                                                                                                                                Elementwise reciprocal x ↦ 1/x.

                                                                                                                                                Backward implements d/dx (x⁻¹) = -(x⁻¹)² (pointwise).

                                                                                                                                                PyTorch comparison: torch.reciprocal. Reference: https://pytorch.org/docs/stable/generated/torch.reciprocal.html

                                                                                                                                                Instances For
def Runtime.Autograd.Tape.safeLog {α : Type} [Context α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) (ε : α := Numbers.epsilon) :

                                                                                                                                                  Elementwise "safe log" that protects against log(0) by adding a small ε internally.

                                                                                                                                                  This uses Activation.safe_log_spec and Activation.safe_log_deriv_spec. The exact behavior is controlled by the spec-layer definition; conceptually it is similar to log(x + ε) used in numerically-stable losses.

                                                                                                                                                  PyTorch comparison: commonly written as torch.log(x + eps) in user code (there is no single dedicated torch.safe_log primitive).

                                                                                                                                                  Instances For
def Runtime.Autograd.Tape.sum {α : Type} [Add α] [Zero α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (xId : ℕ) :

                                                                                                                                                    Reduce-sum over all entries, producing a scalar node.

                                                                                                                                                    Backward replicates the upstream scalar gradient to every entry of the input tensor (i.e. d/dx Σ_i x_i = 1 per coordinate).

                                                                                                                                                    PyTorch comparison: torch.sum(x) with dim=None. Reference: https://pytorch.org/docs/stable/generated/torch.sum.html

                                                                                                                                                    Instances For
def Runtime.Autograd.Tape.mseSpecBasic {α : Type} [Add α] [Sub α] [Mul α] [Div α] [Zero α] [Coe ℕ α] {s : Spec.Shape} (predicted target : Spec.Tensor α s) :
                                                                                                                                                      α

                                                                                                                                                      Mean-squared error (MSE) scalar loss with "mean" reduction over all entries.

                                                                                                                                                      mse_spec_basic is the scalar loss (Σ_i (yhat_i - target_i)^2) / N where N = Shape.size s. This matches the default reduction of torch.nn.functional.mse_loss(..., reduction="mean").

                                                                                                                                                      Note: the derivative is defined everywhere in this spec-level setting; we do not model NaNs/Infs.

                                                                                                                                                      Instances For
def Runtime.Autograd.Tape.mseDerivSpecBasic {α : Type} [Add α] [Sub α] [Mul α] [Div α] [Zero α] [One α] [Coe ℕ α] {s : Spec.Shape} (predicted target : Spec.Tensor α s) :
Spec.Tensor α s

                                                                                                                                                        Gradient of mse_spec_basic with respect to predicted (same shape as the inputs).

                                                                                                                                                        If mse = (Σ_i (yhat_i - target_i)^2) / N, then: ∂mse/∂yhat = (2/N) * (yhat - target).

                                                                                                                                                        Instances For
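A quick numeric check of this formula on N = 2 entries: mse(#[3, 5], #[1, 2]) = ((3-1)² + (5-2)²) / 2 = 6.5, and the gradient is (2/N)·(ŷ - target) = #[2, 3].

#eval
  let yhat := #[3.0, 5.0]
  let tgt  := #[1.0, 2.0]
  (yhat.zip tgt).map fun (y, t) => (2.0 / 2.0) * (y - t)  -- evaluates to #[2.0, 3.0]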
def Runtime.Autograd.Tape.mseLoss {α : Type} [Add α] [Sub α] [Mul α] [Div α] [Zero α] [One α] [Coe ℕ α] [DecidableEq Spec.Shape] {s : Spec.Shape} (t : Tape α) (yhatId targetId : ℕ) :

                                                                                                                                                          Tape node for MSE loss with "mean" reduction.

                                                                                                                                                          The forward value is a scalar. The backward pass returns gradients for both inputs: dL/dyhat from mse_deriv_spec_basic, and dL/dtarget = - dL/dyhat.

                                                                                                                                                          PyTorch comparison: torch.nn.functional.mse_loss. Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

                                                                                                                                                          Instances For

                                                                                                                                                            Backpropagation #

                                                                                                                                                            Reverse-mode is implemented by traversing node ids in reverse order. Each node’s backward closure produces parent-gradient contributions, which we accumulate by elementwise summation.

def Runtime.Autograd.Tape.addGradDense {α : Type} [Add α] [DecidableEq Spec.Shape] (t : Tape α) (grads : Array (Option (AnyTensor α))) (id : ℕ) (g : AnyTensor α) :

                                                                                                                                                            Internal helper: add a single parent gradient contribution into the dense optional gradient array.

                                                                                                                                                            This is where we implement PyTorch-style accumulation for DAGs: if multiple children contribute to the same parent id, we sum the contributions.

Each dense array entry is none until the corresponding node is first reached during reverse traversal.

                                                                                                                                                            Instances For
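A toy model of this accumulation step on Array (Option Float) (not the library code, which additionally shape-checks via AnyTensor.add):

def toyAddGrad (grads : Array (Option Float)) (id : Nat) (g : Float) :
    Array (Option Float) :=
  match grads[id]? with
  | some (some old) => grads.set! id (some (old + g))  -- later visits: sum
  | some none       => grads.set! id (some g)          -- first visit
  | none            => grads                           -- invalid id: unchanged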
def Runtime.Autograd.Tape.backwardDense {α : Type} [Add α] [DecidableEq Spec.Shape] (t : Tape α) (outId : ℕ) (seed : AnyTensor α) :

                                                                                                                                                              Reverse-mode backpropagation producing a dense array of optional gradients.

                                                                                                                                                              • The result array has length t.nodes.size.
                                                                                                                                                              • Entry id is some g if the node was reached from outId during reverse traversal, otherwise none.
                                                                                                                                                              • When multiple paths contribute to the same node, we sum gradients via AnyTensor.add.

                                                                                                                                                              This is loosely analogous to PyTorch's autograd engine walking the dynamic graph and accumulating .grad for leaf tensors, but we keep gradients for every node id, not just leaves. That makes the runtime easier to debug and gives proof-bridge code direct access to intermediate cotangents.

                                                                                                                                                              Reference (PyTorch): https://pytorch.org/docs/stable/notes/autograd.html

                                                                                                                                                              Instances For
def Runtime.Autograd.Tape.addGradAll {α : Type} [Add α] [DecidableEq Spec.Shape] (t : Tape α) (grads : Array (AnyTensor α)) (id : ℕ) (g : AnyTensor α) :

                                                                                                                                                                Internal helper: like addGradDense, but assumes the gradient array is total (no Option).

                                                                                                                                                                This is used by the proof-friendly variants (backwardDenseFrom*, backwardDenseAll) that keep an explicit zero tensor for nodes that do not receive gradients.

                                                                                                                                                                Instances For

                                                                                                                                                                  One reverse-mode backprop step at a single node id, updating a total dense gradient array.

                                                                                                                                                                  Precondition by convention: acc has one entry per tape node, and every entry has the matching node shape. The function checks those conditions dynamically and returns an error if a caller violates them. This makes it suitable as the small proof-friendly step used by backwardDenseFromLoop.

                                                                                                                                                                  Instances For

                                                                                                                                                                    Reverse-mode accumulation over the first n nodes in reverse order.

                                                                                                                                                                    The recursion visits node ids n-1, n-2, ..., 0. Passing n = t.nodes.size therefore traverses the entire tape. This structurally recursive loop is the one used by proof-linked compiled sessions.

                                                                                                                                                                    Instances For
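The shape of that structurally recursive loop, in isolation (a toy; the library's step function additionally threads Result):

-- Visit ids n-1, n-2, …, 0, updating an accumulator at each id.
def revLoop {β : Type} (step : Nat → β → β) : Nat → β → β
  | 0,     acc => acc
  | n + 1, acc => revLoop step n (step n acc)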

                                                                                                                                                                      Reverse-mode accumulation starting from an explicit dense gradient array.

                                                                                                                                                                      This is a proof-friendly variant: it always runs every node (in reverse order) and keeps a gradient tensor for every node id.

                                                                                                                                                                      Instances For
def Runtime.Autograd.Tape.backwardDenseAll {α : Type} [Add α] [Zero α] [DecidableEq Spec.Shape] (t : Tape α) (outId : ℕ) (seed : AnyTensor α) :

                                                                                                                                                                        Reverse-mode accumulation that returns a dense gradient array for every node id.

                                                                                                                                                                        This differs from backwardDense: instead of leaving entries as none until they are reached, it initializes a zero gradient tensor for each node. This matches the proof-level tape model where gradients are explicit (zero for unused nodes).

                                                                                                                                                                        Instances For

                                                                                                                                                                          Convert the optional dense gradient array returned by backwardDense into a sparse HashMap.

                                                                                                                                                                          Only entries that are present (some (some g)) are kept. This is mainly a convenience for callers that want "just the reached nodes" without carrying nones around.

                                                                                                                                                                          Instances For
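A toy version of this dense-to-sparse conversion on plain arrays (illustration only; the library returns a HashMap keyed by node id):

def toyDenseToSparse (grads : Array (Option Float)) : List (Nat × Float) :=
  grads.toList.enum.filterMap fun (i, og) => og.map fun g => (i, g)

#eval toyDenseToSparse #[some 1.0, none, some 2.5]  -- [(0, 1.0), (2, 2.5)]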

                                                                                                                                                                            Reverse-mode backpropagation returning a HashMap of only the nodes that received gradients.

                                                                                                                                                                            This is a convenience wrapper around backwardDense + denseToHashMap.

                                                                                                                                                                            Instances For

                                                                                                                                                                              Backpropagate from a scalar output with seed gradient 1.

                                                                                                                                                                              PyTorch analogy: loss.backward() when loss is a scalar.

                                                                                                                                                                              Instances For
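Putting the pieces together, a hypothetical end-to-end sketch (op names from this page; backwardScalar is a placeholder name for the scalar "seed 1" entry point documented above, and the Result-wrapped return type of checked ops is an assumption):

-- let (t, xId) := Tape.empty.leaf x
-- let (t, yId) ← t.relu xId
-- let (t, lId) ← t.sum yId             -- scalar loss node
-- let grads    ← t.backwardScalar lId  -- gradients per node id, seed 1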