TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Tape

CUDA Autograd Tape

Shape-erased CUDA tape machinery for float32 Cuda.Buffer values. This is the GPU/runtime analogue of the CPU autograd tape: it records node ids, runtime shapes, parent links, and backward callbacks, then performs dense reverse-mode accumulation over buffers.

@[reducible, inline]

Pure error monad for the CUDA tape. Mirrors Engine/Core.


    Runtime, shape-erased CUDA buffer.

    This plays the same role as Runtime.AnyTensor in the CPU tape: it pairs runtime Shape metadata with an opaque Cuda.Buffer handle.

    • s : Shape

      Runtime shape metadata for the buffer.

    • buf : Buffer

      Device buffer (float32). Length is expected to be Shape.size s.

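The pairing can be pictured with the following sketch. The `Shape` representation, the field name `s`, and modeling `Buffer` as a `FloatArray` are illustrative assumptions, not the library's actual definitions.

```lean
-- Illustrative sketch only; these are not the library's definitions.
structure Shape where
  dims : List Nat

-- Number of scalar elements described by a shape.
def Shape.size (s : Shape) : Nat := s.dims.foldl (· * ·) 1

-- Stand-in for the opaque CUDA device handle.
abbrev Buffer := FloatArray

structure AnyBuffer where
  s   : Shape   -- runtime shape metadata
  buf : Buffer  -- float32 device data; expected length is s.size
```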

      Checked conversion of a Nat length to UInt32, erroring on overflow.


        Number of scalar elements as a UInt32 (checked).


          Allocate a zero-filled buffer for a given shape.


            Accumulate two AnyBuffer values by elementwise addition, with a dynamic shape check.

            This is used by backprop to sum gradient contributions in DAGs.

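The check-then-accumulate behavior might look like the following sketch, using `Except String` as a stand-in for the tape's error monad and a CPU loop in place of the CUDA kernel; all names here are placeholders.

```lean
-- Sketch only: Shape/Buffer/AnyBuffer and the add kernel are stand-ins.
structure Shape where
  dims : List Nat
deriving DecidableEq

abbrev Buffer := FloatArray

structure AnyBuffer where
  s   : Shape
  buf : Buffer

-- CPU stand-in for the elementwise-add device kernel.
def addRaw (x y : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (x.get! i + y.get! i)
  return out

-- Dynamic shape check, then elementwise accumulation.
def AnyBuffer.add (a b : AnyBuffer) : Except String AnyBuffer := do
  if a.s ≠ b.s then
    throw "AnyBuffer.add: shape mismatch"
  pure { s := a.s, buf := addRaw a.buf b.buf }
```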

              CUDA tape node representing one recorded computation step.

              Fields mirror the CPU Engine/Core node:

              • value holds the forward buffer.
              • parents are tape node ids.
              • backward is a local VJP rule producing parent gradient contributions.
              • name : Option String

                Optional node name for debugging/pretty-printing.

              • value : AnyBuffer

                Forward value computed at this node.

              • requires_grad : Bool

                Whether reverse-mode propagation should visit this node.

              • parents : List Nat

                Parent node ids (dependencies) in the tape.

              • cleanup : List Buffer

                Forward scratch buffers retained only because this node's backward closure may need them.

                The eager runtime releases these buffers explicitly after backprop consumes the tape. This keeps long CUDA training loops from waiting on Lean external-object finalizers for large temporary allocations.

              • backward : AnyBuffer → Result (List (Nat × AnyBuffer))

                Local VJP rule for this node.

                Given an upstream cotangent for value, return a list of (parentId, parentCotangent) contributions (one per parent, usually).

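Putting the fields together, a node and a two-parent backward closure might look like this sketch; the types are placeholders (a plain `Array Float` stands in for a shaped CUDA buffer, and `Result` for the tape's error monad).

```lean
-- Sketch with placeholder types, mirroring the fields described above.
abbrev NodeId := Nat
abbrev Result := Except String

structure AnyBuffer where
  data : Array Float   -- stand-in for a shaped CUDA buffer

structure Node where
  name          : Option String
  value         : AnyBuffer
  requires_grad : Bool
  parents       : List NodeId
  cleanup       : List AnyBuffer   -- stand-in for retained scratch buffers
  backward      : AnyBuffer → Result (List (NodeId × AnyBuffer))

-- Example backward for a node with parents [aId, bId] computing a + b:
-- one (parentId, cotangent) contribution per parent, and the upstream
-- cotangent passes through unchanged to both.
def addBackward (aId bId : NodeId) :
    AnyBuffer → Result (List (NodeId × AnyBuffer)) :=
  fun dLdy => pure [(aId, dLdy), (bId, dLdy)]
```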

                CUDA autograd tape: a grow-only array of nodes. Node ids are array indices.

                • nodes : Array Node

                  Tape nodes in evaluation order (id = index).


                  Empty tape (no nodes).


                    Number of nodes stored in the tape.


                      Read a node by id (returns none if out of bounds).


                        Read just the stored forward value for a node id.


                          Append a node and return its id.

                          Invariant: the returned id is t.size, the pre-append size of the tape.

                            @[simp]
                            theorem Runtime.Autograd.Cuda.Tape.addNode_id (t : Tape) (node : Node) :
                            (t.addNode node).2 = t.size

                            addNode returns the current tape size as the fresh node id.

                            @[simp]
                            theorem Runtime.Autograd.Cuda.Tape.size_addNode (t : Tape) (node : Node) :
                            (t.addNode node).1.size = t.size + 1

                            Appending a node increases the tape size by one.
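Under the stated invariant, `addNode` can be pictured as an `Array.push` that returns the pre-push size; with that definition both `@[simp]` lemmas above follow immediately. This is a sketch with a placeholder `Node`, not the library's code.

```lean
-- Sketch: placeholder Node; Tape/addNode shaped to match the invariant.
structure Node where
  name : Option String := none   -- real Node has more fields

structure Tape where
  nodes : Array Node

def Tape.size (t : Tape) : Nat := t.nodes.size

-- Append a node; the fresh id is the pre-append size.
def Tape.addNode (t : Tape) (node : Node) : Tape × Nat :=
  ({ nodes := t.nodes.push node }, t.nodes.size)

theorem Tape.addNode_id (t : Tape) (node : Node) :
    (t.addNode node).2 = t.size := rfl

theorem Tape.size_addNode (t : Tape) (node : Node) :
    (t.addNode node).1.size = t.size + 1 := by
  simp [Tape.addNode, Tape.size]
```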

                            def Runtime.Autograd.Cuda.Tape.leaf (t : Tape) (value : AnyBuffer) (name : Option String := none) (requires_grad : Bool := true) :

                            Add a leaf node (no parents).

                            PyTorch comparison: a tensor that enters the graph as a leaf (e.g. input or parameter value).


                              Read a buffer value from a tape node id, requiring a specific runtime shape.

                              Fails if:

                              • the id is invalid, or
                              • the stored shape does not match s.

                                Require that an upstream gradient matches an expected runtime shape.

                                This is used inside backward closures to validate/cast the incoming cotangent.

                                 The error message must remain identical to the pre-refactor message at all call sites.

                                  def Runtime.Autograd.Cuda.Tape.unary (t : Tape) (opName : String) (xId : Nat) (σ τ : Spec.Shape) (forward : Buffer → Buffer) (backward : Buffer → Buffer → Buffer) :

                                  Generic constructor for unary ops.

                                  You provide:

                                  • forward : Buffer → Buffer
                                  • backward : Buffer → Buffer → Buffer (VJP; given input x and upstream dLdy, return dLdx)

                                  Shapes are explicit and checked dynamically.

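As an illustration, an elementwise square could supply the following forward/VJP pair. The buffer kernels below are CPU stand-ins for assumed device kernels, not library functions.

```lean
abbrev Buffer := FloatArray

-- CPU stand-ins for assumed elementwise device kernels.
def mulBuf (x y : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (x.get! i * y.get! i)
  return out

def scaleBuf (c : Float) (x : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (c * x.get! i)
  return out

-- y = x * x, with VJP dLdx = 2 * x * dLdy.
def squareFwd : Buffer → Buffer := fun x => mulBuf x x
def squareBwd : Buffer → Buffer → Buffer :=
  fun x dLdy => scaleBuf 2.0 (mulBuf x dLdy)

-- Recording it would then look roughly like:
--   t.unary "square" xId σ σ squareFwd squareBwd
```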
                                    def Runtime.Autograd.Cuda.Tape.binary (t : Tape) (opName : String) (aId bId : Nat) (σ₁ σ₂ τ : Spec.Shape) (forward : Buffer → Buffer → Buffer) (backward : Buffer → Buffer → Buffer → Buffer × Buffer) :

                                    Generic constructor for binary ops.

                                    You provide:

                                    • forward : Buffer → Buffer → Buffer
                                    • backward : Buffer → Buffer → Buffer → (Buffer × Buffer) (VJP; given inputs a, b, and upstream dLdy, return (dLda, dLdb))

                                    Shapes are explicit and checked dynamically.

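For example, elementwise multiplication could supply this forward/VJP pair; `mulBuf` is a CPU stand-in for an assumed device kernel.

```lean
abbrev Buffer := FloatArray

-- CPU stand-in for an assumed elementwise-product device kernel.
def mulBuf (x y : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (x.get! i * y.get! i)
  return out

-- y = a * b, with VJP (dLda, dLdb) = (b * dLdy, a * dLdy).
def mulFwd : Buffer → Buffer → Buffer := fun a b => mulBuf a b
def mulBwd : Buffer → Buffer → Buffer → Buffer × Buffer :=
  fun a b dLdy => (mulBuf b dLdy, mulBuf a dLdy)

-- Recording it would then look roughly like:
--   t.binary "mul" aId bId σ σ σ mulFwd mulBwd
```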

                                      Internal helper: add a gradient contribution g into the dense gradient array at id.

                                      This checks:

                                      • id is a valid node id,
                                      • the parent requires gradients,
                                       • the contribution shape matches the parent's value shape,

                                       and then accumulates via AnyBuffer.add.

                                       When the parent does not require gradients, we still explicitly release the native contribution buffer. The small branch below makes the returned flag observable to Lean, which prevents the FFI release call from being optimized away as an unused pure value.


                                        One reverse-mode backprop step at a single node id, updating the dense gradient array.

                                        The incoming array is total: every tape node has a gradient buffer, initialized to zero unless a later node has already contributed to it. That total representation is convenient for the CUDA runtime because every slot has a concrete device buffer that can be released deterministically.


                                          Reverse-mode accumulation over the first n nodes in reverse order.

                                          The recursion visits n-1, n-2, ..., 0; using n = t.size runs the full tape. We keep this as a structural loop rather than a list fold so proof-facing callers can reason about one node step at a time.

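The structural loop can be sketched as follows; `step` is a parameter standing in for the single-node backward step, so only the recursion shape is shown.

```lean
abbrev Result := Except String
abbrev Grads  := Array (Array Float)  -- stand-in for the dense gradient array

-- `step n grads` stands for the single-node backward step at id n.
def backpropAux (step : Nat → Grads → Result Grads) :
    Nat → Grads → Result Grads
  | 0,     grads => pure grads
  | n + 1, grads => do
    let grads ← step n grads     -- visit node id n ...
    backpropAux step n grads     -- ... then n-1, down to 0
```

Calling `backpropAux step t.size grads` visits node ids `t.size - 1, ..., 0`, matching the description above, and the structural recursion on the `Nat` argument keeps each node step a separate equation for proof-facing callers.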

                                            Reverse-mode accumulation starting from an explicit dense gradient array.

                                            This expects grads0.size = t.size. The function is useful for callers that already seeded multiple outputs or want to run a custom cotangent initialization.


                                              Reverse-mode accumulation that returns a dense gradient buffer for every node id.

                                              All gradients are initialized to zeros (using each node's runtime Shape), then we seed the output node with seed and traverse the tape in reverse order.

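The zero-init, seed, and reverse-sweep sequence can be sketched as a small driver; node buffers are modeled as `Array Float`, sizes stand in for runtime shapes, and `step` again abstracts the single-node backward step.

```lean
abbrev Result := Except String
abbrev Grads  := Array (Array Float)  -- one dense gradient slot per node id

def zeros (n : Nat) : Array Float := Array.mkArray n 0.0

-- Reverse traversal over node ids n-1, ..., 0 (loop shape only).
def runRev (step : Nat → Grads → Result Grads) : Nat → Grads → Result Grads
  | 0,     g => pure g
  | n + 1, g => do runRev step n (← step n g)

-- Zero-initialize every slot from each node's element count, seed the
-- output id with the upstream cotangent, then sweep the whole tape.
def backprop (nodeSizes : Array Nat) (outId : Nat) (seed : Array Float)
    (step : Nat → Grads → Result Grads) : Result Grads :=
  let grads := (nodeSizes.map zeros).set! outId seed
  runRev step nodeSizes.size grads
```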