TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Tape

CUDA Autograd Tape

Shape-erased CUDA tape machinery for float32 Cuda.Buffer values. This is the GPU/runtime analogue of the CPU autograd tape: it records node ids, runtime shapes, parent links, and backward callbacks, then performs dense reverse-mode accumulation over buffers.

@[reducible, inline]

Pure error monad for the CUDA tape. Mirrors Engine/Core.


    Runtime, shape-erased CUDA buffer.

    This plays the same role as Runtime.AnyTensor in the CPU tape: it pairs runtime Shape metadata with an opaque Cuda.Buffer handle.

    • s : Shape

      Runtime shape metadata for the buffer.

    • buf : Buffer

      Device buffer (float32). Length is expected to be Shape.size s.

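The pairing can be pictured with the following sketch. The `Shape` representation, the field name `s`, and modeling `Buffer` as a `FloatArray` are illustrative assumptions, not the library's actual definitions.

```lean
-- Illustrative sketch only; these are not the library's definitions.
structure Shape where
  dims : List Nat

-- Number of scalar elements described by a shape.
def Shape.size (s : Shape) : Nat := s.dims.foldl (· * ·) 1

-- Stand-in for the opaque CUDA device handle.
abbrev Buffer := FloatArray

structure AnyBuffer where
  s   : Shape   -- runtime shape metadata
  buf : Buffer  -- float32 device data; expected length is s.size
```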

      Checked conversion of a Nat length to UInt32, erroring on overflow.


        Number of scalar elements as a UInt32 (checked).


          Allocate a zero-filled buffer for a given shape.


            Accumulate two AnyBuffer values by elementwise addition, with a dynamic shape check.

            This is used by backprop to sum gradient contributions in DAGs.

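The check-then-accumulate behavior might look like the following sketch, using `Except String` as a stand-in for the tape's error monad and a CPU loop in place of the CUDA kernel; all names here are placeholders.

```lean
-- Sketch only: Shape/Buffer/AnyBuffer and the add kernel are stand-ins.
structure Shape where
  dims : List Nat
deriving DecidableEq

abbrev Buffer := FloatArray

structure AnyBuffer where
  s   : Shape
  buf : Buffer

-- CPU stand-in for the elementwise-add device kernel.
def addRaw (x y : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (x.get! i + y.get! i)
  return out

-- Dynamic shape check, then elementwise accumulation.
def AnyBuffer.add (a b : AnyBuffer) : Except String AnyBuffer := do
  if a.s ≠ b.s then
    throw "AnyBuffer.add: shape mismatch"
  pure { s := a.s, buf := addRaw a.buf b.buf }
```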

              CUDA tape node representing one recorded computation step.

              Fields mirror the CPU Engine/Core node:

              • value holds the forward buffer.
              • parents are tape node ids.
              • backward is a local VJP rule producing parent gradient contributions.
              • name : Option String

                Optional node name for debugging/pretty-printing.

              • value : AnyBuffer

                Forward value computed at this node.

              • requires_grad : Bool

                Whether reverse-mode propagation should visit this node.

              • parents : List Nat

                Parent node ids (dependencies) in the tape.

              • cleanup : List Buffer

                Forward scratch buffers retained only because this node's backward closure may need them.

                The eager runtime releases these buffers explicitly after backprop consumes the tape. This keeps long CUDA training loops from waiting on Lean external-object finalizers for large temporary allocations.

              • backward : AnyBuffer → Result (List (Nat × AnyBuffer))

                Local VJP rule for this node.

                Given an upstream cotangent for value, return a list of (parentId, parentCotangent) contributions (one per parent, usually).

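Putting the fields together, a node and a two-parent backward closure might look like this sketch; the types are placeholders (a plain `Array Float` stands in for a shaped CUDA buffer, and `Result` for the tape's error monad).

```lean
-- Sketch with placeholder types, mirroring the fields described above.
abbrev NodeId := Nat
abbrev Result := Except String

structure AnyBuffer where
  data : Array Float   -- stand-in for a shaped CUDA buffer

structure Node where
  name          : Option String
  value         : AnyBuffer
  requires_grad : Bool
  parents       : List NodeId
  cleanup       : List AnyBuffer   -- stand-in for retained scratch buffers
  backward      : AnyBuffer → Result (List (NodeId × AnyBuffer))

-- Example backward for a node with parents [aId, bId] computing a + b:
-- one (parentId, cotangent) contribution per parent, and the upstream
-- cotangent passes through unchanged to both.
def addBackward (aId bId : NodeId) :
    AnyBuffer → Result (List (NodeId × AnyBuffer)) :=
  fun dLdy => pure [(aId, dLdy), (bId, dLdy)]
```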

                CUDA autograd tape: a grow-only array of nodes. Node ids are array indices.

                • nodes : Array Node

                  Tape nodes in evaluation order (id = index).


                  Empty tape (no nodes).


                    Number of nodes stored in the tape.


                      Read a node by id (returns none if out of bounds).


                        Read just the stored forward value for a node id.


                          Append a node and return its id.

                          Invariant: the returned id is t.size, the pre-append size of the tape.

                            @[simp]
                            theorem Runtime.Autograd.Cuda.Tape.addNode_id (t : Tape) (node : Node) :
                            (t.addNode node).2 = t.size

                            addNode returns the current tape size as the fresh node id.

                            @[simp]
                            theorem Runtime.Autograd.Cuda.Tape.size_addNode (t : Tape) (node : Node) :
                            (t.addNode node).1.size = t.size + 1

                            Appending a node increases the tape size by one.
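Under the stated invariant, `addNode` can be pictured as an `Array.push` that returns the pre-push size; with that definition both `@[simp]` lemmas above follow immediately. This is a sketch with a placeholder `Node`, not the library's code.

```lean
-- Sketch: placeholder Node; Tape/addNode shaped to match the invariant.
structure Node where
  name : Option String := none   -- real Node has more fields

structure Tape where
  nodes : Array Node

def Tape.size (t : Tape) : Nat := t.nodes.size

-- Append a node; the fresh id is the pre-append size.
def Tape.addNode (t : Tape) (node : Node) : Tape × Nat :=
  ({ nodes := t.nodes.push node }, t.nodes.size)

theorem Tape.addNode_id (t : Tape) (node : Node) :
    (t.addNode node).2 = t.size := rfl

theorem Tape.size_addNode (t : Tape) (node : Node) :
    (t.addNode node).1.size = t.size + 1 := by
  simp [Tape.addNode, Tape.size]
```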

                            def Runtime.Autograd.Cuda.Tape.leaf (t : Tape) (value : AnyBuffer) (name : Option String := none) (requires_grad : Bool := true) :

                            Add a leaf node (no parents).

                            PyTorch comparison: a tensor that enters the graph as a leaf (e.g. input or parameter value).


                              Read a buffer value from a tape node id, requiring a specific runtime shape.

                              Fails if:

                              • the id is invalid, or
                              • the stored shape does not match s.

                                Require that an upstream gradient matches an expected runtime shape.

                                This is used inside backward closures to validate/cast the incoming cotangent.

                                 The error message must remain identical to the pre-refactor message at all call sites.

                                  def Runtime.Autograd.Cuda.Tape.unary (t : Tape) (opName : String) (xId : Nat) (σ τ : Spec.Shape) (forward : Buffer → Buffer) (backward : Buffer → Buffer → Buffer) :

                                  Generic constructor for unary ops.

                                  You provide:

                                  • forward : Buffer → Buffer
                                  • backward : Buffer → Buffer → Buffer (VJP; given input x and upstream dLdy, return dLdx)

                                  Shapes are explicit and checked dynamically.

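As an illustration, an elementwise square could supply the following forward/VJP pair. The buffer kernels below are CPU stand-ins for assumed device kernels, not library functions.

```lean
abbrev Buffer := FloatArray

-- CPU stand-ins for assumed elementwise device kernels.
def mulBuf (x y : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (x.get! i * y.get! i)
  return out

def scaleBuf (c : Float) (x : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (c * x.get! i)
  return out

-- y = x * x, with VJP dLdx = 2 * x * dLdy.
def squareFwd : Buffer → Buffer := fun x => mulBuf x x
def squareBwd : Buffer → Buffer → Buffer :=
  fun x dLdy => scaleBuf 2.0 (mulBuf x dLdy)

-- Recording it would then look roughly like:
--   t.unary "square" xId σ σ squareFwd squareBwd
```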
                                    def Runtime.Autograd.Cuda.Tape.binary (t : Tape) (opName : String) (aId bId : Nat) (σ₁ σ₂ τ : Spec.Shape) (forward : Buffer → Buffer → Buffer) (backward : Buffer → Buffer → Buffer → Buffer × Buffer) :

                                    Generic constructor for binary ops.

                                    You provide:

                                    • forward : Buffer → Buffer → Buffer
                                    • backward : Buffer → Buffer → Buffer → (Buffer × Buffer) (VJP; given inputs a, b, and upstream dLdy, return (dLda, dLdb))

                                    Shapes are explicit and checked dynamically.

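For example, elementwise multiplication could supply this forward/VJP pair; `mulBuf` is a CPU stand-in for an assumed device kernel.

```lean
abbrev Buffer := FloatArray

-- CPU stand-in for an assumed elementwise-product device kernel.
def mulBuf (x y : Buffer) : Buffer := Id.run do
  let mut out := x
  for i in [0:x.size] do
    out := out.set! i (x.get! i * y.get! i)
  return out

-- y = a * b, with VJP (dLda, dLdb) = (b * dLdy, a * dLdy).
def mulFwd : Buffer → Buffer → Buffer := fun a b => mulBuf a b
def mulBwd : Buffer → Buffer → Buffer → Buffer × Buffer :=
  fun a b dLdy => (mulBuf b dLdy, mulBuf a dLdy)

-- Recording it would then look roughly like:
--   t.binary "mul" aId bId σ σ σ mulFwd mulBwd
```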

                                      Internal helper: add a gradient contribution g into the dense gradient array at id.

                                      This checks:

                                      • id is a valid node id,
                                      • the parent requires gradients,
                                       • the contribution shape matches the parent's value shape,

                                       and then accumulates via AnyBuffer.add.

                                       When the parent does not require gradients, we still explicitly release the native contribution buffer. The small branch below makes the returned flag observable to Lean, which prevents the FFI release call from being optimized away as an unused pure value.


                                        One reverse-mode backprop step at a single node id, updating the dense gradient array.

                                        The incoming array is total: every tape node has a gradient buffer, initialized to zero unless a later node has already contributed to it. That total representation is convenient for the CUDA runtime because every slot has a concrete device buffer that can be released deterministically.


                                          Reverse-mode accumulation over the first n nodes in reverse order.

                                          The recursion visits n-1, n-2, ..., 0; using n = t.size runs the full tape. We keep this as a structural loop rather than a list fold so proof-facing callers can reason about one node step at a time.

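The structural loop can be sketched as follows; `step` is a parameter standing in for the single-node backward step, so only the recursion shape is shown.

```lean
abbrev Result := Except String
abbrev Grads  := Array (Array Float)  -- stand-in for the dense gradient array

-- `step n grads` stands for the single-node backward step at id n.
def backpropAux (step : Nat → Grads → Result Grads) :
    Nat → Grads → Result Grads
  | 0,     grads => pure grads
  | n + 1, grads => do
    let grads ← step n grads     -- visit node id n ...
    backpropAux step n grads     -- ... then n-1, down to 0
```

Calling `backpropAux step t.size grads` visits node ids `t.size - 1, ..., 0`, matching the description above, and the structural recursion on the `Nat` argument keeps each node step a separate equation for proof-facing callers.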

                                            Reverse-mode accumulation starting from an explicit dense gradient array.

                                            This expects grads0.size = t.size. The function is useful for callers that already seeded multiple outputs or want to run a custom cotangent initialization.


                                              Reverse-mode accumulation that returns a dense gradient buffer for every node id.

                                              All gradients are initialized to zeros (using each node's runtime Shape), then we seed the output node with seed and traverse the tape in reverse order.

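The zero-init, seed, and reverse-sweep sequence can be sketched as a small driver; node buffers are modeled as `Array Float`, sizes stand in for runtime shapes, and `step` again abstracts the single-node backward step.

```lean
abbrev Result := Except String
abbrev Grads  := Array (Array Float)  -- one dense gradient slot per node id

def zeros (n : Nat) : Array Float := Array.mkArray n 0.0

-- Reverse traversal over node ids n-1, ..., 0 (loop shape only).
def runRev (step : Nat → Grads → Result Grads) : Nat → Grads → Result Grads
  | 0,     g => pure g
  | n + 1, g => do runRev step n (← step n g)

-- Zero-initialize every slot from each node's element count, seed the
-- output id with the upstream cotangent, then sweep the whole tape.
def backprop (nodeSizes : Array Nat) (outId : Nat) (seed : Array Float)
    (step : Nat → Grads → Result Grads) : Result Grads :=
  let grads := (nodeSizes.map zeros).set! outId seed
  runRev step nodeSizes.size grads
```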