CUDA Autograd Tape
Shape-erased CUDA tape machinery for float32 Cuda.Buffer values. This is the GPU/runtime analogue
of the CPU autograd tape: it records node ids, runtime shapes, parent links, and backward callbacks,
then performs dense reverse-mode accumulation over buffers.
Pure error monad for the CUDA tape. Mirrors Engine/Core.
Runtime, shape-erased CUDA buffer.
This plays the same role as Runtime.AnyTensor in the CPU tape: it pairs runtime Shape
metadata with an opaque Cuda.Buffer handle.
- s : Spec.Shape
Runtime shape metadata for the buffer.
- buf : Buffer
The opaque CUDA buffer handle.
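Since the declaration itself is not rendered on this page, here is a minimal Lean sketch of the pairing. `Shape` and the host-array `Buffer` are hypothetical stand-ins for `Spec.Shape` and the FFI-backed `Cuda.Buffer`:

```lean
-- Hypothetical stand-ins: the real module pairs `Spec.Shape` metadata with an
-- opaque FFI `Cuda.Buffer`; a host array mocks the device handle here.
abbrev Shape := List Nat

structure Buffer where
  data : Array Float

/-- Shape-erased CUDA buffer: runtime shape metadata plus a buffer handle. -/
structure AnyBuffer where
  s   : Shape
  buf : Buffer
```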
Number of scalar elements as a UInt32 (checked).
Allocate a zero-filled buffer for a given shape.
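A sketch of how these two helpers could fit together, assuming the pure error monad is `Except String` and reusing the stand-ins above. The helper names `numelU32` and `AnyBuffer.zeros` are assumptions; the real version allocates device memory over FFI rather than a host array:

```lean
abbrev Err := Except String

/-- Scalar element count as a checked `UInt32` (sketch). -/
def numelU32 (s : Shape) : Err UInt32 := do
  let n := s.foldl (· * ·) 1
  if n < UInt32.size then
    pure n.toUInt32
  else
    throw s!"numel {n} does not fit in UInt32"

/-- Zero-filled buffer for a given shape (sketch; stands in for a device memset). -/
def AnyBuffer.zeros (s : Shape) : Err AnyBuffer := do
  let n ← numelU32 s
  pure { s := s, buf := { data := Array.mkArray n.toNat 0.0 } }
```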
CUDA tape node representing one recorded computation step.
Fields mirror the CPU Engine/Core node:
- `value` holds the forward buffer.
- `parents` are tape node ids.
- `backward` is a local VJP rule producing parent gradient contributions.
Optional node name for debugging/pretty-printing.
- value : AnyBuffer
Forward value computed at this node.
- requires_grad : Bool
Whether reverse-mode propagation should visit this node.
Parent node ids (dependencies) in the tape.
Forward scratch buffers retained only because this node's backward closure may need them.
The eager runtime releases these buffers explicitly after backprop consumes the tape. This keeps long CUDA training loops from waiting on Lean external-object finalizers for large temporary allocations.
Local VJP rule for this node.
Given an upstream cotangent for `value`, return a list of `(parentId, parentCotangent)` contributions (one per parent, usually).
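Reusing the stand-ins from the sketches above, the node record plausibly looks like this. Field names follow the docstrings; the `retained` field name and the exact `backward` signature are assumptions:

```lean
/-- One recorded computation step (sketch). -/
structure Node where
  /-- Optional debugging label. -/
  name          : Option String := none
  /-- Forward value computed at this node. -/
  value         : AnyBuffer
  /-- Whether reverse mode should visit this node. -/
  requires_grad : Bool := true
  /-- Parent node ids in the tape. -/
  parents       : Array Nat := #[]
  /-- Scratch buffers kept alive for `backward` (hypothetical name). -/
  retained      : Array AnyBuffer := #[]
  /-- Local VJP: upstream cotangent to a (parentId, parentCotangent) list. -/
  backward      : AnyBuffer → Err (List (Nat × AnyBuffer))
```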
CUDA autograd tape: a grow-only array of nodes. Node ids are array indices.
Tape nodes in evaluation order (id = index).
Number of nodes stored in the tape.
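A sketch of the tape and its size accessor, plus the recording helper implied by "node ids are array indices" (the `push` name is an assumption):

```lean
/-- Grow-only tape; a node's id is its index in `nodes` (sketch). -/
structure Tape where
  nodes : Array Node := #[]

/-- Number of nodes stored in the tape (sketch). -/
def Tape.size (t : Tape) : Nat :=
  t.nodes.size

/-- Record a node and return its id (hypothetical helper). -/
def Tape.push (t : Tape) (n : Node) : Tape × Nat :=
  ({ nodes := t.nodes.push n }, t.nodes.size)
```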
Read a buffer value from a tape node id, requiring a specific runtime shape.
Fails if:
- the id is invalid, or
- the stored shape does not match `s`.
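A sketch of the checked read, covering both failure cases from the list above (function name and error text are assumptions):

```lean
/-- Read the buffer at `id`, requiring runtime shape `s` (sketch). -/
def Tape.getBuffer (t : Tape) (id : Nat) (s : Shape) : Err AnyBuffer := do
  let some node := t.nodes[id]? | throw s!"invalid node id {id}"
  if node.value.s = s then
    pure node.value
  else
    throw s!"shape mismatch at node {id}: expected {s}, got {node.value.s}"
```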
Require that an upstream gradient matches an expected runtime shape.
This is used inside backward closures to validate/cast the incoming cotangent.
The error message must remain identical to the pre-refactor message at all call sites.
Generic constructor for unary ops.
You provide:
- `forward : Buffer → Buffer`
- `backward : Buffer → Buffer → Buffer` (VJP; given input `x` and upstream `dLdy`, return `dLdx`)
Shapes are explicit and checked dynamically.
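One plausible wiring for the unary constructor, reusing the sketches above. The constructor name, the gradient shape check, and the error text are assumptions:

```lean
/-- Record a unary op: run `forward` now, save `x` for the VJP (sketch). -/
def Tape.unaryOp (t : Tape) (xId : Nat) (xShape yShape : Shape)
    (forward : Buffer → Buffer)
    (backward : Buffer → Buffer → Buffer) : Err (Tape × Nat) := do
  let x ← t.getBuffer xId xShape
  let node : Node :=
    { value   := { s := yShape, buf := forward x.buf }
      parents := #[xId]
      backward := fun dLdy =>
        if dLdy.s = yShape then
          pure [(xId, { s := xShape, buf := backward x.buf dLdy.buf })]
        else
          throw s!"gradient shape mismatch: expected {yShape}, got {dLdy.s}" }
  pure (t.push node)
```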
Generic constructor for binary ops.
You provide:
- `forward : Buffer → Buffer → Buffer`
- `backward : Buffer → Buffer → Buffer → (Buffer × Buffer)` (VJP; given inputs `a`, `b`, and upstream `dLdy`, return `(dLda, dLdb)`)
Shapes are explicit and checked dynamically.
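The binary case mirrors the unary one; below is a sketch plus a usage example recording an elementwise product with its VJP. `Buffer.mul` and `Tape.mulOp` are hypothetical names, and the host-array multiply stands in for a CUDA kernel:

```lean
/-- Record a binary op with a paired VJP (sketch). -/
def Tape.binaryOp (t : Tape) (aId bId : Nat) (aS bS yS : Shape)
    (forward : Buffer → Buffer → Buffer)
    (backward : Buffer → Buffer → Buffer → Buffer × Buffer) :
    Err (Tape × Nat) := do
  let a ← t.getBuffer aId aS
  let b ← t.getBuffer bId bS
  let node : Node :=
    { value   := { s := yS, buf := forward a.buf b.buf }
      parents := #[aId, bId]
      backward := fun dLdy => do
        let (dLda, dLdb) := backward a.buf b.buf dLdy.buf
        pure [(aId, { s := aS, buf := dLda }), (bId, { s := bS, buf := dLdb })] }
  pure (t.push node)

/-- Elementwise product mock (hypothetical; the real op is a CUDA kernel). -/
def Buffer.mul (a b : Buffer) : Buffer :=
  { data := (a.data.zip b.data).map fun (x, y) => x * y }

/-- z = a * b with VJP dL/da = dLdz * b, dL/db = dLdz * a. -/
def Tape.mulOp (t : Tape) (aId bId : Nat) (s : Shape) : Err (Tape × Nat) :=
  t.binaryOp aId bId s s s Buffer.mul
    (fun a b dLdz => (dLdz.mul b, dLdz.mul a))
```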
Internal helper: add a gradient contribution `g` into the dense gradient array at `id`.
This checks that:
- `id` is a valid node id,
- the parent requires gradients,
- the contribution shape matches the parent's value shape,
then accumulates via `AnyBuffer.add`.
When the parent does not require gradients, we still explicitly release the native contribution buffer. The small branch below makes the returned flag observable to Lean, which prevents the FFI release call from being optimized away as an unused pure value.
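A sketch of the helper under the same stand-ins. `Buffer.release` models the FFI free returning a success flag, and the branch keeps that flag observable; the host-array add stands in for `AnyBuffer.add`:

```lean
/-- FFI stand-in (hypothetical): free device memory, returning a flag. -/
def Buffer.release (_b : Buffer) : Bool := true

/-- Accumulate contribution `g` into `grads[id]` (sketch). -/
def accumulate (t : Tape) (grads : Array AnyBuffer) (id : Nat)
    (g : AnyBuffer) : Err (Array AnyBuffer) := do
  let some node := t.nodes[id]? | throw s!"invalid node id {id}"
  if !node.requires_grad then
    -- Branch on the returned flag so the release call is observable to Lean
    -- and cannot be dropped as an unused pure value.
    if g.buf.release then pure grads else pure grads
  else if g.s ≠ node.value.s then
    throw s!"contribution shape mismatch at node {id}"
  else
    pure <| grads.modify id fun acc =>
      { acc with buf := { data := (acc.buf.data.zip g.buf.data).map fun (x, y) => x + y } }
```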
One reverse-mode backprop step at a single node id, updating the dense gradient array.
The incoming array is total: every tape node has a gradient buffer, initialized to zero unless a later node has already contributed to it. That total representation is convenient for the CUDA runtime because every slot has a concrete device buffer that can be released deterministically.
Reverse-mode accumulation over the first `n` nodes in reverse order.
The recursion visits `n-1, n-2, ..., 0`; using `n = t.size` runs the full tape. We keep this as a structural loop rather than a list fold so proof-facing callers can reason about one node step at a time.
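The loop shape, as a sketch; `step` stands in for the single-node backprop step documented above:

```lean
/-- Visit ids n-1, n-2, ..., 0 by structural recursion on `n` (sketch). -/
def backpropAux (step : Nat → Array AnyBuffer → Err (Array AnyBuffer)) :
    Nat → Array AnyBuffer → Err (Array AnyBuffer)
  | 0,     grads => pure grads
  | n + 1, grads => do
      let grads ← step n grads
      backpropAux step n grads
```

Because each call peels off exactly one node id, unfolding the definition once corresponds to one backprop step, which is the property proof-facing callers rely on.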
Reverse-mode accumulation starting from an explicit dense gradient array.
This expects `grads0.size = t.size`. It is useful for callers that have already seeded multiple outputs or want to run a custom cotangent initialization.
Reverse-mode accumulation that returns a dense gradient buffer for every node id.
All gradients are initialized to zeros (using each node's runtime `Shape`); then we seed the output node with `seed` and traverse the tape in reverse order.
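Putting the pieces together, a sketch of the full dense pass: zero-initialize one gradient per node, seed the output id, then run the reverse loop. The `step` parameter abstracts the per-node step above; in the real module it is fixed rather than passed in:

```lean
/-- Dense reverse pass over the whole tape (sketch). -/
def Tape.backward (t : Tape) (outId : Nat) (seed : AnyBuffer)
    (step : Nat → Array AnyBuffer → Err (Array AnyBuffer)) :
    Err (Array AnyBuffer) := do
  let mut grads : Array AnyBuffer := #[]
  for node in t.nodes do
    grads := grads.push (← AnyBuffer.zeros node.value.s)
  backpropAux step t.size (grads.set! outId seed)
```

Seeding `outId` with a ones-buffer of the output's shape recovers ordinary scalar-loss backprop; the total array of buffers is what lets the runtime release every gradient slot deterministically afterwards.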