Eager Session and CUDA Bridge #
Session state, CUDA upload/download conversions, parameter synchronization, and tape lifecycle helpers for the eager Torch-style runtime.
Internal Eager Session #
The eager backend keeps one mutable CPU tape, one mutable CUDA tape, the non-differentiable Nat
environment, and the map from tape leaves back to trainable parameters. The public session-style API
lives in Runtime.Autograd.TorchLean.Session; this module owns the lower-level state it delegates
to.
CUDA Bridge (Upload/Download) #
The CUDA eager tape stores float32 device buffers (Runtime.Autograd.Cuda.Buffer) paired with a
runtime Shape (Runtime.Autograd.Cuda.AnyBuffer).
The Torch eager front-end still presents the spec-level Tensor α s API, so in CUDA mode we need:
- upload:
Tensor α s->Cuda.AnyBuffer(float32, contiguous, row-major) - download:
Cuda.AnyBuffer->Runtime.AnyTensor α
The helper namespace gives CUDA bridge conversions stable call sites and a clear boundary.
Conversions required by the eager CUDA tape path.
- toAnyBuffer {s : Spec.Shape} : Spec.Tensor α s → IO Cuda.AnyBuffer
Upload a spec tensor to a CUDA
AnyBuffer(float32). - ofAnyBuffer : Cuda.AnyBuffer → IO (AnyTensor α)
Download a CUDA
AnyBufferto a runtimeAnyTensor(shape-erased). Convert a scalar constant to a host
Floatfor CUDA kernels (e.g.scale,axpy).
Instances
Float implementation #
Float CUDA conversions: upload/download via row-major FloatArray.
Generic CPU-preserving fallback for scalar types without a CUDA wire-format bridge.
Many TorchLean sessions are scalar-polymorphic on CPU, while the eager CUDA tape stores float32
buffers. This fallback keeps those CPU instantiations usable and fails loudly if a CUDA upload is
requested for a scalar type that has no declared float32 wire representation. Add a
higher-priority TensorConv α instance when a scalar type should be allowed onto the CUDA tape.
Shape helpers for CUDA kernels #
axisMap as an array.
Instances For
Synchronize a CUDA-updated parameter back to its host tensor, if needed.
This synchronization point is explicit. Training hot paths keep parameters resident on device; public readback APIs call this helper before exposing parameter tensors to the Lean side.
Instances For
Store/update the CUDA mirror of a parameter and mark the host tensor stale.
Instances For
Overwrite a host parameter value and invalidate any stale CUDA mirror.
Instances For
Internal eager session: a mutable runtime tape plus side tables.
This is the state needed to offer a PyTorch-like API where "tensors" are opaque references and ops
mutate a hidden tape stored in an IO.Ref.
Notes:
tapestores values and backward closures (Runtime.Autograd.Tape).paramsByLeafremembers which tape leaf ids correspond to trainable parameters (for SGD).natsstores non-differentiableNatinputs used for indexing-like operations.
- opts : Options
Session options controlling backend/device/kernel behavior.
CPU eager tape used when
opts.useGpu = false.CUDA eager tape used when
opts.useGpu = true.- paramsByLeaf : IO.Ref (Std.HashMap ℕ (AnyParam α))
Map from tape leaf ids to trainable parameter objects.
Non-differentiable integer inputs for dynamic indexing operations.
Instances For
Allocate a fresh eager session with an empty tape and empty side tables.
Instances For
Force-free a CUDA buffer allocation; the external finalizer is safe to call twice.
Instances For
Force-release a shape-erased CUDA buffer.
Instances For
CUDA optimizers only need gradients for parameter leaves, not for every intermediate tape node.
CudaGradMap is the sparse representation used by long eager CUDA runs: keys are tape node ids
for parameter leaves and values are device-resident cotangents with the same shape as that leaf.
Instances For
Release current CUDA tape values that are not persistent parameter mirrors.
Eager CUDA training creates ephemeral buffers for forward values and backward workspace. Reset paths
call this before discarding the current tape snapshot, while persistent parameter mirrors remain
owned by their Param objects.
Instances For
Release CUDA tape values after an optimizer step.
Unlike releaseCudaTapeNonParamValues, this may release trainable parameter leaf buffers too. In a
CUDA optimizer step, trainable parameters have already been written to fresh persistent mirrors, so
the leaf buffers from the just-consumed tape are stale. Non-trainable parameter leaves still are
their persistent mirrors, so we keep those cached.
Instances For
Release a dense CUDA gradient array after an optimizer has consumed it.
Instances For
Release a sparse CUDA gradient map after an optimizer has consumed it.
Instances For
Check that a shape-erased CUDA buffer has the number of elements promised by its shape.
Instances For
Make an owned copy of a CUDA buffer after checking its shape metadata.
Instances For
Ask the native allocator to return/free pages after a large CUDA eager step.
Instances For
Reset the tape and side tables.
PyTorch comparison: like starting a fresh forward pass where the autograd graph is discarded.
Instances For
Create a mutable parameter object (not yet on the tape).
To record this parameter on the session tape, call use, which reads the parameter and records it
as a leaf.
Instances For
Read back the concrete tensor value stored at a TensorRef.
This is a dynamic check: we ensure the id exists on the tape and the stored shape matches sh.
Instances For
Record a constant leaf (non-differentiable) on the tape.
PyTorch comparison: like constructing a tensor with requires_grad=False.
Instances For
Stop-gradient boundary.
Forward semantics: identity (detach(x) = x).
Backward semantics: no gradient flows to x.
Instances For
Deterministic U[0,1) tensor generator (seeded).
Instances For
Deterministic {0,1} mask generator (seeded) with a scalar keep-probability input.
Instances For
Use a parameter in the tape by recording its current value as a leaf.
The returned TensorRef is the handle you pass to ops. The leaf id is stored in paramsByLeaf so
optimizer steps (e.g. SGD) can update parameters after backward.
PyTorch comparison: like using a torch.nn.Parameter in a forward pass (it becomes a leaf in the
autograd graph).
Instances For
Record an external input tensor as a leaf on the tape.
PyTorch comparison: like introducing a tensor into the autograd graph with a chosen
requires_grad flag.
Instances For
Record a non-differentiable Nat input in the session environment.
This supports ops that depend on indices/labels that should not receive gradients.
Instances For
Read a previously recorded NatRef.
Instances For
Overwrite a previously recorded NatRef.
Instances For
Convert a Tensor Nat (.dim k .scalar) to an Array Nat.
Used to stage NatVecRef values into the session environment.
Instances For
Record a non-differentiable vector of Nats in the session environment.
Returns a NatVecRef k pointing to the stored block.
Instances For
Read back the vector stored at NatVecRef k.
Instances For
Overwrite the stored vector at NatVecRef k.