Backward Passes and Optimizers
Gradient extraction and optimizer updates for eager sessions, including the CUDA paths that keep parameter mirrors and moment buffers on device.
Run reverse-mode backprop on the CUDA tape, returning device gradients for all tape entries.
This is the CUDA analogue of backwardDenseAll, except that gradients are not downloaded to the
host. Keeping them on device is primarily useful for implementing GPU-native optimizer steps.
Run CUDA backward from a scalar loss with seed 1, returning device gradient buffers.
Accumulate one CUDA gradient contribution into a sparse map.
Ownership rule: the contribution buffer g is consumed by this function. When a contribution is
first inserted into the map, we store an owned copy and release the incoming buffer. That extra copy
is intentional: CUDA backward rules are allowed to return a fresh buffer, but view-like rules may
also pass through an upstream buffer. Copy-on-insert keeps this sparse accumulator correct for every
op without requiring every local VJP to expose aliasing metadata. When a second contribution arrives,
we sum into a fresh buffer and release both inputs.
This rule is what lets sparse CUDA backprop avoid the dense "one zero buffer per tape node" representation without leaking transient gradients across long training loops.
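The ownership rule above can be sketched in a few lines. This is a minimal illustration in Python, not the library's implementation: `Buf`, `accumulate`, and the integer node ids are hypothetical stand-ins for a device buffer, the accumulator, and tape ids, and "release" is modelled as a flag rather than a real device free.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Buf:
    """Hypothetical stand-in for a device buffer with explicit release."""
    data: List[float]
    released: bool = False

    def copy(self) -> "Buf":
        return Buf(list(self.data))

    def release(self) -> None:
        self.released = True


def accumulate(grads: Dict[int, Buf], node_id: int, g: Buf) -> None:
    """Add contribution g for node_id. g is consumed on every path."""
    prev = grads.get(node_id)
    if prev is None:
        # First contribution: store an owned copy, then release the input,
        # so the map never aliases a buffer that a view-like rule passed through.
        grads[node_id] = g.copy()
        g.release()
    else:
        # Later contribution: sum into a fresh buffer and release both inputs.
        total = Buf([a + b for a, b in zip(prev.data, g.data)])
        prev.release()
        g.release()
        grads[node_id] = total


grads: Dict[int, Buf] = {}
g1, g2 = Buf([1.0, 2.0]), Buf([0.5, 0.5])
accumulate(grads, 7, g1)
accumulate(grads, 7, g2)
```

After both calls the map holds a single live buffer for node 7, and every incoming or intermediate buffer has been released. Only nodes that actually receive a contribution ever get an entry.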
Run scalar-loss CUDA backprop and return gradients only for trainable parameter leaves.
The returned map stays on device so CUDA optimizers can update parameters without downloading dense gradient arrays to the host.
Run reverse-mode backprop and return a dense gradient array for all tape entries.
seed is the upstream gradient for out (like PyTorch's backward(gradient=...)).
Run backward from a scalar loss with seed 1.
PyTorch comparison: loss.backward() for a scalar loss.
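To make the role of the seed concrete, here is a small Python sketch of a dense reverse pass over a toy tape. It is an illustration under simplifying assumptions, not the library's code: every value is a scalar, the tape is a plain list in topological order, and `backward_dense` and the `(parents, vjp)` entry layout are hypothetical.

```python
from typing import Callable, List, Tuple

# Each tape entry is (parent indices, vjp). The vjp maps the upstream
# gradient of the entry to one gradient per parent. Leaves have no parents.
Entry = Tuple[List[int], Callable[[float], List[float]]]


def backward_dense(tape: List[Entry], out: int, seed: float) -> List[float]:
    """Return one gradient slot per tape entry, starting from seed at out."""
    grads = [0.0] * len(tape)
    grads[out] = seed
    # Walk the tape in reverse topological order, pushing gradients to parents.
    for i in range(out, -1, -1):
        parents, vjp = tape[i]
        for p, gp in zip(parents, vjp(grads[i])):
            grads[p] += gp
    return grads


# y = x * w with x = 3 and w = 2
x, w = 3.0, 2.0
tape: List[Entry] = [
    ([], lambda g: []),                  # 0: leaf x
    ([], lambda g: []),                  # 1: leaf w
    ([0, 1], lambda g: [g * w, g * x]),  # 2: y = x * w
]
dense = backward_dense(tape, out=2, seed=1.0)
```

With seed 1 the result is the ordinary gradient of a scalar loss; any other seed scales every entry linearly. In this toy layout, the gradient for a particular entry is read back by indexing the dense array with that entry's position.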
Extract the gradient for a particular TensorRef from a dense gradient array.
Apply an SGD update to all parameters recorded via use.
PyTorch comparison: for p in params: p.data -= lr * p.grad.
Apply an SGD update to all parameters recorded via use, using CUDA device gradients.
This avoids downloading the full dense gradient array and keeps updated parameters in each
Param's CUDA mirror. Host tensors are synchronized later by explicit parameter readback.
Apply SGD from a sparse CUDA gradient map.
This is the path used by the CUDA trainer. It updates only parameter leaves and avoids allocating zero gradients for every forward activation in the tape.
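The difference from the dense path can be shown with a host-side Python sketch. The function name `sgd_step_sparse` and the use of plain dictionaries keyed by leaf id are assumptions made for illustration; the real path operates on device buffers.

```python
from typing import Dict, List


def sgd_step_sparse(params: Dict[int, List[float]],
                    grads: Dict[int, List[float]],
                    lr: float) -> None:
    """Update in place only the parameter leaves that have a gradient entry."""
    for leaf_id, g in grads.items():
        p = params.get(leaf_id)
        if p is None:
            continue  # gradient for a node that is not a trainable leaf
        for i, gi in enumerate(g):
            p[i] -= lr * gi


params = {0: [1.0, 2.0], 3: [0.5]}
grads = {0: [0.5, -1.0]}  # leaf 3 received no gradient this step
sgd_step_sparse(params, grads, lr=0.5)
```

Leaf 3 is left untouched and no zero gradient is ever materialized for it, which is the property the sparse path relies on.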
Device-side Adam moment buffers for one parameter leaf.
- m : Cuda.Buffer
First moment buffer.
- v : Cuda.Buffer
Second moment buffer.
- t : ℕ
Adam step counter for this parameter.
Adam moment state keyed by parameter leaf id.
Apply an Adam update to all parameters recorded via use, using CUDA device gradients.
This is the CUDA analogue of the generic TorchLean.Optim.adam path. It keeps Adam moments as
device buffers and keeps updated parameters in each Param's CUDA mirror, so the next CUDA forward
can reuse them without a host upload. Host tensors are synchronized later by explicit readback.
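For orientation, the following Python sketch shows one Adam step for a single parameter leaf using the textbook formulation with bias correction. The state layout (m, v, t) follows the moment structure described above; the hyperparameter defaults, the bias-correction form, and the names `AdamState` and `adam_step` are assumptions, since this page does not spell out the exact arithmetic.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class AdamState:
    m: List[float]  # first moment
    v: List[float]  # second moment
    t: int = 0      # step counter for this parameter


def adam_step(p: List[float], g: List[float], s: AdamState,
              lr: float = 1e-3, beta1: float = 0.9,
              beta2: float = 0.999, eps: float = 1e-8) -> None:
    """Textbook Adam update of p in place, advancing the state s."""
    s.t += 1
    bc1 = 1 - beta1 ** s.t  # bias correction for the first moment
    bc2 = 1 - beta2 ** s.t  # bias correction for the second moment
    for i, gi in enumerate(g):
        s.m[i] = beta1 * s.m[i] + (1 - beta1) * gi
        s.v[i] = beta2 * s.v[i] + (1 - beta2) * gi * gi
        m_hat = s.m[i] / bc1
        v_hat = s.v[i] / bc2
        p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


p = [1.0]
state = AdamState(m=[0.0], v=[0.0])
adam_step(p, [2.0], state, lr=0.1)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by roughly lr in the direction opposite to the gradient's sign.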
Apply Adam using an already-computed sparse CUDA gradient map.
Apply an AdamW update to all parameters recorded via use, using CUDA device gradients.
This mirrors Optim.AdamW.update: moments are formed from the raw gradient, weight decay is applied
directly to parameters, then the Adam update is applied. Like adamStepAllCuda, it keeps updated
parameter buffers resident on device and only synchronizes the host copy when readback is requested.
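The ordering described above (moments from the raw gradient, then decay on the parameters, then the Adam update) can be sketched in Python as follows. This is a hedged illustration: `adamw_step` is a hypothetical name, the hyperparameter defaults are placeholders, and scaling the decay by the learning rate follows the common PyTorch-style convention, which this page does not confirm for Optim.AdamW.update.

```python
import math
from typing import List


def adamw_step(p: List[float], g: List[float],
               m: List[float], v: List[float], t: int,
               lr: float, wd: float,
               beta1: float = 0.9, beta2: float = 0.999,
               eps: float = 1e-8) -> None:
    """One decoupled-weight-decay Adam step at step count t (t >= 1)."""
    bc1 = 1 - beta1 ** t
    bc2 = 1 - beta2 ** t
    for i, gi in enumerate(g):
        # 1. Moments use the raw gradient; no wd * p term is folded in.
        m[i] = beta1 * m[i] + (1 - beta1) * gi
        v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
        # 2. Weight decay acts directly on the parameter (assumed lr-scaled).
        p[i] -= lr * wd * p[i]
        # 3. Adam update from the bias-corrected moments.
        p[i] -= lr * (m[i] / bc1) / (math.sqrt(v[i] / bc2) + eps)


p, m, v = [1.0], [0.0], [0.0]
adamw_step(p, [0.0], m, v, t=1, lr=0.5, wd=0.1)
```

With a zero gradient the moments stay at zero and the parameter still shrinks, which is the observable difference from adding an L2 penalty to the gradient before the moments are formed.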
Apply AdamW from a sparse CUDA gradient map.
The dense-array AdamW function remains available for callers that explicitly ask for all tape gradients. Normal training should use this sparse map instead, so that activation gradients can be released as soon as their contributions have been propagated.