TorchLean API

NN.Runtime.Autograd.Torch.Core.BackwardOptim

Backward Passes and Optimizers

Gradient extraction and optimizer updates for eager sessions, including the CUDA paths that keep parameter mirrors and moment buffers on device.

Run reverse-mode backprop on the CUDA tape, returning device gradients for all tape entries.

This is the CUDA analogue of backwardDenseAll, except that gradients stay on the device instead of being downloaded to the host. It is primarily useful for implementing GPU-native optimizer steps.

    Run CUDA backward from a scalar loss with seed 1, returning device gradient buffers.

      Accumulate one CUDA gradient contribution into a sparse map.

      Ownership rule: the contribution buffer g is consumed by this function. When a contribution is first inserted into the map, we store an owned copy and release the incoming buffer. That extra copy is intentional: CUDA backward rules are allowed to return a fresh buffer, but view-like rules may also pass through an upstream buffer. Copy-on-insert keeps this sparse accumulator correct for every op without requiring every local VJP to expose aliasing metadata. When a second contribution arrives, we sum into a fresh buffer and release both inputs.

      This rule is what lets sparse CUDA backprop avoid the dense "one zero buffer per tape node" representation without leaking transient gradients across long training loops.
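The ownership rule above can be sketched in a few lines of Python. This is only an illustrative model, not TorchLean's implementation: the `Buffer` class, its `copy`/`release` methods, and the `accumulate` function are hypothetical stand-ins for device buffers and the accumulator described here.

```python
class Buffer:
    """Hypothetical stand-in for a device buffer; tracks live allocations."""
    live = 0  # number of buffers currently allocated and not yet released

    def __init__(self, data):
        self.data = list(data)
        self.released = False
        Buffer.live += 1

    def copy(self):
        return Buffer(self.data)

    def release(self):
        assert not self.released, "double release"
        self.released = True
        Buffer.live -= 1


def accumulate(grads, node_id, g):
    """Add contribution g for node_id into the sparse map, consuming g."""
    existing = grads.get(node_id)
    if existing is None:
        # First contribution: store an owned copy rather than g itself,
        # since g may be an upstream buffer passed through by a view-like rule.
        grads[node_id] = g.copy()
        g.release()
    else:
        # Later contribution: sum into a fresh buffer, then release both inputs.
        total = Buffer(a + b for a, b in zip(existing.data, g.data))
        existing.release()
        g.release()
        grads[node_id] = total


grads = {}
accumulate(grads, 7, Buffer([1.0, 2.0]))
accumulate(grads, 7, Buffer([0.5, 0.5]))
print(grads[7].data)  # [1.5, 2.5]
print(Buffer.live)    # 1
```

After two contributions only the accumulated buffer is still live, which is the property that prevents transient gradients from piling up over a long training loop.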

        Run scalar-loss CUDA backprop and return gradients only for trainable parameter leaves.

        The returned map stays on device so CUDA optimizers can update parameters without downloading dense gradient arrays to the host.

          Run reverse-mode backprop and return a dense gradient array for all tape entries.

          seed is the upstream gradient for out (like PyTorch's backward(gradient=...)).
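To make the role of seed concrete, here is a minimal scalar sketch of seeded reverse-mode backprop over a tiny tape. The tape encoding and the Python function `backward_dense_all` are assumptions made for illustration; they model the idea of a dense gradient array with one slot per tape entry and are not the library's data structures.

```python
def forward(x, w):
    # Tape entry i describes how node i was computed: (op, input node ids).
    tape = [("leaf", ()), ("leaf", ()), ("mul", (0, 1)), ("add", (2, 0))]
    vals = [x, w, x * w, x * w + x]
    return tape, vals


def backward_dense_all(tape, vals, out, seed):
    grads = [0.0] * len(tape)  # dense: one gradient slot per tape entry
    grads[out] = seed          # upstream gradient for the output node
    for i in range(out, -1, -1):
        op, ins = tape[i]
        if op == "mul":
            a, b = ins
            grads[a] += grads[i] * vals[b]
            grads[b] += grads[i] * vals[a]
        elif op == "add":
            a, b = ins
            grads[a] += grads[i]
            grads[b] += grads[i]
    return grads


tape, vals = forward(3.0, 2.0)  # y = x * w + x
g1 = backward_dense_all(tape, vals, out=3, seed=1.0)
g2 = backward_dense_all(tape, vals, out=3, seed=0.5)
print(g1)  # [3.0, 3.0, 1.0, 1.0]
print(g2)  # [1.5, 1.5, 0.5, 0.5]
```

Halving the seed halves every gradient in the array, which is what passing a non-unit upstream gradient means.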

            Run backward from a scalar loss with seed 1.

            PyTorch comparison: loss.backward() for a scalar loss.

              Extract the gradient for a particular TensorRef from a dense gradient array.

                Apply an SGD update to all parameters recorded via use.

                PyTorch comparison: for p in params: p.data -= lr * p.grad.

                  Apply an SGD update to all parameters recorded via use, using CUDA device gradients.

                  This avoids downloading the full dense gradient array and keeps updated parameters in each Param's CUDA mirror. Host tensors are synchronized later by explicit parameter readback.

                    Apply SGD from a sparse CUDA gradient map.

                    This is the path used by the CUDA trainer. It updates only parameter leaves and avoids allocating zero gradients for every forward activation in the tape.
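A rough host-side model of a sparse-map SGD step is shown below. The function name `sgd_step_sparse` and the use of plain dictionaries keyed by leaf id are assumptions for illustration; the real path operates on device buffers.

```python
def sgd_step_sparse(params, grads, lr):
    """params: leaf id -> list of floats, updated in place.
    grads:  leaf id -> list of floats; only leaves that received a gradient."""
    for leaf_id, g in grads.items():
        p = params.get(leaf_id)
        if p is None:
            continue  # not a trainable parameter leaf
        for i in range(len(p)):
            p[i] -= lr * g[i]


params = {0: [1.0, 2.0], 3: [0.5]}
grads = {0: [0.5, -1.0]}  # leaf 3 has no entry, so it is left untouched
sgd_step_sparse(params, grads, lr=0.5)
print(params)  # {0: [0.75, 2.5], 3: [0.5]}
```

Because the map holds entries only for parameter leaves, no zero-filled gradient is ever allocated for forward activations.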

                      Device-side Adam moment buffers for one parameter leaf.

                        @[reducible, inline]

                        Adam moment state keyed by parameter leaf id.

                          Apply an Adam update to all parameters recorded via use, using CUDA device gradients.

                          This is the CUDA analogue of the generic TorchLean.Optim.adam path. It keeps Adam moments as device buffers and keeps updated parameters in each Param's CUDA mirror, so the next CUDA forward can reuse them without a host upload. Host tensors are synchronized later by explicit readback.
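The update itself can be sketched on the host as follows. This uses the standard bias-corrected Adam formulation, which is assumed rather than confirmed to match TorchLean.Optim.adam; the function `adam_step`, its default hyperparameters, and the dictionary of moments keyed by leaf id are illustrative only.

```python
import math


def adam_step(params, grads, moments, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step. moments maps leaf id -> (m, v) and persists across steps.
    t is the 1-based step count used for bias correction."""
    for leaf_id, g in grads.items():
        p = params[leaf_id]
        # Moment buffers are created lazily, one pair per parameter leaf.
        m, v = moments.setdefault(leaf_id, ([0.0] * len(p), [0.0] * len(p)))
        for i in range(len(p)):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i]
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


adam_params = {0: [1.0]}
adam_moments = {}
adam_step(adam_params, {0: [2.0]}, adam_moments, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr. Keeping the moments in a map that outlives the step mirrors the idea of moment buffers that stay resident between updates.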

                            Apply Adam using an already-computed sparse CUDA gradient map.

                              Apply an AdamW update to all parameters recorded via use, using CUDA device gradients.

                              This mirrors Optim.AdamW.update: moments are formed from the raw gradient, weight decay is applied directly to parameters, then the Adam update is applied. Like adamStepAllCuda, it keeps updated parameter buffers resident on device and only synchronizes the host copy when readback is requested.
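The ordering described above (moments from the raw gradient, then decoupled weight decay, then the Adam update) can be sketched like this. The function `adamw_step` is hypothetical, and scaling the decay by lr follows the common PyTorch convention, which is an assumption about Optim.AdamW.update rather than something stated here.

```python
import math


def adamw_step(params, grads, moments, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step with decoupled weight decay (1-based step count t)."""
    for leaf_id, g in grads.items():
        p = params[leaf_id]
        m, v = moments.setdefault(leaf_id, ([0.0] * len(p), [0.0] * len(p)))
        for i in range(len(p)):
            # 1. Moments use the raw gradient; no decay term is folded into g.
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i]
            # 2. Weight decay is applied directly to the parameter.
            p[i] *= 1 - lr * weight_decay
            # 3. Bias-corrected Adam update.
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


adamw_params = {0: [1.0]}
adamw_moments = {}
adamw_step(adamw_params, {0: [2.0]}, adamw_moments, t=1,
           lr=0.1, weight_decay=0.1)
```

With these values the decay shrinks the parameter from 1.0 to 0.99 and the Adam update then subtracts about 0.1, while the first moment stays near 0.2 because it never sees the decay term.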

                                Apply AdamW from a sparse CUDA gradient map.

                                The dense-array AdamW function remains available for callers that explicitly ask for all tape gradients, but normal training should use this sparse map so activation gradients can be released as soon as their contributions have been propagated.
