Backward Passes and Optimizers

Gradient extraction and optimizer updates for eager sessions, including the CUDA paths that keep parameter mirrors and moment buffers on device.

def Runtime.Autograd.Torch.Internal.EagerSession.backwardDenseAllCuda {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [Add α] [Zero α] [DecidableEq Spec.Shape] {sh : Spec.Shape} (out : TensorRef α sh) (seed : Spec.Tensor α sh) :

IO (Array Cuda.AnyBuffer)

Run reverse-mode backprop on the CUDA tape, returning device gradients for all tape entries.

This is the CUDA analogue of backwardDenseAll. Unlike that function, it leaves the gradients on the device instead of downloading them to the host, which is primarily useful for implementing GPU-native optimizer steps.

def Runtime.Autograd.Torch.Internal.EagerSession.backwardScalarDenseAllCuda {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [Add α] [Zero α] [One α] [DecidableEq Spec.Shape] (loss : TensorRef α Spec.Shape.scalar) :

IO (Array Cuda.AnyBuffer)

Run CUDA backward from a scalar loss with seed 1, returning device gradient buffers.

def Runtime.Autograd.Torch.Internal.EagerSession.addCudaGradToMap (t : Cuda.Tape) (gradsRef : IO.Ref CudaGradMap) (id : ℕ) (g : Cuda.AnyBuffer) :

IO Unit

Accumulate one CUDA gradient contribution into a sparse map.

Ownership rule: the contribution buffer g is consumed by this function. When a contribution is first inserted into the map, we store an owned copy and release the incoming buffer. That extra copy is intentional: CUDA backward rules are allowed to return a fresh buffer, but view-like rules may also pass through an upstream buffer. Copy-on-insert keeps this sparse accumulator correct for every op without requiring every local VJP to expose aliasing metadata. When a second contribution arrives, we sum into a fresh buffer and release both inputs.

This rule is what lets sparse CUDA backprop avoid the dense "one zero buffer per tape node" representation without leaking transient gradients across long training loops.
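The ownership rule can be illustrated with a host-side model. The sketch below is not the TorchLean implementation: `Buf`, `GradMap`, `addBuf`, `accumulate`, and `demo` are hypothetical names, and plain `Array Float` values stand in for device buffers. Lean arrays need no manual freeing, so the points where the real code would copy or release a buffer appear only as comments.

```lean
import Std.Data.HashMap

/-- Hypothetical stand-in for a device buffer. -/
abbrev Buf := Array Float

/-- Hypothetical stand-in for `CudaGradMap`: tape id to accumulated gradient. -/
abbrev GradMap := Std.HashMap Nat Buf

/-- Elementwise sum into a fresh array; assumes both inputs have the same length. -/
def addBuf (a b : Buf) : Buf :=
  (a.zip b).map fun (x, y) => x + y

/-- Accumulate one contribution `g` for tape entry `id`. -/
def accumulate (grads : IO.Ref GradMap) (id : Nat) (g : Buf) : IO Unit := do
  let m ← grads.get
  match m.get? id with
  | none =>
    -- First contribution: the map keeps its own copy.
    -- (Device code would copy `g` here and then release `g`.)
    grads.set (m.insert id g)
  | some prev =>
    -- Later contribution: sum into a fresh buffer.
    -- (Device code would release both `prev` and `g` afterwards.)
    grads.set (m.insert id (addBuf prev g))

def demo : IO Unit := do
  let grads ← IO.mkRef ({} : GradMap)
  accumulate grads 3 #[1.0, 2.0]   -- first contribution for id 3
  accumulate grads 3 #[0.5, 0.5]   -- summed with the stored entry
  let m ← grads.get
  IO.println (m.getD 3 #[])
```

Only ids that actually receive a contribution ever get an entry, which is the point of the sparse representation: no zero buffer is allocated for tape nodes that never see a gradient.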

def Runtime.Autograd.Torch.Internal.EagerSession.backwardScalarParamGradsCuda {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [One α] [DecidableEq Spec.Shape] (loss : TensorRef α Spec.Shape.scalar) :

IO CudaGradMap

Run scalar-loss CUDA backprop and return gradients only for trainable parameter leaves.

The returned map stays on device so CUDA optimizers can update parameters without downloading dense gradient arrays to the host.

def Runtime.Autograd.Torch.Internal.EagerSession.backwardDenseAll {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [Add α] [Zero α] [DecidableEq Spec.Shape] {sh : Spec.Shape} (out : TensorRef α sh) (seed : Spec.Tensor α sh) :

IO (Array (AnyTensor α))

Run reverse-mode backprop and return a dense gradient array for all tape entries.

seed is the upstream gradient for out (like PyTorch's backward(gradient=...)).

def Runtime.Autograd.Torch.Internal.EagerSession.backwardScalarDenseAll {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [Add α] [Zero α] [One α] [DecidableEq Spec.Shape] (loss : TensorRef α Spec.Shape.scalar) :

IO (Array (AnyTensor α))

Run backward from a scalar loss with seed 1.

PyTorch comparison: loss.backward() for a scalar loss.

def Runtime.Autograd.Torch.Internal.EagerSession.grad {α : Type} {sh : Spec.Shape} [DecidableEq Spec.Shape] (grads : Array (AnyTensor α)) (x : TensorRef α sh) :

IO (Spec.Tensor α sh)

Extract the gradient for a particular TensorRef from a dense gradient array.

def Runtime.Autograd.Torch.Internal.EagerSession.sgdStepAll {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [Sub α] [Mul α] [Add α] [Zero α] [DecidableEq Spec.Shape] (lr : α) (grads : Array (AnyTensor α)) :

IO Unit

Apply an SGD update to all parameters recorded via use.

PyTorch comparison: for p in params: p.data -= lr * p.grad.
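As a rough illustration of the update rule, and of why the signature asks for `Sub α` and `Mul α`, here is a minimal sketch over flat arrays. `sgdStep` is a hypothetical helper, not part of the documented API, and it assumes the parameter and its gradient have the same length.

```lean
/-- One SGD step on a flat parameter: `p - lr * g`, elementwise.
    Hypothetical helper; assumes `p` and `g` have the same length. -/
def sgdStep {α : Type} [Sub α] [Mul α] (lr : α) (p g : Array α) : Array α :=
  (p.zip g).map fun (x, dx) => x - lr * dx

-- Example with `Float` parameters and a learning rate of 0.1.
#eval sgdStep (0.1 : Float) #[1.0, 2.0] #[10.0, -10.0]
```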

def Runtime.Autograd.Torch.Internal.EagerSession.sgdStepAllCuda {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [DecidableEq Spec.Shape] (lr : α) (grads : Array Cuda.AnyBuffer) :

IO Unit

Apply an SGD update to all parameters recorded via use, using CUDA device gradients.

This avoids downloading the full dense gradient array and keeps updated parameters in each Param's CUDA mirror. Host tensors are synchronized later by explicit parameter readback.

def Runtime.Autograd.Torch.Internal.EagerSession.sgdStepAllCudaMap {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [DecidableEq Spec.Shape] (lr : α) (grads : CudaGradMap) :

IO Unit

Apply SGD from a sparse CUDA gradient map.

This is the path used by the CUDA trainer. It updates only parameter leaves and avoids allocating zero gradients for every forward activation in the tape.
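The difference from the dense path can be sketched on the host as a fold over the gradient map: work is proportional to the number of entries in the map, not to the length of the tape. `Leaves` and `sgdStepMap` are hypothetical names, and host arrays stand in for device buffers.

```lean
import Std.Data.HashMap

/-- Hypothetical model: leaf id to flat parameter (or gradient) values. -/
abbrev Leaves := Std.HashMap Nat (Array Float)

/-- Apply `p - lr * g` only to leaves that have an entry in `grads`.
    Entries whose id is not a known parameter are skipped. -/
def sgdStepMap (lr : Float) (params grads : Leaves) : Leaves :=
  grads.fold (init := params) fun acc id g =>
    match acc.get? id with
    | some p => acc.insert id ((p.zip g).map fun (x, dx) => x - lr * dx)
    | none   => acc
```

Parameters without a gradient entry are returned unchanged, so no zero gradient has to be materialized for them.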

structure Runtime.Autograd.Torch.Internal.EagerSession.CudaAdamParamState :

Type

Device-side Adam moment buffers for one parameter leaf.

m : Cuda.Buffer
First moment buffer.
v : Cuda.Buffer
Second moment buffer.
t : ℕ
Adam step counter for this parameter.

@[reducible, inline]

abbrev Runtime.Autograd.Torch.Internal.EagerSession.CudaAdamState :

Type

Adam moment state keyed by parameter leaf id.

def Runtime.Autograd.Torch.Internal.EagerSession.adamStepAllCuda {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [DecidableEq Spec.Shape] (stateRef : IO.Ref CudaAdamState) (lr beta1 beta2 epsilon : α) (grads : Array Cuda.AnyBuffer) :

IO Unit

Apply an Adam update to all parameters recorded via use, using CUDA device gradients.

This is the CUDA analogue of the generic TorchLean.Optim.adam path. It keeps Adam moments as device buffers and keeps updated parameters in each Param's CUDA mirror, so the next CUDA forward can reuse them without a host upload. Host tensors are synchronized later by explicit readback.
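For orientation, the per-parameter arithmetic of a textbook Adam step with bias correction can be written as a host-side sketch. The source does not show the CUDA kernel, so details such as where `epsilon` enters are assumptions here. `AdamParamState` and `adamStep` are hypothetical names that mirror the `m`, `v`, and `t` fields of `CudaAdamParamState`, using host arrays instead of device buffers.

```lean
/-- Hypothetical host-side mirror of `CudaAdamParamState`. -/
structure AdamParamState where
  m : Array Float   -- first moment
  v : Array Float   -- second moment
  t : Nat           -- step counter

/-- One textbook Adam step; returns the new parameter and the new state.
    Assumes `p`, `g`, `st.m`, and `st.v` all have the same length. -/
def adamStep (lr beta1 beta2 eps : Float) (p g : Array Float) (st : AdamParamState) :
    Array Float × AdamParamState :=
  let t := st.t + 1
  -- Exponential moving averages of the gradient and its square.
  let m := (st.m.zip g).map fun (mi, gi) => beta1 * mi + (1 - beta1) * gi
  let v := (st.v.zip g).map fun (vi, gi) => beta2 * vi + (1 - beta2) * gi * gi
  -- Bias-correction denominators for step `t`.
  let c1 := 1 - Float.pow beta1 t.toFloat
  let c2 := 1 - Float.pow beta2 t.toFloat
  let p' := (p.zip (m.zip v)).map fun (x, mi, vi) =>
    x - lr * (mi / c1) / (Float.sqrt (vi / c2) + eps)
  (p', { m := m, v := v, t := t })

-- First step from zero moments: the parameter moves by roughly `lr`.
#eval
  let st : AdamParamState := { m := #[0.0], v := #[0.0], t := 0 }
  (adamStep 0.001 0.9 0.999 1e-8 #[1.0] #[0.5] st).1
```

Keeping `m`, `v`, and `t` together per leaf is what lets the optimizer state live next to the parameter mirror between steps.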

def Runtime.Autograd.Torch.Internal.EagerSession.adamStepAllCudaMap {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [DecidableEq Spec.Shape] (stateRef : IO.Ref CudaAdamState) (lr beta1 beta2 epsilon : α) (grads : CudaGradMap) :

IO Unit

Apply Adam using an already-computed sparse CUDA gradient map.

def Runtime.Autograd.Torch.Internal.EagerSession.adamWStepAllCuda {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [DecidableEq Spec.Shape] (stateRef : IO.Ref CudaAdamState) (lr weightDecay beta1 beta2 epsilon : α) (grads : Array Cuda.AnyBuffer) :

IO Unit

Apply an AdamW update to all parameters recorded via use, using CUDA device gradients.

This mirrors Optim.AdamW.update: moments are formed from the raw gradient, weight decay is applied directly to parameters, then the Adam update is applied. Like adamStepAllCuda, it keeps updated parameter buffers resident on device and only synchronizes the host copy when readback is requested.
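The ordering described above (raw-gradient moments, then decoupled decay, then the Adam update) can be sketched on the host as follows. This is an illustration, not the `Optim.AdamW.update` source: `AdamParamState` and `adamWStep` are hypothetical names, and the decay factor `1 - lr * weightDecay` follows the PyTorch convention, which is an assumption about this library.

```lean
/-- Hypothetical host-side mirror of `CudaAdamParamState`. -/
structure AdamParamState where
  m : Array Float
  v : Array Float
  t : Nat

/-- One AdamW step with decoupled weight decay.
    Assumes `p`, `g`, `st.m`, and `st.v` all have the same length. -/
def adamWStep (lr weightDecay beta1 beta2 eps : Float) (p g : Array Float)
    (st : AdamParamState) : Array Float × AdamParamState :=
  let t := st.t + 1
  -- 1. Moments use the raw gradient; no `weightDecay * p` term is added to `g`.
  let m := (st.m.zip g).map fun (mi, gi) => beta1 * mi + (1 - beta1) * gi
  let v := (st.v.zip g).map fun (vi, gi) => beta2 * vi + (1 - beta2) * gi * gi
  -- 2. Weight decay is applied directly to the parameters.
  let decayed := p.map fun x => x * (1 - lr * weightDecay)
  -- 3. The bias-corrected Adam update is applied to the decayed parameters.
  let c1 := 1 - Float.pow beta1 t.toFloat
  let c2 := 1 - Float.pow beta2 t.toFloat
  let p' := (decayed.zip (m.zip v)).map fun (x, mi, vi) =>
    x - lr * (mi / c1) / (Float.sqrt (vi / c2) + eps)
  (p', { m := m, v := v, t := t })
```

Because the decay never passes through `m` or `v`, it is not rescaled by the adaptive denominator, which is the usual motivation for AdamW over L2-regularized Adam.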

def Runtime.Autograd.Torch.Internal.EagerSession.adamWStepAllCudaMap {α : Type} [CudaBridge.TensorConv α] (s : EagerSession α) [DecidableEq Spec.Shape] (stateRef : IO.Ref CudaAdamState) (lr weightDecay beta1 beta2 epsilon : α) (grads : CudaGradMap) :

IO Unit

Apply AdamW from a sparse CUDA gradient map.

The dense-array AdamW function remains available for callers that explicitly ask for all tape gradients, but normal training should use this sparse map so activation gradients can be released as soon as their contributions have been propagated.

NN.Runtime.Autograd.Torch.Core.BackwardOptim