StepAlgebra #
Pure training-step algebra for proved autograd graphs.
This file deliberately stays on the mathematical side of training:
- `Graph.scalarLoss_grad_correct` specializes the tape/DAG backprop theorem to the scalar-loss convention used by training loops.
- `SGD.step` lifts the single-tensor SGD equation to a heterogeneous `TList` of parameters.
It is not another runtime optimizer implementation. Runtime files such as
`Runtime.Autograd.Torch.ParamList.sgdStep{,Fast}` and the CUDA eager `sgdStepAllCuda` mutate `IO.Ref`s or
device buffers; their intended mathematical update is the pure context equation stated here.
PyTorch correspondence / citations #
- “Scalar loss” convention: backprop starts with upstream gradient 1. https://pytorch.org/docs/stable/autograd.html
- SGD update rule corresponds to `torch.optim.SGD` at the level of pure parameter math. https://pytorch.org/docs/stable/generated/torch.optim.SGD.html
Seed cotangent for a scalar-loss graph.
The full context has shape list `Γ ++ ss ++ [scalar]`: original inputs/parameters, intermediate
nodes, and the final scalar loss. Reverse-mode training seeds all non-output cotangents with zero
and the scalar-loss output cotangent with 1, matching `loss.backward()` in PyTorch.
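A minimal Lean sketch of this seeding, using toy stand-ins for the library's types (here `Tensor` is just a flat list of `Float`s, `TList` is a simplified heterogeneous list, and `scalarSeed` is a hypothetical name chosen for illustration):

```lean
/-- Toy tensor: a flat list of floats tagged with its shape
    (a stand-in, not the library's real `Tensor`). -/
structure Tensor (s : List Nat) where
  data : List Float

/-- All-zero tensor of shape `s`. -/
def Tensor.zeros (s : List Nat) : Tensor s :=
  ⟨List.replicate (s.foldl (· * ·) 1) 0.0⟩

/-- The scalar `1`, used to seed the loss output. -/
def Tensor.one : Tensor [] :=
  ⟨[1.0]⟩

/-- Simplified heterogeneous list of tensors over a list of shapes. -/
inductive TList : List (List Nat) → Type where
  | nil  : TList []
  | cons : {s : List Nat} → {Γ : List (List Nat)} →
      Tensor s → TList Γ → TList (s :: Γ)

/-- Seed cotangent for a context whose last slot is the scalar loss:
    zero on every non-output slot, `1` on the scalar output, matching
    the upstream gradient used by `loss.backward()`. -/
def scalarSeed : (pre : List (List Nat)) → TList (pre ++ [[]])
  | []     => .cons Tensor.one .nil
  | s :: Γ => .cons (Tensor.zeros s) (scalarSeed Γ)
```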
Scalar-loss specialization of `Graph.backprop_correct`.
The left side differentiates the graph along a forward-mode perturbation `dx` and pairs the result
with the scalar-loss seed. The right side pairs the same perturbation with the context cotangent
computed by reverse mode. This is the exact algebraic statement behind “the gradient returned by
backprop is the cotangent for the scalar loss”.
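In ordinary adjoint notation (illustrative only; `G`, `J_G`, `L`, and the pairing are not identifiers from this file), the identity being specialized is the usual transpose relation between forward and reverse mode:

```latex
\[
  \langle J_G(x)\, dx,\; e_{\mathrm{loss}} \rangle
  \;=\; \langle dx,\; J_G(x)^{\top} e_{\mathrm{loss}} \rangle
  \;=\; \langle dx,\; \nabla L(x) \rangle ,
\]
```

with `e_loss` the seed cotangent described above (zero everywhere except 1 on the scalar loss) and `J_G(x)^T e_loss` the context cotangent produced by reverse mode.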
One pure SGD step over a typed parameter context:

`params := params - lr * grads`

This is the context-level version of the ordinary tensor update
`p := p - lr * g`. Runtime training code may implement the same equation by mutating parameter
references, materializing tensors eagerly, or launching CUDA kernels; this definition is the
side-effect-free algebraic target those implementations are meant to realize.
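A side-effect-free Lean sketch of such a step, continuing the toy `Tensor`/`TList` definitions from the seed-cotangent sketch above (the names `Tensor.sgd` and `TList.sgdStep` are hypothetical; the library's definition is `SGD.step`):

```lean
/-- Elementwise `p - lr * g` on one tensor; this plays the role of the
    `subSpec p (scaleSpec g lr)` head update mentioned below. -/
def Tensor.sgd {s : List Nat} (lr : Float) (p g : Tensor s) : Tensor s :=
  ⟨List.zipWith (fun pi gi => pi - lr * gi) p.data g.data⟩

/-- One pure SGD step over a typed parameter context: walk the parameter
    and gradient lists in lockstep and update every slot with the same
    learning rate. -/
def TList.sgdStep (lr : Float) :
    {Γ : List (List Nat)} → TList Γ → TList Γ → TList Γ
  | _, .nil,       .nil       => .nil
  | _, .cons p ps, .cons g gs => .cons (p.sgd lr g) (TList.sgdStep lr ps gs)
```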
Cons-form unfolding of a context-level SGD step.
This lemma is intentionally small but useful as documentation: the head tensor update is exactly
`subSpec p (scaleSpec g lr)`, and the tail recursively receives the same learning rate.
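Stated against the toy `TList.sgdStep` above (a sketch only; the real lemma is phrased in terms of `SGD.step`, `subSpec`, and `scaleSpec`), the unfolding looks like:

```lean
/-- Cons-form unfolding of the toy step: the head tensor is updated with
    `p - lr * g`, and the tail recurses with the same learning rate. -/
example {s : List Nat} {Γ : List (List Nat)} (lr : Float)
    (p g : Tensor s) (ps gs : TList Γ) :
    TList.sgdStep lr (.cons p ps) (.cons g gs)
      = .cons (p.sgd lr g) (TList.sgdStep lr ps gs) := by
  simp [TList.sgdStep]
```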