# Spec-level autograd operation specifications (`Spec.OpSpec`)
This file defines a small interface for reverse-mode differentiation:
- an operation spec `OpSpec` is a pure `forward` map together with a vector-Jacobian product (VJP) `backward`;
- we provide sequential composition (`compose` / `>>>`) that implements the chain rule.
This is intentionally a spec-layer interface: it is independent of any particular runtime autograd engine or tape representation. The runtime code is free to:
- cache intermediates (PyTorch-style `ctx.save_for_backward`), or
- recompute them, or
- compile to a graph,
as long as it implements the same mathematical VJP behavior.
PyTorch analogy:
- `OpSpec.forward` corresponds to the `forward(...)` method of a `torch.autograd.Function`.
- `OpSpec.backward` corresponds to the `backward(ctx, grad_output)` method, except that in TorchLean we pass the input `x` explicitly instead of a mutable `ctx`. At the spec level that is the same information: the derivative may depend on the forward inputs (and sometimes intermediate values).
In TorchLean we deliberately keep this file smaller than a graph IR: OpSpec is the math contract
(forward + VJP + composition). We do not want to invent yet another graph/IR here, because the repo
already has canonical graph representations:
- `NN/GraphSpec/*` (typed DAG structures used by graph-spec examples and some compilers),
- `NN/IR/*` (op-tagged IR used by verification tooling),
- `NN/Runtime/Autograd/*` (executable tape/graph engines).
When you want "a real graph", use those. When you want "the spec of an op", use OpSpec.
Atomic operation specification (forward + VJP/backward).
backward takes the input x and an upstream gradient dL/dy, and returns dL/dx.
Why this signature:
- Reverse-mode AD never needs a full Jacobian. What it needs is: given an upstream gradient `dL/dy`, compute the gradient with respect to the input, `dL/dx`. That’s exactly what `backward` encodes (a VJP).
- We pass `x` to `backward` because many derivatives depend on the input value. At the spec level we don’t force a “store intermediates vs recompute” strategy; the runtime system can choose.
Fields:
- `forward` : the pure forward map.
- `backward` : the VJP; given the forward input and an upstream gradient, it returns the input gradient.
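Putting these pieces together, a minimal Lean sketch of this shape (illustrative only, assuming plain `Type` parameters; the definition in the repo may differ in universes or namespacing):

```lean
/-- Sketch: an op is a pure forward map plus its VJP. -/
structure OpSpec (σ τ : Type) where
  /-- Pure forward map. -/
  forward : σ → τ
  /-- VJP: given the forward input `x` and the upstream gradient `dL/dy`,
      return `dL/dx`. -/
  backward : σ → τ → σ

/-- Hypothetical example op (not from the repo): squaring on `Float`.
    The derivative `d(x*x)/dx = 2*x` depends on the input, which is exactly
    why `backward` receives `x`. -/
def squareSpec : OpSpec Float Float where
  forward x := x * x
  backward x dy := 2 * x * dy
```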
Sequential composition of two ops with the reverse-mode chain rule.
If f : σ → τ and g : τ → υ, their composition is g ∘ f : σ → υ.
For reverse-mode AD, we compose their VJPs:
- given an upstream gradient `dL/dz : υ`,
- compute `dL/dy : τ` using `g.backward`,
- then compute `dL/dx : σ` using `f.backward`.
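In Jacobian notation these steps are just the chain rule applied as vector-Jacobian products (a standard identity, stated here for reference):

$$
\frac{\partial L}{\partial x}
  = J_f(x)^{\top}\,\frac{\partial L}{\partial y}
  = J_f(x)^{\top}\!\left(J_g(f(x))^{\top}\,\frac{\partial L}{\partial z}\right),
$$

which is why the reverse pass applies `g.backward` before `f.backward`.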
You can visualize the dataflow as a compact chain:
x --f.forward--> y --g.forward--> z
and the reverse pass as the same chain walked backwards:
dL/dz --g.backward--> dL/dy --f.backward--> dL/dx
This is the core of what PyTorch builds dynamically as a "backward graph" during the forward pass,
except here we keep it as an explicit, pure definition. A runtime engine can still choose whether
to cache the intermediate y = f.forward x or recompute it; the spec states the mathematical VJP.
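A minimal Lean sketch of such a composition, assuming the `OpSpec` shape sketched earlier (the repo's actual `compose` may differ in detail, e.g. in how it treats the intermediate):

```lean
/-- Sketch of sequential composition with the reverse-mode chain rule. -/
def OpSpec.compose {σ τ υ : Type} (f : OpSpec σ τ) (g : OpSpec τ υ) : OpSpec σ υ where
  forward x := g.forward (f.forward x)
  backward x dz :=
    let y  := f.forward x      -- intermediate; a runtime engine may cache this instead
    let dy := g.backward y dz  -- dL/dy, the upstream gradient for `f`
    f.backward x dy            -- dL/dx

-- Usage with the hypothetical `squareSpec` from above: (x²)² = x⁴, so the
-- backward pass at x = 3 with upstream gradient 1 yields d(x⁴)/dx = 4·3³ = 108.
#eval (squareSpec.compose squareSpec).backward 3 1
```

In the repo the same composition is also written with the `>>>` notation mentioned above, i.e. `squareSpec >>> squareSpec`.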