Engine Core #
A small dynamic (DAG) autograd engine.
This is intended to be the "runtime" counterpart to Spec.OpSpec: instead of manually writing
backward passes end-to-end, you build a tape during a forward pass and then call backward.
Design goals:
- Works for arbitrary Tensor α s shapes using Runtime.AnyTensor packing.
- Supports DAGs (shared subexpressions): gradients are accumulated by summation.
- Keeps the API compact and shape-safe enough for practical use.
Scope boundaries:
- A fully verified analytic calculus proof (HasFDerivAt etc.) for all ops is not attempted here. The engine is correct given the local backward rules used to build nodes, and we add small regression checks in NN/Tests/Runtime.
- For a PyTorch-style imperative API over this tape, see NN.Runtime.Autograd.Torch.Core.
References (PyTorch / background reading):
- PyTorch docs: torch.autograd and "Autograd mechanics":
  https://pytorch.org/docs/stable/autograd.html
  https://pytorch.org/docs/stable/notes/autograd.html
- "micrograd" (small autograd engine, useful for intuition): https://github.com/karpathy/micrograd
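For orientation, a full round trip through the engine looks roughly like the sketch below. The constructor names, argument orders, and whether individual calls are wrapped in Autograd.Result are illustrative assumptions, not the verified API; see the Tape Construction and Backpropagation sections for the actual ops.

```lean
-- Minimal sketch (hypothetical names and signatures, for intuition only).
-- Forward: each tape op returns the updated tape together with the new node id.
def tinySession {s : Shape} (x w : Tensor Float s) :=
  let (t₁, xId)    := Tape.leaf Tape.empty x      -- leaf node for the input
  let (t₂, wId)    := Tape.leaf t₁ w              -- leaf node for the parameter
  let (t₃, yId)    := Tape.mul t₂ xId wId         -- elementwise product node
  let (t₄, lossId) := Tape.sum_all t₃ yId         -- reduce to a scalar loss node
  -- Reverse mode: seed the scalar output with gradient 1 and walk ids in reverse;
  -- gradients flowing into shared parents are accumulated by summation.
  (t₄, lossId, Tape.backwardScalar t₄ lossId)
```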
Core Types #
The eager autograd engine is built out of a few small pieces:
Convert an Autograd.Result into an IO action by throwing IO.userError on failure.
This is mainly used by the imperative Torch/TorchLean front-ends to keep their code readable.
Instances For
Pack a typed tensor as a runtime AnyTensor.
This is the primary bridge between the dependently-typed Tensor α s world and the dynamic tape,
which stores heterogeneous shapes in a single array.
Instances For
Cast an AnyTensor to a specific shape, given an equality proof.
This is used after dynamic shape checks (e.g. Tape.requireValue).
Instances For
Accumulate two AnyTensor values by elementwise addition, with a dynamic shape check.
This is the heart of DAG support: if two different paths contribute gradients to the same parent, we sum the contributions.
Instances For
A tape node representing a single tensor value in the recorded computation graph.
Fields:
- value: the forward value (shape-erased).
- parents: ids of parent nodes in the tape.
- backward: a local VJP rule. Given an upstream cotangent for value, it returns a list of (parentId, parentCotangent) contributions (one per parent, usually).
PyTorch comparison: analogous to an autograd Function instance + saved tensors, but here we store
the backward closure directly.
Optional node name used for debugging and pretty-printing.
- value : AnyTensor α
Forward value computed at this node (shape-erased).
- requires_grad : Bool
Whether reverse-mode propagation should visit this node.
If false, reverse-mode traversal skips this node and does not accumulate gradients into it.
- parents
Parent node ids (dependencies) in the tape.
- backward
Local VJP rule for this node. Given an upstream cotangent for value, return a list of (parentId, parentCotangent) contributions. If multiple children contribute to the same parent, the engine will sum contributions via AnyTensor.add.
Instances For
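Schematically, the field layout described above corresponds to a structure like the following sketch (illustrative only; the literal declaration, default values, and whether backward can fail are defined in the source):

```lean
-- Sketch of the node layout described above (not the literal declaration).
structure Node (α : Type) where
  name          : Option String := none   -- optional debug / pretty-printing name
  value         : AnyTensor α             -- forward value, shape-erased
  requires_grad : Bool := true            -- whether reverse mode visits this node
  parents       : Array Nat               -- ids of parent nodes in the tape
  backward      : AnyTensor α → List (Nat × AnyTensor α)
  -- backward: local VJP rule mapping an upstream cotangent for `value` to
  -- (parentId, parentCotangent) contributions, one per parent (usually).
```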
Autograd tape: a grow-only array of nodes.
Node ids are array indices (Nat). All ops append exactly one node and return its id.
This makes it easy to implement reverse-mode by traversing ids in reverse order.
Tape nodes in evaluation order.
Node ids are array indices (Nat). Each tape op appends exactly one node and returns its id.
Instances For
Tape Construction #
The Tape namespace provides pure constructors for building a recorded computation graph.
Each op appends exactly one node and returns the updated tape plus the new node id.
If you prefer an implicit tape-threading style, see NN.Runtime.Autograd.Engine.TapeM.
Number of nodes stored in the tape.
Instances For
Add a leaf node (no parents).
PyTorch comparison: a tensor that enters the graph as a leaf (e.g. input or parameter value).
Instances For
Read a typed tensor value from a tape node id.
This is the main "dynamic check" boundary in the eager runtime:
- fails if the id is invalid, or
- fails if the stored runtime shape doesn't match the expected dependent shape s.
Instances For
Read a typed upstream gradient tensor from a runtime AnyTensor.
This is the backward analogue of Tape.requireValue: it checks that the upstream gradient has the
expected shape τ and then performs the dependent cast.
Instances For
Generic constructor for unary ops.
You provide:
- forward : Tensor α σ → Tensor α τ
- backward : Tensor α σ → Tensor α τ → Tensor α σ (a VJP rule; note it may depend on the input)
The returned node stores the forward value and a backward closure that checks the upstream gradient's shape and returns the parent contribution.
Instances For
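For instance, a pointwise square op could be assembled from this constructor along the following lines (a sketch: the name Tape.unary, the elided α / shape binders, and elementwise * and + on Tensor are assumptions here):

```lean
-- Hypothetical sketch: x ↦ x * x built from the generic unary constructor.
def squareNode (t : Tape α) (xId : Nat) :=
  Tape.unary t xId
    (fun x => x * x)             -- forward : Tensor α s → Tensor α s
    (fun x g => x * g + x * g)   -- backward VJP: d/dx x² = 2x, so the contribution is 2·x·g
```

Note that the backward closure receives the original input x, matching the signature above in which the VJP may depend on the input.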
Flatten a tensor s into a 1D vector of length Shape.size s.
PyTorch comparison: torch.flatten(x) with start_dim=0.
Instances For
Reshape a tensor while preserving number of elements.
The proof argument h enforces Shape.size s₁ = Shape.size s₂.
PyTorch comparison: x.reshape(new_shape) / x.view(new_shape) (when valid).
Instances For
Transpose a 2D matrix. PyTorch: x.t() / x.transpose(0,1).
Instances For
Permute a 3D tensor (a,b,c) → (b,c,a). PyTorch: x.permute(1,2,0).
Instances For
Permute a 3D tensor (a,b,c) → (c,a,b). PyTorch: x.permute(2,0,1).
Instances For
Swap the last two axes of a 3D tensor (a,b,c) → (a,c,b). PyTorch: x.transpose(1,2).
Instances For
Swap adjacent axes at a given depth inside a general Shape.
This is a more general analogue of transpose operations.
Instances For
Broadcast x : s₁ to s₂ using a proof Shape.CanBroadcastTo s₁ s₂.
PyTorch comparison: implicit broadcasting / x.expand(...).
Instances For
Sum-reduce along axis.
PyTorch comparison: torch.sum(x, dim=axis).
Instances For
Mean-reduce along axis.
Backward rule: broadcast the upstream cotangent back to s and divide by the reduced dimension.
PyTorch comparison: torch.mean(x, dim=axis).
Instances For
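Concretely, if the reduced axis has length n and ḡ is the upstream cotangent of the reduced tensor, the rule described above is

$$
y_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x_{i,j}} = \frac{1}{n}\,\bar{g}_j,
$$

writing the reduced axis first for readability.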
Gather a scalar from a 1D vector using a compile-time index Fin n.
PyTorch comparison: x[i] (1D indexing).
Instances For
Gather a row from a 2D matrix using a compile-time index Fin rows.
PyTorch comparison: x[i] for 2D tensors (row indexing).
Instances For
Gather a scalar from a 1D vector using a runtime Nat index.
Out-of-bounds indices are totalized to return 0.
PyTorch comparison: x[i] would raise on out-of-range; here we return 0 to keep the op total.
Instances For
Gather k scalars from a 1D vector using an explicit index tensor.
Out-of-bounds indices are totalized to 0. In the backward pass, gradients are accumulated for
repeated indices (scatter-add semantics).
PyTorch comparison: related to torch.gather / advanced indexing.
Instances For
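A small worked example of the scatter-add backward: with x = (x₀, x₁, x₂), index tensor (0, 0, 2), and upstream cotangent ḡ = (g₀, g₁, g₂) for the gathered output (x₀, x₀, x₂), contributions to a repeated index sum:

$$
\frac{\partial L}{\partial x} = (g_0 + g_1,\; 0,\; g_2).
$$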
Gather k rows from a 2D matrix using an explicit index tensor.
Out-of-bounds indices are totalized to zero rows; backward accumulates gradients into selected rows (scatter-add), including repeated indices.
Instances For
Scatter-add into a vector: return a copy of x with x[i] += v.
Backward: gradient w.r.t. x is the upstream dL/dy, and gradient w.r.t. v is the gathered
scalar dL/dy[i].
Instances For
Scatter-add into a matrix row: return a copy of x with x[i,:] += v.
Backward: gradient w.r.t. v is the gathered row dL/dy[i,:].
Instances For
Elementwise addition. PyTorch: torch.add / +.
Instances For
Elementwise subtraction. PyTorch: torch.sub / -.
Instances For
Elementwise multiplication. PyTorch: torch.mul / *.
Instances For
Multiply a tensor by a scalar constant. PyTorch: x * c for Python scalar c.
Instances For
Elementwise absolute value.
Backward uses the sign function (sign_spec) as a subgradient at 0.
PyTorch comparison: torch.abs.
Instances For
Elementwise square root.
Backward uses 1 / (2 * sqrt(x)) for x > 0 and 0 otherwise (totalized).
PyTorch comparison: torch.sqrt.
Instances For
Elementwise clamp to [minVal, maxVal].
Backward multiplies by an indicator of the open interval (minVal, maxVal) (zero at boundaries).
PyTorch comparison: torch.clamp.
Instances For
Elementwise maximum.
Tie-breaking: when a = b, the upstream gradient is split evenly (0.5) between both inputs.
PyTorch comparison: torch.maximum.
Instances For
Elementwise minimum.
Tie-breaking: when a = b, the upstream gradient is split evenly (0.5) between both inputs.
PyTorch comparison: torch.minimum.
Instances For
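Written out for maximum (minimum is symmetric), the subgradient used here is

$$
\frac{\partial}{\partial a}\max(a,b) =
\begin{cases}
1 & a > b \\
\tfrac{1}{2} & a = b \\
0 & a < b,
\end{cases}
$$

with ∂/∂b the mirror image, so the two contributions always sum to the full upstream gradient.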
Elementwise ReLU.
PyTorch comparison: torch.relu(x) / torch.nn.functional.relu(x).
Instances For
Fully-connected linear layer y = W x + b (matvec).
Type-level shapes enforce W : (outDim, inDim), x : (inDim,), b : (outDim,).
PyTorch comparison: torch.nn.functional.linear.
Instances For
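With upstream cotangent ḡ of shape (outDim,), the standard VJPs for y = W x + b are

$$
\frac{\partial L}{\partial W} = \bar{g}\, x^{\top}, \qquad
\frac{\partial L}{\partial x} = W^{\top} \bar{g}, \qquad
\frac{\partial L}{\partial b} = \bar{g}.
$$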
2D matrix multiplication.
PyTorch comparison: torch.matmul(a, b) for 2D tensors.
Instances For
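For C = A B with upstream cotangent Ḡ (same shape as C), the standard matrix-product VJPs are

$$
\frac{\partial L}{\partial A} = \bar{G}\, B^{\top}, \qquad
\frac{\partial L}{\partial B} = A^{\top} \bar{G}.
$$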
Batched matrix multiplication.
PyTorch comparison: torch.bmm(a, b).
Instances For
Concatenate two 1D vectors along dimension 0.
PyTorch comparison: torch.cat([a, b], dim=0) for vectors.
Instances For
Concatenate two tensors along dimension 0.
PyTorch comparison: torch.cat([a, b], dim=0).
Instances For
Slice along dimension 0: x[start : start+len].
The proof argument h enforces bounds.
PyTorch comparison: x[start:start+len] on tensors with a leading dimension.
Instances For
N-D convolution for channels-first tensors (inC, spatial...) (no batch axis).
This is the generic counterpart to conv2d; conv2d is implemented as a specialization with
d = 2, scalar stride, and scalar padding.
Instances For
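Assuming the usual PyTorch-style convention (and no dilation), the output spatial size on each axis i is

$$
\text{out}_i = \left\lfloor \frac{\text{in}_i + 2\,\text{padding}_i - \text{kernel}_i}{\text{stride}_i} \right\rfloor + 1.
$$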
2D convolution for channel-first images (inC,inH,inW) (no batch axis).
PyTorch comparison: torch.nn.functional.conv2d specialized to a single image.
Instances For
N-D transpose convolution for channels-first tensors (inC, spatial...) (no batch axis).
This is the generic counterpart to conv_transpose2d.
Kernel layout matches the spec/PyTorch convention (inC, outC, kernel[0], ..., kernel[d-1]).
PyTorch comparison: torch.nn.functional.conv_transpose{d}d specialized to a single sample
(no batch axis).
Instances For
2D transpose convolution for channel-first images (inC,inH,inW) (no batch axis).
This is implemented as a specialization of conv_transpose with d = 2, scalar stride, and
scalar padding.
Kernel layout matches the spec/PyTorch convention (inC,outC,kH,kW).
PyTorch comparison: torch.nn.functional.conv_transpose2d specialized to a single image.
Instances For
N-D max pooling for channels-first tensors (C, spatial...) (no batch axis).
Padding is symmetric per-axis and uses zeros. To model unpadded pooling, pass padding := 0 on
every axis.
Instances For
N-D average pooling for channels-first tensors (C, spatial...) (no batch axis).
Padding is symmetric per-axis and uses zeros; pooling uses count_include_pad=true semantics.
Instances For
N-D smooth max pooling (log-sum-exp surrogate) for channels-first tensors (C, spatial...).
Instances For
2D max-pooling for channel-first images (no batch axis).
PyTorch comparison: torch.nn.functional.max_pool2d.
Instances For
2D max-pooling with padding for channel-first images (no batch axis).
PyTorch comparison: max_pool2d(..., padding=...).
Instances For
Smooth approximation of max-pooling (softmax pooling).
This is not a standard PyTorch primitive; it is useful for differentiable relaxations.
Instances For
2D average-pooling for channel-first images (no batch axis).
PyTorch comparison: torch.nn.functional.avg_pool2d.
Instances For
2D average-pooling with padding for channel-first images (no batch axis).
PyTorch comparison: avg_pool2d(..., padding=...).
Instances For
Layer normalization for (seqLen, embedDim) tensors.
This records a single node whose backward returns gradients for x, gamma, and beta.
PyTorch comparison: torch.nn.LayerNorm(embedDim) (applied per token) / functional.layer_norm.
Instances For
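Per token (per row of the (seqLen, embedDim) input), the standard layer-norm computation is

$$
y = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta,
$$

where μ and σ² are the mean and variance over embedDim and ε is a small stabilizer whose exact handling is fixed by the spec layer.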
Batch normalization for channel-first images (C,H,W) (no batch axis).
PyTorch comparison: conceptually torch.nn.BatchNorm2d(C) / functional.batch_norm on NCHW, but
specialized here to a single image.
Instances For
Multi-head self-attention.
This is a shape-specialized attention primitive used by transformer-style models. It depends on an
optional boolean (n,n) mask and returns the attended output of shape (n,dModel).
PyTorch comparison: similar to torch.nn.MultiheadAttention / scaled dot-product attention.
Instances For
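Per head this follows the standard scaled dot-product attention recipe, with boolean-masked positions conceptually sent to −∞ before the softmax (the exact masking convention is defined by the op):

$$
\mathrm{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
$$

with the per-head outputs concatenated and projected back to dModel.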
Elementwise logistic sigmoid activation.
This builds a tape node whose forward pass is Activation.sigmoid_spec, and whose backward pass
multiplies the upstream gradient by Activation.sigmoid_deriv_spec (i.e. σ(x) * (1 - σ(x)),
pointwise).
PyTorch comparison: torch.sigmoid / torch.nn.functional.sigmoid.
Reference: https://pytorch.org/docs/stable/generated/torch.sigmoid.html
Instances For
Elementwise hyperbolic tangent activation.
Forward uses Activation.tanh_spec; backward uses Activation.tanh_deriv_spec (the pointwise derivative, i.e. 1 - tanh(x)^2).
PyTorch comparison: torch.tanh.
Reference: https://pytorch.org/docs/stable/generated/torch.tanh.html
Instances For
Softmax along the last axis (recursing over outer dimensions).
This matches Activation.softmax_spec (which applies softmax to the final dimension and recurses
over earlier dimensions). The backward pass uses the standard Jacobian-vector product implemented
by Activation.softmax_backward_spec, avoiding materializing an n×n Jacobian per slice.
PyTorch comparison: torch.softmax(x, dim=-1).
Reference: https://pytorch.org/docs/stable/generated/torch.softmax.html
Instances For
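For one slice along the last axis, with s = softmax(x) and upstream cotangent ḡ, that Jacobian-vector product reduces to

$$
\frac{\partial L}{\partial x} = s \odot \left( \bar{g} - \langle \bar{g}, s \rangle \right),
$$

which never materializes the n×n Jacobian.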
Stable log-softmax along the last axis.
Unlike log (softmax x), this uses Activation.logSoftmaxSpec, i.e. the max-shifted
x - max(x) - log(sum(exp(x - max(x)))) formulation. That matches the numerical contract of
torch.nn.functional.log_softmax and is the right primitive for cross-entropy on logits.
Instances For
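For one slice, with y = log_softmax(x) and upstream cotangent ḡ, the corresponding VJP is

$$
\frac{\partial L}{\partial x} = \bar{g} - \operatorname{softmax}(x) \cdot \Big(\sum_j \bar{g}_j\Big).
$$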
Elementwise softplus activation.
Forward uses Activation.softplus_spec; backward uses Activation.softplus_deriv_spec.
PyTorch comparison: torch.nn.functional.softplus.
Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.softplus.html
Instances For
Elementwise exponential.
Forward uses exp_spec; backward multiplies by exp(x) (pointwise), i.e. d/dx exp(x) = exp(x).
PyTorch comparison: torch.exp.
Reference: https://pytorch.org/docs/stable/generated/torch.exp.html
Instances For
Elementwise natural logarithm.
Forward uses log_spec; backward multiplies by 1/x (pointwise), i.e. d/dx log(x) = 1/x
(on its mathematical domain; this runtime does not model NaNs/Infs explicitly).
PyTorch comparison: torch.log.
Reference: https://pytorch.org/docs/stable/generated/torch.log.html
Instances For
Elementwise reciprocal x ↦ 1/x.
Backward implements d/dx (x⁻¹) = -(x⁻¹)² (pointwise).
PyTorch comparison: torch.reciprocal.
Reference: https://pytorch.org/docs/stable/generated/torch.reciprocal.html
Instances For
Elementwise "safe log" that protects against log(0) by adding a small ε internally.
This uses Activation.safe_log_spec and Activation.safe_log_deriv_spec. The exact behavior is
controlled by the spec-layer definition; conceptually it is similar to log(x + ε) used in
numerically-stable losses.
PyTorch comparison: commonly written as torch.log(x + eps) in user code (there is no single
dedicated torch.safe_log primitive).
Instances For
Reduce-sum over all entries, producing a scalar node.
Backward replicates the upstream scalar gradient to every entry of the input tensor (i.e.
d/dx Σ_i x_i = 1 per coordinate).
PyTorch comparison: torch.sum(x) with dim=None.
Reference: https://pytorch.org/docs/stable/generated/torch.sum.html
Instances For
Mean-squared error (MSE) scalar loss with "mean" reduction over all entries.
mse_spec_basic is the scalar loss (Σ_i (yhat_i - target_i)^2) / N where N = Shape.size s.
This matches the default reduction of torch.nn.functional.mse_loss(..., reduction="mean").
Note: the derivative is defined everywhere in this spec-level setting; we do not model NaNs/Infs.
Instances For
Gradient of mse_spec_basic with respect to predicted (same shape as the inputs).
If mse = (Σ_i (yhat_i - target_i)^2) / N, then:
∂mse/∂yhat = (2/N) * (yhat - target).
Instances For
Tape node for MSE loss with "mean" reduction.
The forward value is a scalar. The backward pass returns gradients for both inputs:
dL/dyhat from mse_deriv_spec_basic, and dL/dtarget = - dL/dyhat.
PyTorch comparison: torch.nn.functional.mse_loss.
Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html
Instances For
Backpropagation #
Reverse-mode is implemented by traversing node ids in reverse order. Each node’s backward
closure produces parent-gradient contributions, which we accumulate by elementwise summation.
Internal helper: add a single parent gradient contribution into the dense optional gradient array.
This is where we implement PyTorch-style accumulation for DAGs: if multiple children contribute to the same parent id, we sum the contributions.
The dense array entry is none until we first reach a node during reverse traversal.
Instances For
Reverse-mode backpropagation producing a dense array of optional gradients.
- The result array has length t.nodes.size.
- Entry id is some g if the node was reached from outId during reverse traversal, otherwise none.
- When multiple paths contribute to the same node, we sum gradients via AnyTensor.add.
This is loosely analogous to PyTorch's autograd engine walking the dynamic graph and accumulating
.grad for leaf tensors, but we keep gradients for every node id, not just leaves. That makes the
runtime easier to debug and gives proof-bridge code direct access to intermediate cotangents.
Reference (PyTorch): https://pytorch.org/docs/stable/notes/autograd.html
Instances For
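As a usage sketch (the function's exact argument list and the constructors of Autograd.Result are treated as assumptions here), reading one node's gradient out of the dense optional array looks like:

```lean
-- Hypothetical sketch: run reverse mode from outId, then look up one node's gradient.
-- Names, argument order, and the Result constructors are illustrative, not the exact API.
def gradAt (t : Tape α) (outId xId : Nat) (seed : AnyTensor α) : Option (AnyTensor α) :=
  match Tape.backwardDense t outId seed with
  | .error _  => none                 -- e.g. a shape mismatch during accumulation
  | .ok grads => (grads[xId]?).join   -- none if xId was never reached from outId
```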
Internal helper: like addGradDense, but assumes the gradient array is total (no Option).
This is used by the proof-friendly variants (backwardDenseFrom*, backwardDenseAll) that keep
an explicit zero tensor for nodes that do not receive gradients.
Instances For
One reverse-mode backprop step at a single node id, updating a total dense gradient array.
Precondition by convention: acc has one entry per tape node, and every entry has the matching
node shape. The function checks those conditions dynamically and returns an error if a caller
violates them. This makes it suitable as the small proof-friendly step used by
backwardDenseFromLoop.
Instances For
Reverse-mode accumulation over the first n nodes in reverse order.
The recursion visits node ids n-1, n-2, ..., 0. Passing n = t.nodes.size therefore traverses the
entire tape. This structurally recursive loop is the one used by proof-linked compiled sessions.
Instances For
Reverse-mode accumulation starting from an explicit dense gradient array.
This is a proof-friendly variant: it always runs every node (in reverse order) and keeps a gradient tensor for every node id.
Instances For
Reverse-mode accumulation that returns a dense gradient array for every node id.
This differs from backwardDense: instead of leaving entries as none until they are reached,
it initializes a zero gradient tensor for each node. This matches the proof-level tape model
where gradients are explicit (zero for unused nodes).
Instances For
Convert the optional dense gradient array returned by backwardDense into a sparse HashMap.
Only entries that are present (some (some g)) are kept. This is mainly a convenience for callers
that want "just the reached nodes" without carrying nones around.
Instances For
Reverse-mode backpropagation returning a HashMap of only the nodes that received gradients.
This is a convenience wrapper around backwardDense + denseToHashMap.
Instances For
Backpropagate from a scalar output with seed gradient 1.
PyTorch analogy: loss.backward() when loss is a scalar.