Autograd OpSpecs (spec layer) #
This file defines small OpSpec building blocks (forward + VJP) for common tensor operations.
The definitions are intentionally direct mathematical contracts and live purely in the spec layer.
How to read this file:
- Each operation below is an OpSpec: a pure forward plus a pure VJP backward.
- Most ops here are thin wrappers around *Spec and derivative-spec definitions from NN/Spec/*.
Where this sits in TorchLean:
- NN.Spec.* files define pure denotational semantics: what tensors/layers mean.
- This file packages some of those pure definitions as unary OpSpecs: forward plus VJP.
- NN.Runtime.Autograd.* executes programs, tracks parameters, manages tapes/sessions, dispatches CUDA kernels, handles RNG, and compiles graphs.
This file does not mirror every runtime method one-for-one. It is the reusable adapter layer for
operations whose input-gradient VJP is naturally expressed as a single
OpSpec. Larger multi-input/parameterized layers (convolution, attention, batchnorm, pooling, RNG)
still have precise specs and runtime implementations, but their full backward state usually belongs
in layer/runtime code rather than in this compact unary interface.
PyTorch analogy (roughly):
- A spec OpSpec is like a compact torch.autograd.Function where we write down the VJP directly.
- We do not model PyTorch's mutable ctx; the spec layer receives the input tensor x directly.
Elementwise lifting helpers #
Lift an elementwise backward using the chain rule: dL/dx = df(x) * dL/dy pointwise.
This is the standard VJP pattern for elementwise ops.
PyTorch analogy: the "local backward" rule for a pointwise op multiplies by the derivative mask.
Instances For
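As a minimal standalone illustration of this lifting pattern (hypothetical names and list-based tensors, not the actual TorchLean OpSpec types):

```lean
-- An elementwise rule: a scalar forward together with its scalar derivative.
structure ElemRule where
  f  : Float → Float   -- forward f(x)
  df : Float → Float   -- derivative f'(x)

-- Forward maps pointwise.
def elemForward (r : ElemRule) (x : List Float) : List Float :=
  x.map r.f

-- VJP via the chain rule: dL/dx_i = f'(x_i) * dL/dy_i.
def elemVJP (r : ElemRule) (x dLdy : List Float) : List Float :=
  List.zipWith (fun xi gi => r.df xi * gi) x dLdy

-- Example rule: tanh, whose derivative is 1 - tanh(x)^2.
def tanhRule : ElemRule :=
  { f := Float.tanh
    df := fun x => let t := Float.tanh x; 1.0 - t * t }
```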
Elementwise ELU OpSpec on any shape.
Instances For
Elementwise tanh-approximate GELU OpSpec on any shape.
PyTorch analogy: torch.nn.functional.gelu(x, approximate="tanh").
Instances For
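For reference, the tanh approximation in question is the standard one (constants as in the usual formulation; the authoritative definition is the spec in NN/Spec/*):
  gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))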
Linear layers #
Linear layer as an OpSpec: y = W x + b.
This OpSpec only returns the input gradient dL/dx. Parameter gradients for W and b
are intentionally not part of OpSpec (those live at the graph/runtime level).
PyTorch analogy: torch.nn.Linear forward, with autograd producing grads for x/W/b.
Instances For
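For orientation, the full VJP algebra for y = W x + b is standard; this OpSpec returns only the first line, while the parameter rules are handled at the graph/runtime level:
  dL/dx = Wᵀ (dL/dy)
  dL/dW = (dL/dy) xᵀ
  dL/db = dL/dy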
Extract the scalar value from a scalar tensor.
We use this when an upstream gradient is a scalar (e.g. for reduced losses). In PyTorch this is
the common pattern "loss is scalar, so grad_output is a scalar too".
Instances For
Generic elementwise binary OpSpec with captured right-hand tensor and d/dx.
This is a "closure style" op: we treat the RHS tensor as a captured constant and only return the VJP with respect to the LHS input.
PyTorch analogy: in a tape/graph, rhs is typically another node; here we are writing the
"lhs-only" derivative for convenience.
Instances For
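A standalone sketch of the closure style (hypothetical names, not the TorchLean signatures): the RHS tensor is baked into the rule, and the VJP only covers the LHS.

```lean
-- A binary elementwise rule with its lhs-derivative.
structure BinRule where
  f    : Float → Float → Float   -- forward f(lhs, rhs)
  dlhs : Float → Float → Float   -- ∂f/∂lhs evaluated at (lhs, rhs)

-- Example: elementwise multiply by a captured rhs, where ∂(x * r)/∂x = r.
def mulRule : BinRule :=
  { f := (· * ·), dlhs := fun _ r => r }

-- VJP w.r.t. the lhs only: dL/dx_i = ∂f/∂lhs (x_i, r_i) * dL/dy_i.
def binVJP (rule : BinRule) (rhs x dLdy : List Float) : List Float :=
  List.zipWith (· * ·) (List.zipWith rule.dlhs x rhs) dLdy
```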
Unary elementwise ops #
Smooth absolute value (a differentiable surrogate for abs).
This is useful when you want to avoid a kink at 0 in optimization.
PyTorch analogy: there is no single canonical smooth_abs, but it is similar in spirit to
sqrt(x^2 + eps)-style smoothings.
Instances For
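One concrete instance of this style of smoothing, written only as an illustration (the exact TorchLean definition is in the spec), is
  smooth_abs(x) = sqrt(x^2 + ε), with pointwise derivative x / sqrt(x^2 + ε),
which approaches |x| as ε → 0 but stays differentiable at 0.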
Elementwise natural logarithm.
Domain discipline: this is the raw mathematical/PyTorch-style rule. The VJP multiplies by 1/x,
so callers should use it only when the input is strictly positive. Runtime backends are allowed to
reject nonpositive inputs rather than silently manufacture a gradient. Use safeLogOp when the
intended model is log(x + ε).
PyTorch analogy: torch.log(x).
Instances For
Elementwise log with epsilon shift, log(x + ε).
This is the default facade-safe logarithm: it is total as a spec expression and its VJP uses
1/(x+ε) pointwise.
PyTorch analogy: often written manually as torch.log(x + eps).
Instances For
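Worked out, the two logarithm rules differ only in the shift: the raw VJP is dL/dx = dL/dy * (1/x), while the ε-shifted version used here is dL/dx = dL/dy * (1/(x + ε)), which stays finite at x = 0.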
Elementwise square root.
Domain discipline: TorchLean's spec-level sqrtSpec is total by clamping the forward value on
nonpositive inputs. The VJP follows that convention and returns zero where x <= 0, rather than
introducing an artificial 1/ε spike.
PyTorch analogy: torch.sqrt(x) on the positive region, with an explicit TorchLean subgradient
choice outside the classical domain.
Instances For
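Concretely, with y = sqrt(x) the classical rule dy/dx = 1/(2 sqrt(x)) applies on x > 0, so under the convention above the VJP is
  dL/dx = dL/dy / (2 sqrt(x))  for x > 0,  and  dL/dx = 0  for x <= 0.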
Elementwise reciprocal, 1/x.
Domain discipline: this is the raw reciprocal. Its VJP is -1/x^2, so callers should use it only
when zero is excluded by the surrounding invariant. Use safeInvOp when the intended model is
1/(x+ε).
PyTorch analogy: torch.reciprocal(x) or 1 / x.
Instances For
Elementwise epsilon-shifted reciprocal, 1/(x+ε).
This is the safe facade counterpart to invOp: the forward pass delegates to safedivSpec with
unit numerator, and the VJP is the derivative of the same shifted expression.
PyTorch analogy: usually written manually as 1.0 / (x + eps).
Instances For
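Spelled out: for y = 1/(x + ε) the pointwise derivative is -1/(x + ε)^2, so dL/dx = -dL/dy / (x + ε)^2, the ε-shifted analogue of the raw -1/x^2 rule of invOp.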
Binary ops capturing a right-hand tensor #
Elementwise divide by a captured RHS tensor.
Domain discipline: this is the raw division rule. The VJP multiplies by 1/rhs, so callers should
only use it when the captured denominator is known nonzero. Use safeDivOp when the
intended model is x/(rhs+ε).
Instances For
Leaky ReLU with slope parameter.
PyTorch analogy: torch.nn.functional.leaky_relu(x, negative_slope=alpha_l).
Instances For
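A standalone sketch of the rule (hypothetical names): the derivative is 1 on the positive side and the slope elsewhere; the TorchLean spec fixes the exact convention at x = 0.

```lean
-- Leaky ReLU forward: keep positive inputs, scale negative ones by `alpha`.
def leakyRelu (alpha x : Float) : Float :=
  if x > 0.0 then x else alpha * x

-- Pointwise derivative: 1 for x > 0, `alpha` otherwise (a subgradient choice at 0).
def leakyReluDeriv (alpha x : Float) : Float :=
  if x > 0.0 then 1.0 else alpha

-- VJP: dL/dx_i = leakyReluDeriv alpha x_i * dL/dy_i.
def leakyReluVJP (alpha : Float) (x dLdy : List Float) : List Float :=
  List.zipWith (fun xi gi => leakyReluDeriv alpha xi * gi) x dLdy
```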
Loss OpSpecs #
MSE loss (returns a scalar), capturing the target.
Instances For
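As a reference point (assuming a mean reduction over n elements; the spec fixes the exact scaling): L = (1/n) Σ_i (yhat_i - t_i)^2, with VJP dL/dyhat_i = (2/n) (yhat_i - t_i).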
MAE loss (returns a scalar), capturing the target.
Instances For
Huber loss (returns a scalar), capturing the target.
Instances For
Cross-entropy loss (returns a scalar), capturing the target distribution.
This is "cross-entropy between distributions": target is p, yhat is q.
PyTorch analogy: closer to -(p * log(q)).mean() than to the logits-based
torch.nn.functional.cross_entropy default.
Instances For
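Written out for a single distribution pair (up to the reduction/averaging factor): L = -Σ_i p_i * log(q_i), so the VJP with respect to q is dL/dq_i = -p_i / q_i, scaled by whatever reduction the spec applies.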
Logits-based cross-entropy loss, capturing the target distribution.
Instances For
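For orientation, the classical logits-based rule: with q = softmax(z) and L = -Σ_i p_i * log(q_i), the gradient collapses to dL/dz_j = q_j - p_j whenever the target distribution p sums to 1 (again up to the reduction factor chosen by the spec).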
Binary cross-entropy loss on probability tensors, capturing the target tensor.
Instances For
Cosine-similarity loss, capturing the target tensor.
Instances For
Hinge loss (returns a scalar), capturing the target.
Instances For
Poisson loss (returns a scalar), capturing the target.
Instances For
Log-cosh loss (returns a scalar), capturing the target.
Instances For
Normalization #
LayerNorm OpSpec over (seqLen × embedDim). Parameters gamma/beta are captured. Backward returns only ∂L/∂x (parameter grads are not returned at this level).
Instances For
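For reference, the usual per-row forward (mean μ and variance σ² taken over embedDim; whether and where an ε stabilizer appears is fixed by the spec) is y = gamma ⊙ (x - μ) / sqrt(σ² + ε) + beta. The backward here is ∂L/∂x of that expression with gamma/beta held fixed.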
Identity op: pass-through forward and backward.
Instances For
Shape/structure ops #
Matrix transpose (2D) op.
PyTorch analogy: x.transpose(0, 1) for a matrix.
Instances For
Replicate a scalar to any shape; backward sums gradients back to a scalar.
PyTorch analogy: broadcasting a scalar in arithmetic, and in backward accumulating by sum.
Instances For
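A standalone sketch for the vector case (hypothetical names): replication forward, summation backward.

```lean
-- Forward: broadcast one scalar to n positions.
def replicateScalar (n : Nat) (c : Float) : List Float :=
  List.replicate n c

-- Backward: the VJP of replication sums the incoming gradient, dL/dc = Σ_i dL/dy_i.
def replicateVJP (dLdy : List Float) : Float :=
  dLdy.foldl (· + ·) 0.0
```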
Right-multiply by fixed matrix: X (m×n) ↦ X·B (m×p).
Instances For
Left-multiply by fixed matrix: X (n×p) ↦ A·X (m×p).
Instances For
Batched matrix multiply with captured RHS: A ↦ A @ B.
Instances For
Batched matrix multiply with captured LHS: B ↦ A @ B.
Instances For
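The underlying rules for all four of these captured-matrix ops are the standard matrix-calculus identities (the batched versions apply them slice by slice):
  Y = X·B (B fixed):  dL/dX = (dL/dY)·Bᵀ
  Y = A·X (A fixed):  dL/dX = Aᵀ·(dL/dY)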
One-hot embedding as an OpSpec over the one-hot input. Parameter gradients stay outside
OpSpec; this wrapper returns only dOneHot.
Instances For
Reductions and broadcasting #
Reduce-sum along axis using a valid_axis proof; backward broadcasts back.
PyTorch analogy: torch.sum(x, dim=axis) (with keepdim=false).
Instances For
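The simplest instance of this sum/broadcast duality, reducing a vector to a scalar, looks like this (standalone sketch, hypothetical names):

```lean
-- Forward: sum all entries.
def sumForward (x : List Float) : Float :=
  x.foldl (· + ·) 0.0

-- Backward: the VJP of a sum broadcasts the upstream scalar to every input slot.
def sumVJP (n : Nat) (dLdy : Float) : List Float :=
  List.replicate n dLdy
```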
Generic broadcasting-aware binary OpSpec.
The caller supplies:
- explicit broadcast proofs (CanBroadcastTo) for both sides, and
- a reduce_back map that takes a gradient in the broadcasted shape t and reduces it back to the left shape s1.
PyTorch analogy: this is where PyTorch's implicit broadcasting rules and reduction-of-broadcasted gradients ("sum over broadcasted dimensions") happen. In TorchLean we keep those shape relations explicit.
Instances For
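As a concrete instance of the reduce_back contract: if s1 = (1, n) is broadcast to t = (m, n), the reduction sums the gradient over the replicated rows, so the j-th entry of the reduced gradient is Σ_i g_{i,j}; this is exactly the "sum over broadcasted dimensions" rule mentioned above.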
Convenience: broadcasting-aware add with caller-provided reduction.
Instances For
Convenience: broadcasting-aware mul with caller-provided reduction.