Activation functions (spec layer) #
This module is TorchLean's "activation toolbox": pure mathematical definitions of common nonlinearities and their derivatives.
Design intent:
- Scalar definitions live under Activation.Math (functions α → α).
- Tensor-level definitions are almost always the scalar function mapped pointwise via map_spec.
- Where the math is inherently non-pointwise (notably softmax), we provide a shape-aware implementation plus an explicit backward/VJP.
PyTorch mental model:
- Scalar Activation.Math.* corresponds to the formulas behind torch.nn.functional.*.
- Tensor-level Activation.*_spec corresponds to applying that nonlinearity elementwise.
- softmaxSpec here is the real last-axis softmax on tensors (like torch.softmax(x, dim=-1)), implemented recursively over outer dimensions.
Notes on scalar polymorphism:
TorchLean tries hard not to bake "Float everywhere" into the spec. All definitions are written
against a Context α plus the exact algebra/analysis typeclasses they need. That is what lets
the same layer definitions instantiate over:
- Float for fast runtime execution,
- exact/reasoning scalars for proofs,
- interval-like scalars for verification.
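To make the polymorphism concrete, here is a minimal sketch of the style (the name reluSketch and the exact typeclass bundle are illustrative, not the actual TorchLean signatures): a scalar ReLU that asks only for the structure it actually uses, so the same definition elaborates at Float, at exact scalars, or at interval-like types.

```lean
-- Illustrative sketch only; `reluSketch` is not a TorchLean declaration.
-- The definition requests exactly the structure it needs (a `max` and a zero),
-- so it can be instantiated at `Float`, exact scalars, or interval types alike.
def reluSketch {α : Type} [Max α] [OfNat α 0] (x : α) : α :=
  max x 0
```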
References / analogies (stable entry points):
- PyTorch activations: https://pytorch.org/docs/stable/nn.functional.html
- PyTorch torch.softmax: https://pytorch.org/docs/stable/generated/torch.softmax.html
- ReLU: Vinod Nair and Geoffrey Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines" (ICML 2010)
- ELU: Djork-Arne Clevert et al., "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)" (ICLR 2016)
- GELU: Dan Hendrycks and Kevin Gimpel, "Gaussian Error Linear Units (GELUs)" (arXiv:1606.08415)
- Swish / SiLU: Prajit Ramachandran et al., "Searching for Activation Functions" (arXiv:1710.05941)
Scalar activations #
ReLU: relu(x) = max(x, 0).
PyTorch analogy: torch.nn.functional.relu.
This is the simplest nonlinearity we use throughout TorchLean because it stays meaningful across
many scalar backends (including ones that do not support exp/log).
Instances For
A standard subgradient choice for ReLU:
d/dx relu(x) = 1 if x > 0, and 0 otherwise.
PyTorch analogy: autograd picks a subgradient at x = 0; our spec commits to a concrete one to
make "the derivative" a pure function.
The DecidableRel (· > ·) constraint reflects that this definition literally branches on x > 0.
Instances For
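As a sketch of what that constraint buys (again with an illustrative name, not the actual declaration), the definition has the shape of a decidable branch on x > 0:

```lean
-- Illustrative sketch only; `reluDerivSketch` is not a TorchLean declaration.
-- `DecidableRel (· > ·)` is precisely what lets `if x > 0 then _ else _`
-- elaborate for a generic scalar type `α`.
def reluDerivSketch {α : Type} [LT α] [OfNat α 0] [OfNat α 1]
    [DecidableRel ((· > ·) : α → α → Prop)] (x : α) : α :=
  if x > 0 then 1 else 0
```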
Logistic sigmoid:
sigmoid(x) = 1 / (1 + exp(-x)).
PyTorch analogy: torch.nn.functional.sigmoid (or torch.sigmoid).
Instances For
Derivative of sigmoid:
sigmoid'(x) = σ(x) * (1 - σ(x)).
We write it this way (in terms of σ(x)) because that is the form used in most AD systems and it
avoids re-expanding the exponential expression.
Instances For
Hyperbolic tangent: tanh(x). PyTorch analogy: torch.tanh.
Instances For
Derivative of tanh:
tanh'(x) = 1 - tanh(x)^2.
Instances For
Leaky ReLU:
leaky_relu(x; α) = x if x > 0, else α * x.
PyTorch analogy: torch.nn.functional.leaky_relu with negative_slope = α.
Instances For
Derivative of leaky ReLU:
d/dx leaky_relu(x; α) = 1 if x > 0, else α.
Instances For
Sinh derivative: cosh(x).
Instances For
Cosh derivative: sinh(x).
Instances For
Logistic form written as exp(x)/(exp(x)+1).
This is mathematically the same sigmoid function as sigmoidSpec; we keep it as logisticSpec
because several scalar approximation proofs reason about this exp(x) numerator form directly.
Important naming choice: this is not called scalar softmax. A one-entry softmax is always 1;
the real softmax API in TorchLean is the tensor-level Activation.softmaxSpec below.
Instances For
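For reference, the two forms agree by multiplying numerator and denominator by exp(-x):
exp(x) / (exp(x) + 1) = 1 / (1 + exp(-x)) = sigmoid(x).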
Derivative of logisticSpec, expressed in output form.
Instances For
ELU (Exponential Linear Unit):
elu(x; α) = x if x > 0, else α * (exp(x) - 1).
PyTorch analogy: torch.nn.functional.elu with alpha = α.
Instances For
Derivative of ELU:
elu'(x; α) = 1 if x > 0, else α * exp(x).
Instances For
GELU (approximate): the common tanh-based approximation used in many Transformer codebases.
PyTorch analogy: torch.nn.functional.gelu(x, approximate="tanh").
Instances For
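For reference, the usual tanh-based approximation (the formula behind approximate="tanh" in PyTorch, from Hendrycks and Gimpel) is
gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3))).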
GELU derivative for the tanh-based approximation.
Instances For
Swish / SiLU:
swish(x) = x * sigmoid(x).
PyTorch analogy: torch.nn.functional.silu.
Instances For
Derivative of Swish / SiLU.
Written in terms of sigmoid(x) for the same reason as sigmoidDerivSpec: this is the form
used by AD systems and is convenient to reuse in proofs.
Instances For
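Concretely, differentiating x * σ(x) with the product rule and σ'(x) = σ(x) * (1 - σ(x)) gives the output form
swish'(x) = σ(x) + x * σ(x) * (1 - σ(x)) = σ(x) * (1 + x * (1 - σ(x))).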
Softplus:
softplus(x) = log(1 + exp(x)).
PyTorch analogy: torch.nn.functional.softplus.
Instances For
Derivative of softplus:
softplus'(x) = sigmoid(x).
Instances For
A smooth log surrogate:
safe_log(x; ε) = log(softplus(x) + ε).
We use this when we want something "log-like" without having to carry side conditions about the input being strictly positive.
Instances For
Derivative of safe_log_spec.
Instances For
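For reference, differentiating log(softplus(x) + ε) with softplus'(x) = sigmoid(x) gives
d/dx safe_log(x; ε) = sigmoid(x) / (softplus(x) + ε).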
A smooth absolute value surrogate:
smooth_abs(x; ε) = sqrt(x^2 + ε).
Useful when you want an abs-like shape while keeping differentiability at 0.
Instances For
Derivative of smooth_abs_spec.
Instances For
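For reference, differentiating sqrt(x^2 + ε) gives
d/dx smooth_abs(x; ε) = x / sqrt(x^2 + ε),
which is well-defined at x = 0 for any ε > 0, unlike the derivative of |x|.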
Tensor-level tanh (pointwise).
PyTorch analogy: torch.tanh(t) or torch.nn.functional.tanh(t) applied elementwise.
Instances For
Tensor-level ReLU (pointwise).
Instances For
Tensor-level sigmoid (pointwise).
Instances For
Tensor-level ReLU derivative (pointwise), using the scalar subgradient choice in
Activation.Math.relu_deriv_spec.
Instances For
Tensor-level sigmoid derivative (pointwise).
Instances For
Derivative of sigmoid when the sigmoid output has already been computed.
Recurrent layers save gate activations during the forward pass, so their backward specs should use
this shared helper instead of re-defining s * (1 - s) locally.
Instances For
Tensor-level tanh derivative (pointwise).
Instances For
Proper (last-axis) softmax on tensors #
These are the shape-aware softmax definitions used in attention / classification layers. They recurse over outer dimensions and apply a numerically stable softmax to the last axis.
Softmax on a length-n vector.
This is the "real" softmax, not the scalar logistic helper in Activation.Math.logisticSpec.
Numerical stability:
We implement the standard stabilized form softmax(x) = exp(x - m) / Σ exp(x - m) where
m = max_i x_i. Subtracting the max avoids overflow in typical floating-point backends, and it is
also a nice canonical form to reference in proofs.
Instances For
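The shift by m changes nothing mathematically, since the common factor exp(-m) cancels: exp(x_i - m) / Σ_j exp(x_j - m) = exp(x_i) / Σ_j exp(x_j). As a reference for the computation, here is a sketch over a plain List Float (illustrative only, not the tensor-level definition):

```lean
-- Illustrative sketch over `List Float`; not the TorchLean tensor definition.
def softmaxListSketch (xs : List Float) : List Float :=
  -- max-shift for numerical stability; the fallback for an empty list is irrelevant
  let m := xs.foldl (fun acc x => if x > acc then x else acc) (xs.headD 0.0)
  let exps := xs.map (fun x => Float.exp (x - m))
  let total := exps.foldl (· + ·) 0.0
  exps.map (· / total)
```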
Softmax along the last axis (recurses over outer dimensions).
PyTorch analogy: torch.softmax(x, dim=-1).
For s = .scalar we return 1 (there is only one coordinate). For higher-rank tensors we keep
the outer structure and apply softmax_vec_spec at the last axis.
Instances For
Backward/VJP for last-axis softmax.
If y = softmax(x) and we are given an upstream gradient dL/dy, then for each last-axis slice:
dL/dx = y ⊙ (dL/dy - ⟨dL/dy, y⟩)
This is the standard Jacobian-vector product for softmax, written in a way that avoids materializing
the full n×n Jacobian.
Instances For
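For reference, the expression follows from the softmax Jacobian ∂y_i/∂x_j = y_i * (δ_ij - y_j):
dL/dx_j = Σ_i dL/dy_i * y_i * (δ_ij - y_j) = y_j * dL/dy_j - y_j * Σ_i y_i * dL/dy_i = y_j * (dL/dy_j - ⟨dL/dy, y⟩),
which is exactly the slice-wise formula above.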
Log-softmax on a length-n vector.
Instances For
Log-softmax along the last axis (recurses over outer dimensions).
Instances For
Backward/VJP for last-axis log-softmax.
If y = log_softmax(x), then softmax(x) = exp(y) and the vector-Jacobian product is
dL/dx = dL/dy - softmax(x) * sum(dL/dy).
This is the same formula used by PyTorch's stable log_softmax backward path. We take the
already-computed output y rather than the logits x, so runtime backends can avoid recomputing
the max-shifted forward pass during backprop.
Instances For
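For reference: with y_i = x_i - log Σ_k exp(x_k), the Jacobian is ∂y_i/∂x_j = δ_ij - softmax(x)_j, so
dL/dx_j = Σ_i dL/dy_i * (δ_ij - softmax(x)_j) = dL/dy_j - softmax(x)_j * Σ_i dL/dy_i,
which is the formula above with softmax(x) recovered as exp(y).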
Tensor-level leaky ReLU (pointwise). PyTorch analogy: torch.nn.functional.leaky_relu.
Instances For
Tensor-level derivative of leaky ReLU (pointwise).
Instances For
Tensor-level ELU (pointwise). PyTorch analogy: torch.nn.functional.elu.
Instances For
Tensor-level derivative of ELU (pointwise).
Instances For
Tensor-level GELU (approximate, pointwise). PyTorch analogy: gelu(..., approximate="tanh").
Instances For
Tensor-level derivative of tanh-approx GELU (pointwise).
Instances For
Tensor-level Swish / SiLU (pointwise).
Instances For
Tensor-level derivative of Swish / SiLU (pointwise).
Instances For
Tensor-level softplus (pointwise).
Instances For
Tensor-level derivative of softplus (pointwise).
Instances For
Tensor-level safe_log_spec (pointwise).
Instances For
Tensor-level derivative of safe_log_spec (pointwise).
Instances For
Tensor-level smooth_abs_spec (pointwise).
Instances For
Tensor-level derivative of smooth_abs_spec (pointwise).
Instances For
A generic pointwise activation VJP helper.
Given:
- f' (as a tensor-level derivative function),
- the forward input x,
- and an upstream gradient dL/df(x),
this returns dL/dx by the chain rule:
dL/dx = dL/df(x) ⊙ f'(x).
This matches how most PyTorch elementwise ops behave in backward: multiply upstream gradients by the pointwise derivative mask/value.
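As a reference for the shape of this computation, here is a sketch over plain lists (the name is illustrative, not the tensor-level API):

```lean
-- Illustrative sketch; `pointwiseVJPSketch` is not a TorchLean declaration.
-- dL/dx is the upstream gradient multiplied elementwise by f'(x).
def pointwiseVJPSketch (f' : Float → Float) (x dLdf : List Float) : List Float :=
  List.zipWith (fun xi gi => f' xi * gi) x dLdf
```

With f' being, say, relu's 0/1 subgradient, this reproduces the usual masked-gradient behaviour of elementwise backward passes.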