Activation functions (spec layer) #
This module is TorchLean's "activation toolbox": pure mathematical definitions of common nonlinearities and their derivatives.
Design intent:
- Scalar definitions live under Activation.Math (functions α → α).
- Tensor-level definitions are almost always the scalar function mapped pointwise via map_spec.
- Where the math is inherently non-pointwise (notably softmax), we provide a shape-aware implementation plus an explicit backward/VJP.
PyTorch mental model:
- Scalar Activation.Math.* corresponds to the formulas behind torch.nn.functional.*.
- Tensor-level Activation.*_spec corresponds to applying that nonlinearity elementwise.
- softmaxSpec here is the real last-axis softmax on tensors (like torch.softmax(x, dim=-1)), implemented recursively over outer dimensions.
Notes on scalar polymorphism:
TorchLean tries hard not to bake "Float everywhere" into the spec. All definitions are written
against a Context α plus the exact algebra/analysis typeclasses they need. That is what lets
the same layer definitions instantiate over:
- Float for fast runtime execution,
- exact/reasoning scalars for proofs,
- interval-like scalars for verification.
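To make the polymorphism concrete, here is a minimal sketch of the style (the name reluSketch and the exact typeclass bundle are illustrative, not the actual TorchLean signatures): a scalar ReLU that asks only for the structure it actually uses, so the same definition elaborates at Float, at exact scalars, or at interval-like types.

```lean
-- Illustrative sketch only; `reluSketch` is not a TorchLean declaration.
-- The definition requests exactly the structure it needs (a `max` and a zero),
-- so it can be instantiated at `Float`, exact scalars, or interval types alike.
def reluSketch {α : Type} [Max α] [OfNat α 0] (x : α) : α :=
  max x 0
```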
References / analogies (stable entry points):
- PyTorch activations: https://pytorch.org/docs/stable/nn.functional.html
- PyTorch torch.softmax: https://pytorch.org/docs/stable/generated/torch.softmax.html
- ReLU: Vinod Nair and Geoffrey Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines" (ICML 2010)
- ELU: Djork-Arne Clevert et al., "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)" (ICLR 2016)
- GELU: Dan Hendrycks and Kevin Gimpel, "Gaussian Error Linear Units (GELUs)" (arXiv:1606.08415)
- Swish / SiLU: Prajit Ramachandran et al., "Searching for Activation Functions" (arXiv:1710.05941)
Scalar activations #
ReLU: relu(x) = max(x, 0).
PyTorch analogy: torch.nn.functional.relu.
This is the simplest nonlinearity we use throughout TorchLean because it stays meaningful across
many scalar backends (including ones that do not support exp/log).
Instances For
A standard subgradient choice for ReLU:
d/dx relu(x) = 1 if x > 0, and 0 otherwise.
PyTorch analogy: autograd picks a subgradient at x = 0; our spec commits to a concrete one to
make "the derivative" a pure function.
The DecidableRel (· > ·) constraint reflects that this definition literally branches on x > 0.
Instances For
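As a sketch of what that constraint buys (again with an illustrative name, not the actual declaration), the definition has the shape of a decidable branch on x > 0:

```lean
-- Illustrative sketch only; `reluDerivSketch` is not a TorchLean declaration.
-- `DecidableRel (· > ·)` is precisely what lets `if x > 0 then _ else _`
-- elaborate for a generic scalar type `α`.
def reluDerivSketch {α : Type} [LT α] [OfNat α 0] [OfNat α 1]
    [DecidableRel ((· > ·) : α → α → Prop)] (x : α) : α :=
  if x > 0 then 1 else 0
```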
Logistic sigmoid:
sigmoid(x) = 1 / (1 + exp(-x)).
PyTorch analogy: torch.nn.functional.sigmoid (or torch.sigmoid).
Instances For
Derivative of sigmoid:
sigmoid'(x) = σ(x) * (1 - σ(x)).
We write it this way (in terms of σ(x)) because that is the form used in most AD systems and it
avoids re-expanding the exponential expression.
Instances For
Hyperbolic tangent: tanh(x). PyTorch analogy: torch.tanh.
Instances For
Derivative of tanh:
tanh'(x) = 1 - tanh(x)^2.
Instances For
Leaky ReLU:
leaky_relu(x; α) = x if x > 0, else α * x.
PyTorch analogy: torch.nn.functional.leaky_relu with negative_slope = α.
Instances For
Derivative of leaky ReLU:
d/dx leaky_relu(x; α) = 1 if x > 0, else α.
Instances For
Sinh derivative: cosh(x).
Instances For
Cosh derivative: sinh(x).
Instances For
Logistic form written as exp(x)/(exp(x)+1).
This is mathematically the same sigmoid function as sigmoidSpec; we keep it as logisticSpec
because several scalar approximation proofs reason about this exp(x) numerator form directly.
Important naming choice: this is not called scalar softmax. A one-entry softmax is always 1;
the real softmax API in TorchLean is the tensor-level Activation.softmaxSpec below.
Instances For
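For reference, the two forms agree by multiplying numerator and denominator by exp(-x):
exp(x) / (exp(x) + 1) = 1 / (1 + exp(-x)) = sigmoid(x).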
Derivative of logisticSpec, expressed in output form.
Instances For
ELU (Exponential Linear Unit):
elu(x; α) = x if x > 0, else α * (exp(x) - 1).
PyTorch analogy: torch.nn.functional.elu with alpha = α.
Instances For
Derivative of ELU:
elu'(x; α) = 1 if x > 0, else α * exp(x).
Instances For
GELU (approximate): the common tanh-based approximation used in many Transformer codebases.
PyTorch analogy: torch.nn.functional.gelu(x, approximate="tanh").
Instances For
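For reference, the usual tanh-based approximation (the formula behind approximate="tanh" in PyTorch, from Hendrycks and Gimpel) is
gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3))).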
GELU derivative for the tanh-based approximation.
Instances For
Swish / SiLU:
swish(x) = x * sigmoid(x).
PyTorch analogy: torch.nn.functional.silu.
Instances For
Derivative of Swish / SiLU.
Written in terms of sigmoid(x) for the same reason as sigmoidDerivSpec: this is the form
used by AD systems and is convenient to reuse in proofs.
Instances For
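Concretely, differentiating x * σ(x) with the product rule and σ'(x) = σ(x) * (1 - σ(x)) gives the output form
swish'(x) = σ(x) + x * σ(x) * (1 - σ(x)) = σ(x) * (1 + x * (1 - σ(x))).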
Softplus:
softplus(x) = log(1 + exp(x)).
PyTorch analogy: torch.nn.functional.softplus.
Instances For
Derivative of softplus:
softplus'(x) = sigmoid(x).
Instances For
A smooth log surrogate:
safe_log(x; ε) = log(softplus(x) + ε).
We use this when we want something "log-like" without having to carry side conditions about the input being strictly positive.
Instances For
Derivative of safe_log_spec.
Instances For
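For reference, differentiating log(softplus(x) + ε) with softplus'(x) = sigmoid(x) gives
d/dx safe_log(x; ε) = sigmoid(x) / (softplus(x) + ε).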
A smooth absolute value surrogate:
smooth_abs(x; ε) = sqrt(x^2 + ε).
Useful when you want an abs-like shape while keeping differentiability at 0.
Instances For
Derivative of smooth_abs_spec.
Instances For
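For reference, differentiating sqrt(x^2 + ε) gives
d/dx smooth_abs(x; ε) = x / sqrt(x^2 + ε),
which is well-defined at x = 0 for any ε > 0, unlike the derivative of |x|.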
Tensor-level tanh (pointwise).
PyTorch analogy: torch.tanh(t) or torch.nn.functional.tanh(t) applied elementwise.
Instances For
Tensor-level ReLU (pointwise).
Instances For
Tensor-level sigmoid (pointwise).
Instances For
Tensor-level ReLU derivative (pointwise), using the scalar subgradient choice in
Activation.Math.relu_deriv_spec.
Instances For
Tensor-level sigmoid derivative (pointwise).
Instances For
Derivative of sigmoid when the sigmoid output has already been computed.
Recurrent layers save gate activations during the forward pass, so their backward specs should use
this shared helper instead of re-defining s * (1 - s) locally.
Instances For
Tensor-level tanh derivative (pointwise).
Instances For
Proper (last-axis) softmax on tensors #
These are the shape-aware softmax definitions used in attention / classification layers. They recurse over outer dimensions and apply a numerically stable softmax to the last axis.
Softmax on a length-n vector.
This is the "real" softmax, not the scalar logistic helper in Activation.Math.logisticSpec.
Numerical stability:
We implement the standard stabilized form softmax(x) = exp(x - m) / Σ exp(x - m) where
m = max_i x_i. Subtracting the max avoids overflow in typical floating-point backends, and it is
also a nice canonical form to reference in proofs.
Instances For
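The shift by m changes nothing mathematically, since the common factor exp(-m) cancels: exp(x_i - m) / Σ_j exp(x_j - m) = exp(x_i) / Σ_j exp(x_j). As a reference for the computation, here is a sketch over a plain List Float (illustrative only, not the tensor-level definition):

```lean
-- Illustrative sketch over `List Float`; not the TorchLean tensor definition.
def softmaxListSketch (xs : List Float) : List Float :=
  -- max-shift for numerical stability; the fallback for an empty list is irrelevant
  let m := xs.foldl (fun acc x => if x > acc then x else acc) (xs.headD 0.0)
  let exps := xs.map (fun x => Float.exp (x - m))
  let total := exps.foldl (· + ·) 0.0
  exps.map (· / total)
```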
Softmax along the last axis (recurses over outer dimensions).
PyTorch analogy: torch.softmax(x, dim=-1).
For s = .scalar we return 1 (there is only one coordinate). For higher-rank tensors we keep
the outer structure and apply softmax_vec_spec at the last axis.
Instances For
Backward/VJP for last-axis softmax.
If y = softmax(x) and we are given an upstream gradient dL/dy, then for each last-axis slice:
dL/dx = y ⊙ (dL/dy - ⟨dL/dy, y⟩)
This is the standard Jacobian-vector product for softmax, written in a way that avoids materializing
the full n×n Jacobian.
Instances For
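For reference, the expression follows from the softmax Jacobian ∂y_i/∂x_j = y_i * (δ_ij - y_j):
dL/dx_j = Σ_i dL/dy_i * y_i * (δ_ij - y_j) = y_j * dL/dy_j - y_j * Σ_i y_i * dL/dy_i = y_j * (dL/dy_j - ⟨dL/dy, y⟩),
which is exactly the slice-wise formula above.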
Log-softmax on a length-n vector.
Instances For
Log-softmax along the last axis (recurses over outer dimensions).
Instances For
Backward/VJP for last-axis log-softmax.
If y = log_softmax(x), then softmax(x) = exp(y) and the vector-Jacobian product is
dL/dx = dL/dy - softmax(x) * sum(dL/dy).
This is the same formula used by PyTorch's stable log_softmax backward path. We take the
already-computed output y rather than the logits x, so runtime backends can avoid recomputing
the max-shifted forward pass during backprop.
Instances For
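For reference: with y_i = x_i - log Σ_k exp(x_k), the Jacobian is ∂y_i/∂x_j = δ_ij - softmax(x)_j, so
dL/dx_j = Σ_i dL/dy_i * (δ_ij - softmax(x)_j) = dL/dy_j - softmax(x)_j * Σ_i dL/dy_i,
which is the formula above with softmax(x) recovered as exp(y).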
Tensor-level leaky ReLU (pointwise). PyTorch analogy: torch.nn.functional.leaky_relu.
Instances For
Tensor-level derivative of leaky ReLU (pointwise).
Instances For
Tensor-level ELU (pointwise). PyTorch analogy: torch.nn.functional.elu.
Instances For
Tensor-level derivative of ELU (pointwise).
Instances For
Tensor-level GELU (approximate, pointwise). PyTorch analogy: gelu(..., approximate="tanh").
Instances For
Tensor-level derivative of tanh-approx GELU (pointwise).
Instances For
Tensor-level Swish / SiLU (pointwise).
Instances For
Tensor-level derivative of Swish / SiLU (pointwise).
Instances For
Tensor-level softplus (pointwise).
Instances For
Tensor-level derivative of softplus (pointwise).
Instances For
Tensor-level safe_log_spec (pointwise).
Instances For
Tensor-level derivative of safe_log_spec (pointwise).
Instances For
Tensor-level smooth_abs_spec (pointwise).
Instances For
Tensor-level derivative of smooth_abs_spec (pointwise).
Instances For
A generic pointwise activation VJP helper.
Given:
- f' (as a tensor-level derivative function),
- the forward input x,
- and an upstream gradient dL/df(x),
this returns dL/dx by the chain rule:
dL/dx = dL/df(x) ⊙ f'(x).
This matches how most PyTorch elementwise ops behave in backward: multiply upstream gradients by the pointwise derivative mask/value.
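As a reference for the shape of this computation, here is a sketch over plain lists (the name is illustrative, not the tensor-level API):

```lean
-- Illustrative sketch; `pointwiseVJPSketch` is not a TorchLean declaration.
-- dL/dx is the upstream gradient multiplied elementwise by f'(x).
def pointwiseVJPSketch (f' : Float → Float) (x dLdf : List Float) : List Float :=
  List.zipWith (fun xi gi => f' xi * gi) x dLdf
```

With f' being, say, relu's 0/1 subgradient, this reproduces the usual masked-gradient behaviour of elementwise backward passes.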