TorchLean API

NN.Spec.Layers.Activation

Activation functions (spec layer) #

This module is TorchLean's "activation toolbox": pure mathematical definitions of common nonlinearities and their derivatives.

Design intent:

PyTorch mental model:

Notes on scalar polymorphism:

TorchLean tries hard not to bake "Float everywhere" into the spec. All definitions are written against a Context α plus the exact algebra/analysis typeclasses they need. That is what lets the same layer definitions instantiate over multiple scalar backends.

References / analogies (stable entry points):

Scalar activations #

def Activation.Math.reluSpec {α : Type} [Zero α] [Max α] (x : α) :
α

ReLU: relu(x) = max(x, 0).

PyTorch analogy: torch.nn.functional.relu.

This is the simplest nonlinearity we use throughout TorchLean because it stays meaningful across many scalar backends (including ones that do not support exp/log).

Instances For
    def Activation.Math.reluDerivSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] (x : α) :
    α

    A standard subgradient choice for ReLU:

    d/dx relu(x) = 1 if x > 0, and 0 otherwise.

    PyTorch analogy: autograd picks a subgradient at x = 0; our spec commits to a concrete one to make "the derivative" a pure function.

    The DecidableRel (· > ·) constraint reflects that this definition literally branches on x > 0.
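    As a concrete reading of the two definitions above, here is a minimal Float-specialized sketch (the helper names are hypothetical, not part of the TorchLean API); the spec versions are the same expressions written against [Zero α] [Max α] and the DecidableRel (· > ·) constraint.

        -- Hypothetical Float-specialized sketches of reluSpec / reluDerivSpec.
        def reluFloat (x : Float) : Float :=
          if x > 0.0 then x else 0.0    -- same value as `max x 0`

        def reluDerivFloat (x : Float) : Float :=
          if x > 0.0 then 1.0 else 0.0  -- subgradient choice: 0 at x = 0

        #eval reluFloat (-2.0)     -- 0.0
        #eval reluDerivFloat 3.0   -- 1.0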

    Instances For
      def Activation.Math.sigmoidSpec {α : Type} [Context α] (x : α) :
      α

      Logistic sigmoid:

      sigmoid(x) = 1 / (1 + exp(-x)).

      PyTorch analogy: torch.nn.functional.sigmoid (or torch.sigmoid).

      Instances For
        def Activation.Math.sigmoidDerivSpec {α : Type} [Context α] (x : α) :
        α

        Derivative of sigmoid:

        sigmoid'(x) = σ(x) * (1 - σ(x)).

        We write it this way (in terms of σ(x)) because that is the form used in most AD systems and it avoids re-expanding the exponential expression.
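        A minimal Float sketch of the sigmoid pair (hypothetical helper names); note how the derivative reuses the already-computed sigmoid output:

            -- Hypothetical Float sketch of sigmoidSpec / sigmoidDerivSpec.
            def sigmoidFloat (x : Float) : Float :=
              1.0 / (1.0 + Float.exp (-x))

            -- Derivative written in terms of the output: σ(x) * (1 - σ(x)).
            def sigmoidDerivFloat (x : Float) : Float :=
              let s := sigmoidFloat x
              s * (1.0 - s)

            #eval sigmoidFloat 0.0        -- 0.5
            #eval sigmoidDerivFloat 0.0   -- 0.25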

        Instances For
          def Activation.Math.tanhSpec {α : Type} [Context α] (x : α) :
          α

          Hyperbolic tangent: tanh(x). PyTorch analogy: torch.tanh.

          Instances For
            def Activation.Math.tanhDerivSpec {α : Type} [Context α] (x : α) :
            α

            Derivative of tanh:

            tanh'(x) = 1 - tanh(x)^2.
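            A quick finite-difference sanity check of this identity over Float (hypothetical name, not part of the API):

                -- tanh'(x) = 1 - tanh(x)^2, checked against a central difference.
                def tanhDerivFloat (x : Float) : Float :=
                  1.0 - Float.tanh x * Float.tanh x

                #eval tanhDerivFloat 0.7                                            -- ≈ 0.635
                #eval (Float.tanh (0.7 + 1e-6) - Float.tanh (0.7 - 1e-6)) / 2e-6    -- ≈ same value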

            Instances For
              def Activation.Math.leakyReluSpec {α : Type} [Zero α] [Mul α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] (x αₗ : α) :
              α

              Leaky ReLU:

              leaky_relu(x; α) = x if x > 0, else α * x.

              PyTorch analogy: torch.nn.functional.leaky_relu with negative_slope = α.

              Instances For
                def Activation.Math.leakyReluDerivSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] (x αₗ : α) :
                α

                Derivative of leaky ReLU:

                d/dx leaky_relu(x; α) = 1 if x > 0, else α.
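                A Float sketch of the pair, with slope standing in for αₗ / negative_slope (hypothetical names):

                    -- Hypothetical Float sketch of leakyReluSpec / leakyReluDerivSpec.
                    def leakyReluFloat (x slope : Float) : Float :=
                      if x > 0.0 then x else slope * x

                    def leakyReluDerivFloat (x slope : Float) : Float :=
                      if x > 0.0 then 1.0 else slope

                    #eval leakyReluFloat (-2.0) 0.01        -- -0.02
                    #eval leakyReluDerivFloat (-2.0) 0.01   -- 0.01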

                Instances For
                  def Activation.Math.sinhSpec {α : Type} [Context α] (x : α) :
                  α

                  Sinh: sinh(x).

                  Instances For
                    def Activation.Math.sinhDerivSpec {α : Type} [Context α] (x : α) :
                    α

                    Sinh derivative: cosh(x).

                    Instances For
                      def Activation.Math.coshSpec {α : Type} [Context α] (x : α) :
                      α

                      Cosh: cosh(x).

                      Instances For
                        def Activation.Math.coshDerivSpec {α : Type} [Context α] (x : α) :
                        α

                        Cosh derivative: sinh(x).

                        Instances For
                          def Activation.Math.logisticSpec {α : Type} [Context α] (x : α) :
                          α

                          Logistic form written as exp(x)/(exp(x)+1).

                          This is mathematically the same sigmoid function as sigmoidSpec; we keep it as logisticSpec because several scalar approximation proofs reason about this exp(x) numerator form directly.

                          Important naming choice: this is not called scalar softmax. A one-entry softmax is always 1; the real softmax API in TorchLean is the tensor-level Activation.softmaxSpec below.
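                          Since exp(x)/(exp(x)+1) = 1/(1+exp(-x)) algebraically, the two forms agree numerically as well; a quick Float check (hypothetical helper):

                              -- exp(x)/(exp(x)+1) and 1/(1+exp(-x)) compute the same function.
                              def logisticFloat (x : Float) : Float := Float.exp x / (Float.exp x + 1.0)

                              #eval logisticFloat 1.3                -- ≈ 0.7858
                              #eval 1.0 / (1.0 + Float.exp (-1.3))   -- same value (up to rounding)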

                          Instances For
                            def Activation.Math.logisticDerivSpec {α : Type} [Context α] (x : α) :
                            α

                            Derivative of logisticSpec, expressed in output form.

                            Instances For
                              def Activation.Math.eluSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [MathFunctions α] [Sub α] [Mul α] (x alpha : α) :
                              α

                              ELU (Exponential Linear Unit):

                              elu(x; α) = x if x > 0, else α * (exp(x) - 1).

                              PyTorch analogy: torch.nn.functional.elu with alpha = α.

                              Instances For
                                def Activation.Math.eluDerivSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [MathFunctions α] [Mul α] (x alpha : α) :
                                α

                                Derivative of ELU:

                                elu'(x; α) = 1 if x > 0, else α * exp(x).
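                                A Float sketch of the ELU pair (hypothetical names):

                                    -- Hypothetical Float sketch of eluSpec / eluDerivSpec.
                                    def eluFloat (x alpha : Float) : Float :=
                                      if x > 0.0 then x else alpha * (Float.exp x - 1.0)

                                    def eluDerivFloat (x alpha : Float) : Float :=
                                      if x > 0.0 then 1.0 else alpha * Float.exp x

                                    #eval eluFloat (-1.0) 1.0        -- ≈ -0.632
                                    #eval eluDerivFloat (-1.0) 1.0   -- ≈ 0.368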

                                Instances For
                                  def Activation.Math.geluSpec {α : Type} [MathFunctions α] [OfScientific α] [Add α] [Mul α] [Div α] [Sub α] [OfNat α 1] (x : α) :
                                  α

                                  GELU (approximate): the common tanh-based approximation used in many Transformer codebases.

                                  PyTorch analogy: torch.nn.functional.gelu(x, approximate="tanh").
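                                  For reference, the widely published tanh approximation is gelu(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))); the sketch below assumes these standard constants rather than reading them off geluSpec itself:

                                      -- Tanh-based GELU approximation (Float sketch; √(2/π) ≈ 0.7978845608
                                      -- and 0.044715 are the commonly published constants, assumed here).
                                      def geluTanhFloat (x : Float) : Float :=
                                        0.5 * x * (1.0 + Float.tanh (0.7978845608 * (x + 0.044715 * x * x * x)))

                                      #eval geluTanhFloat 1.0   -- ≈ 0.841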

                                  Instances For
                                    def Activation.Math.geluDerivSpec {α : Type} [MathFunctions α] [OfScientific α] [Add α] [Mul α] [Div α] [Sub α] [OfNat α 1] (x : α) :
                                    α

                                    GELU derivative for the tanh-based approximation.

                                    Instances For
                                      def Activation.Math.swishSpec {α : Type} [Context α] (x : α) :
                                      α

                                      Swish / SiLU:

                                      swish(x) = x * sigmoid(x).

                                      PyTorch analogy: torch.nn.functional.silu.

                                      Instances For
                                        def Activation.Math.swishDerivSpec {α : Type} [Context α] (x : α) :
                                        α

                                        Derivative of Swish / SiLU.

                                        Written in terms of sigmoid(x) for the same reason as sigmoidDerivSpec: this is the form used by AD systems and is convenient to reuse in proofs.
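                                          Concretely, d/dx [x · σ(x)] = σ(x) + x · σ(x) · (1 − σ(x)); a Float sketch of that form (hypothetical name):

                                              -- Hypothetical Float sketch of swishDerivSpec, reusing the sigmoid output.
                                              def swishDerivFloat (x : Float) : Float :=
                                                let s := 1.0 / (1.0 + Float.exp (-x))
                                                s + x * s * (1.0 - s)

                                              #eval swishDerivFloat 0.0   -- 0.5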

                                        Instances For
                                          def Activation.Math.softplusSpec {α : Type} [Context α] (x : α) :
                                          α

                                          Softplus:

                                          softplus(x) = log(1 + exp(x)).

                                          PyTorch analogy: torch.nn.functional.softplus.

                                          Instances For
                                            def Activation.Math.softplusDerivSpec {α : Type} [Context α] (x : α) :
                                            α

                                            Derivative of softplus:

                                            softplus'(x) = sigmoid(x).
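                                              A Float sketch of softplus, with a finite-difference check that its slope at 0 is sigmoid(0) = 0.5 (hypothetical name):

                                                  -- Hypothetical Float sketch of softplusSpec; its derivative is exactly sigmoid.
                                                  def softplusFloat (x : Float) : Float := Float.log (1.0 + Float.exp x)

                                                  #eval softplusFloat 0.0                                     -- ≈ 0.6931 (log 2)
                                                  #eval (softplusFloat 1e-6 - softplusFloat (-1e-6)) / 2e-6   -- ≈ 0.5 = sigmoid 0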

                                            Instances For
                                              def Activation.Math.safeLogSpec {α : Type} [Context α] (x : α) (ε : α := Numbers.epsilon) :
                                              α

                                              A smooth log surrogate:

                                              safe_log(x; ε) = log(softplus(x) + ε).

                                              We use this when we want something "log-like" without having to carry side conditions about the input being strictly positive.
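                                               A Float sketch (hypothetical name; the 1e-12 default is a placeholder, whereas the spec defaults to Numbers.epsilon):

                                                   -- Hypothetical Float sketch of safeLogSpec: log-like but defined for all inputs.
                                                   def safeLogFloat (x : Float) (eps : Float := 1e-12) : Float :=
                                                     Float.log (Float.log (1.0 + Float.exp x) + eps)

                                                   #eval safeLogFloat (-5.0)   -- finite, whereas log of a negative number is NaN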

                                              Instances For
                                                def Activation.Math.safeLogDerivSpec {α : Type} [Context α] (x : α) (ε : α := Numbers.epsilon) :
                                                α

                                                Derivative of safe_log_spec.

                                                Instances For
                                                  def Activation.Math.smoothAbsSpec {α : Type} [Context α] (x : α) (ε : α := Numbers.epsilon) :
                                                  α

                                                  A smooth absolute value surrogate:

                                                  smooth_abs(x; ε) = sqrt(x^2 + ε).

                                                  Useful when you want an abs-like shape but keep differentiability at 0.

                                                  Instances For
                                                    def Activation.Math.smoothAbsDerivSpec {α : Type} [Context α] (x : α) (ε : α := Numbers.epsilon) :
                                                    α

                                                    Derivative of smooth_abs_spec.
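                                                     By the chain rule, d/dx sqrt(x² + ε) = x / sqrt(x² + ε), which is finite and 0 at the origin; a Float sketch (hypothetical names, placeholder ε):

                                                         -- Hypothetical Float sketch of smoothAbsSpec / smoothAbsDerivSpec.
                                                         def smoothAbsFloat (x : Float) (eps : Float := 1e-12) : Float :=
                                                           Float.sqrt (x * x + eps)

                                                         def smoothAbsDerivFloat (x : Float) (eps : Float := 1e-12) : Float :=
                                                           x / Float.sqrt (x * x + eps)

                                                         #eval smoothAbsDerivFloat 0.0   -- 0.0 (no kink at the origin)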

                                                    Instances For
                                                      def Activation.tanhSpec {α : Type} [Context α] {s : Spec.Shape} :
Spec.Tensor α s → Spec.Tensor α s

                                                      Tensor-level tanh (pointwise).

                                                      PyTorch analogy: torch.tanh(t) or torch.nn.functional.tanh(t) applied elementwise.

                                                      Instances For
def Activation.reluSpec {α : Type} [Zero α] [Max α] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                        Tensor-level ReLU (pointwise).

                                                        Instances For
def Activation.sigmoidSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                          Tensor-level sigmoid (pointwise).

                                                          Instances For
def Activation.reluDerivSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                            Tensor-level ReLU derivative (pointwise), using the scalar subgradient choice in Activation.Math.relu_deriv_spec.

                                                            Instances For

                                                              Tensor-level sigmoid derivative (pointwise).

                                                              Instances For
def Activation.sigmoidOutputDerivSpec {α : Type} [Context α] {s : Spec.Shape} (sigmoidOutput : Spec.Tensor α s) :
Spec.Tensor α s

                                                                Derivative of sigmoid when the sigmoid output has already been computed.

                                                                Recurrent layers save gate activations during the forward pass, so their backward specs should use this shared helper instead of re-defining s * (1 - s) locally.

                                                                Instances For

                                                                  Tensor-level tanh derivative (pointwise).

                                                                  Instances For

                                                                    Proper (last‑axis) softmax on tensors #

                                                                    These are the shape‑aware softmax definitions used in attention / classification layers. They recurse over outer dimensions and apply a numerically‑stable softmax to the last axis.

                                                                    Softmax on a length-n vector.

                                                                    This is the "real" softmax, not the scalar logistic helper in Activation.Math.logisticSpec.

                                                                    Numerical stability:

                                                                    We implement the standard stabilized form softmax(x) = exp(x - m) / Σ exp(x - m) where m = max_i x_i. Subtracting the max avoids overflow in typical floating-point backends, and it is also a nice canonical form to reference in proofs.
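                                                                    A sketch of the max-subtracted form on a plain List Float (the real softmax_vec_spec works on length-indexed last-axis slices, but the arithmetic is the same):

                                                                        -- Max-subtracted softmax on a plain List Float.
                                                                        def softmaxList (xs : List Float) : List Float :=
                                                                          match xs with
                                                                          | [] => []
                                                                          | x :: rest =>
                                                                            let m := rest.foldl (fun a b => if a < b then b else a) x
                                                                            let es := (x :: rest).map (fun v => Float.exp (v - m))
                                                                            let z := es.foldl (· + ·) 0.0
                                                                            es.map (· / z)

                                                                        #eval softmaxList [1.0, 2.0, 3.0]   -- ≈ [0.0900, 0.2447, 0.6652]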

                                                                    Instances For

                                                                      Softmax along the last axis (recurses over outer dimensions).

                                                                      PyTorch analogy: torch.softmax(x, dim=-1).

                                                                      For s = .scalar we return 1 (there is only one coordinate). For higher-rank tensors we keep the outer structure and apply softmax_vec_spec at the last axis.

                                                                      Instances For

                                                                        Backward/VJP for last-axis softmax.

                                                                        If y = softmax(x) and we are given an upstream gradient dL/dy, then for each last-axis slice:

                                                                        dL/dx = y ⊙ (dL/dy - ⟨dL/dy, y⟩)

                                                                        This is the standard Jacobian-vector product for softmax, written in a way that avoids materializing the full n×n Jacobian.
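                                                                        A List Float sketch of that product (hypothetical name):

                                                                            -- Last-axis softmax VJP: dx = y ⊙ (dy - ⟨dy, y⟩), no n×n Jacobian needed.
                                                                            def softmaxVJP (y dy : List Float) : List Float :=
                                                                              let dot := (List.zipWith (· * ·) dy y).foldl (· + ·) 0.0
                                                                              List.zipWith (fun yi dyi => yi * (dyi - dot)) y dy

                                                                            #eval softmaxVJP [0.0900, 0.2447, 0.6652] [1.0, 0.0, 0.0]
                                                                            -- the entries sum to ≈ 0, as expected for a softmax gradient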

                                                                        Instances For

                                                                          Log-softmax on a length-n vector.

                                                                          Instances For

                                                                            Log-softmax along the last axis (recurses over outer dimensions).

                                                                            Instances For

                                                                              Backward/VJP for last-axis log-softmax.

                                                                              If y = log_softmax(x), then softmax(x) = exp(y) and the vector-Jacobian product is

                                                                              dL/dx = dL/dy - softmax(x) * sum(dL/dy).

                                                                              This is the same formula used by PyTorch's stable log_softmax backward path. We take the already-computed output y rather than the logits x, so runtime backends can avoid recomputing the max-shifted forward pass during backprop.
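                                                                              A List Float sketch of this rule, taking the log-softmax output y directly (hypothetical name):

                                                                                  -- Log-softmax VJP: dx = dy - exp(y) * sum(dy), with y = log_softmax(x).
                                                                                  def logSoftmaxVJP (y dy : List Float) : List Float :=
                                                                                    let s := dy.foldl (· + ·) 0.0
                                                                                    List.zipWith (fun yi dyi => dyi - Float.exp yi * s) y dy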

                                                                              Instances For
def Activation.leakyReluSpec {α : Type} [Zero α] [Mul α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (t : Spec.Tensor α s) (αₗ : α) :
Spec.Tensor α s

                                                                                Tensor-level leaky ReLU (pointwise). PyTorch analogy: torch.nn.functional.leaky_relu.

                                                                                Instances For
def Activation.leakyReluDerivSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {s : Spec.Shape} (t : Spec.Tensor α s) (αₗ : α) :
Spec.Tensor α s

                                                                                  Tensor-level derivative of leaky ReLU (pointwise).

                                                                                  Instances For
def Activation.eluSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [MathFunctions α] [Sub α] [Mul α] {s : Spec.Shape} (t : Spec.Tensor α s) (alpha : α) :
Spec.Tensor α s

                                                                                    Tensor-level ELU (pointwise). PyTorch analogy: torch.nn.functional.elu.

                                                                                    Instances For
def Activation.eluDerivSpec {α : Type} [Zero α] [One α] [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] [MathFunctions α] [Mul α] {s : Spec.Shape} (t : Spec.Tensor α s) (alpha : α) :
Spec.Tensor α s

                                                                                      Tensor-level derivative of ELU (pointwise).

                                                                                      Instances For
def Activation.geluSpec {α : Type} [MathFunctions α] [OfScientific α] [Add α] [Mul α] [Div α] [Sub α] [OfNat α 1] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                                                        Tensor-level GELU (approximate, pointwise). PyTorch analogy: gelu(..., approximate="tanh").

                                                                                        Instances For
def Activation.geluDerivSpec {α : Type} [MathFunctions α] [OfScientific α] [Add α] [Mul α] [Div α] [Sub α] [OfNat α 1] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                                                          Tensor-level derivative of tanh-approx GELU (pointwise).

                                                                                          Instances For
def Activation.swishSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                                                            Tensor-level Swish / SiLU (pointwise).

                                                                                            Instances For

                                                                                              Tensor-level derivative of Swish / SiLU (pointwise).

                                                                                              Instances For
def Activation.softplusSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) :
Spec.Tensor α s

                                                                                                Tensor-level softplus (pointwise).

                                                                                                Instances For

                                                                                                  Tensor-level derivative of softplus (pointwise).

                                                                                                  Instances For
def Activation.safeLogSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) (ε : α := Numbers.epsilon) :
Spec.Tensor α s

                                                                                                    Tensor-level safe_log_spec (pointwise).

                                                                                                    Instances For
def Activation.safeLogDerivSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) (ε : α := Numbers.epsilon) :
Spec.Tensor α s

                                                                                                      Tensor-level derivative of safe_log_spec (pointwise).

                                                                                                      Instances For
def Activation.smoothAbsSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) (ε : α := Numbers.epsilon) :
Spec.Tensor α s

                                                                                                        Tensor-level smooth_abs_spec (pointwise).

                                                                                                        Instances For
def Activation.smoothAbsDerivSpec {α : Type} [Context α] {s : Spec.Shape} (t : Spec.Tensor α s) (ε : α := Numbers.epsilon) :
Spec.Tensor α s

                                                                                                          Tensor-level derivative of smooth_abs_spec (pointwise).

                                                                                                          Instances For
def Activation.activationGradientSpec {α : Type} [Context α] {s : Spec.Shape} (activation_deriv : Spec.Tensor α s → Spec.Tensor α s) (input grad_output : Spec.Tensor α s) :
Spec.Tensor α s

                                                                                                            A generic pointwise activation VJP helper.

                                                                                                            Given:

                                                                                                            • f' (as a tensor-level derivative function),
                                                                                                            • the forward input x,
                                                                                                            • and an upstream gradient dL/df(x),

                                                                                                            this returns dL/dx by the chain rule:

                                                                                                            dL/dx = dL/df(x) ⊙ f'(x).

                                                                                                            This matches how most PyTorch elementwise ops behave in backward: multiply upstream gradients by the pointwise derivative mask/value.
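                                                                                                            A List Float sketch of the same chain rule, with deriv standing in for the activation_deriv argument (hypothetical name):

                                                                                                                -- Generic pointwise VJP: dL/dx = dL/df(x) ⊙ f'(x).
                                                                                                                def pointwiseVJP (deriv : Float → Float) (input gradOut : List Float) : List Float :=
                                                                                                                  List.zipWith (fun x g => g * deriv x) input gradOut

                                                                                                                #eval pointwiseVJP (fun x => if x > 0.0 then 1.0 else 0.0) [(-1.0), 2.0] [10.0, 10.0]
                                                                                                                -- [0.0, 10.0]: the ReLU mask zeroes the gradient where the input was negative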

                                                                                                            Instances For