TorchLean API

NN.Spec.Layers.PositionalEncoding

Positional encodings (spec layer) #

This file provides the simplest positional encoding definition: a learnable per-position embedding that is added to token embeddings.

PyTorch analogy: a trainable nn.Parameter of shape (seq_len, embed_dim) (or equivalently an nn.Embedding indexed by position) whose value is added to the input embeddings.

Learnable positional encodings show up a lot in practice (BERT-, GPT-2-, and ViT-style models all use them) because they are simple to implement and let the model fit arbitrary per-position offsets instead of committing to a fixed functional form.

If you want sinusoidal encodings (Transformer) or RoPE/rotary encodings, those can be defined as pure functions that produce a tensor of shape (seqLen, embedDim) and then reused with the same add_positional_encoding_spec below.
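For illustration, a minimal sketch of that reuse pattern. The field name `params` is an assumption for the structure's single tensor field; the actual name in this file may differ:

```lean
-- Sketch only: wrap the pure sinusoidal table in a PositionalEncodingSpec
-- and reuse the generic addition. `params` is an assumed field name.
def addFixedSinusoidal {α : Type} [Context α] {seqLen embedDim : ℕ}
    (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar)) :=
  let pe : Spec.PositionalEncodingSpec seqLen embedDim α :=
    { params := Spec.sinusoidalPositionalEncodingSpec seqLen embedDim }
  Spec.addPositionalEncodingSpec pe x
```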

Reference (sinusoidal): "Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762

structure Spec.PositionalEncodingSpec (seqLen embedDim : ℕ) (α : Type) : Type

Learnable positional encoding parameters for a fixed (seqLen, embedDim).

This is intentionally just a tensor of trainable parameters. Higher-level models decide how to initialize it and whether to share/resize it across different sequence lengths.

def Spec.addPositionalEncodingSpec {α : Type} [Context α] {seqLen embedDim : ℕ} (pe : PositionalEncodingSpec seqLen embedDim α) (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

Add positional encodings: y = x + pos.

Both x and pos have the same shape, so this is just elementwise addition (no broadcasting).
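A usage sketch, relying only on the signature above (pe and embed are assumed inputs):

```lean
-- Adding learned positional encodings to a (seqLen, embedDim) embedding matrix.
example {α : Type} [Context α] {seqLen embedDim : ℕ}
    (pe : Spec.PositionalEncodingSpec seqLen embedDim α)
    (embed : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar)) :=
  Spec.addPositionalEncodingSpec pe embed
```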


Gradients #

Learnable positional encoding is just an elementwise addition:

y = x + pos

So the adjoint is just:

grad_x = grad_output and grad_pos = grad_output

This is trivial, but having it as a named spec makes higher-level models (e.g. ViT) easier to wire up without re-deriving the same one-liner everywhere.
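A sketch of what that one-liner amounts to, with types taken from the signatures in this file:

```lean
-- Sketch: the VJP of y = x + pos passes the upstream gradient
-- unchanged to both summands.
def addPEBackwardSketch {α : Type} {seqLen embedDim : ℕ}
    (g : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))
      × Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar)) :=
  (g, g)
```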

def Spec.addPositionalEncodingBackwardSpec {α : Type} {seqLen embedDim : ℕ} (_pe : PositionalEncodingSpec seqLen embedDim α) (grad_output : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar)) × Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

Backward/VJP for add_positional_encoding_spec.


Sinusoidal positional encodings (pure functions) #

These are the classic Transformer sinusoidal encodings from:

Vaswani et al. (2017), "Attention Is All You Need".

We implement them as pure tensor generators so they can be reused in multiple model specs without adding new trainable parameters.

def Spec.posencFreqDenomSpec {α : Type} [Context α] (i d : ℕ) :
    α

Frequency denominator used by both sinusoidal PE and RoPE:

denom(i; d) = 10000^(2*i / d)

implemented as:

exp(log(10000) * (2*i / d))

This keeps the definition in terms of the scalar Context operations.
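A sketch of that computation in terms of hypothetical Context operation names (exp, log, mul, div, and ofNat are assumptions; the actual Context class may spell these differently):

```lean
-- Hypothetical sketch of denom(i; d) = exp(log 10000 * (2*i / d)).
-- All Context.* names below are assumed, not taken from this file.
def posencFreqDenomSketch {α : Type} [Context α] (i d : ℕ) : α :=
  Context.exp
    (Context.mul (Context.log (Context.ofNat 10000))
      (Context.div (Context.ofNat (2 * i)) (Context.ofNat d)))
```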

def Spec.posencAngleSpec {α : Type} [Context α] (pos i d : ℕ) :
    α

Common angle used by sinusoidal PE and RoPE:

θ(pos, i; d) = pos / 10000^(2*i/d).
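Equivalently, in terms of posencFreqDenomSpec above (same hypothetical Context names as in the previous sketch):

```lean
-- Hypothetical sketch: θ(pos, i; d) = pos / denom(i; d).
def posencAngleSketch {α : Type} [Context α] (pos i d : ℕ) : α :=
  Context.div (Context.ofNat pos) (Spec.posencFreqDenomSpec i d)
```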

def Spec.sinusoidalPositionalEncodingSpec {α : Type} [Context α] (seqLen embedDim : ℕ) (startPos : ℕ := 0) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

Pure sinusoidal positional encoding tensor with shape (seqLen, embedDim).

Definition (Transformer):

• PE[pos, 2i] = sin(θ(pos, i; embedDim))
• PE[pos, 2i+1] = cos(θ(pos, i; embedDim))

startPos is an offset for the absolute positions; use it when generating a chunk of positions for cached decoding (e.g. tokens startPos .. startPos+seqLen-1).

This definition is total for all seqLen/embedDim:

• if embedDim = 0, the inner dimension is empty, so no scalar computations are observed.
• if embedDim is odd, the last column uses the same i = floor(j/2) convention as usual (see the per-entry sketch below).
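Concretely, the per-entry rule for column j at absolute position pos, as a sketch (Context.sin and Context.cos are assumed operation names):

```lean
-- Sketch: entry (pos, j) of the table. Columns pair up as i = j / 2;
-- even columns take sin, odd columns take cos of the same angle.
def sinusoidalEntrySketch {α : Type} [Context α] (embedDim pos j : ℕ) : α :=
  let θ := Spec.posencAngleSpec pos (j / 2) embedDim
  if j % 2 == 0 then Context.sin θ else Context.cos θ
```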
def Spec.addSinusoidalPositionalEncodingSpec {α : Type} [Context α] {seqLen embedDim : ℕ} (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) (startPos : ℕ := 0) :
    Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

Add sinusoidal positional encodings: y = x + sinusoidal(startPos, seqLen, embedDim).
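A usage sketch for cached decoding, encoding a single-token chunk at an absolute position (cachePos is an assumed input):

```lean
-- One decoding step: the new token sits at absolute position cachePos.
example {α : Type} [Context α] {embedDim : ℕ}
    (x : Tensor α (Shape.dim 1 (Shape.dim embedDim Shape.scalar)))
    (cachePos : ℕ) :
    Tensor α (Shape.dim 1 (Shape.dim embedDim Shape.scalar)) :=
  Spec.addSinusoidalPositionalEncodingSpec x (startPos := cachePos)
```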


Rotary positional embeddings (RoPE) utilities (pure functions) #

RoPE (Su et al., 2021) encodes position by applying a 2D rotation to each pair of features in the last dimension.

In most transformer implementations, RoPE is applied to the query/key head vectors before attention scores are computed (q ↦ rope(q), k ↦ rope(k)); values are left untouched.

This file intentionally provides only pure RoPE helpers; we do not integrate them into attention yet.

Reference: Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding": https://arxiv.org/abs/2104.09864
def Spec.ropeRotatePairsLastdimSpec {α : Type} [Context α] {headDim : ℕ} (x : Tensor α (Shape.dim headDim Shape.scalar)) :
    Tensor α (Shape.dim headDim Shape.scalar)

Rotate pairs on the last dimension:

(x0, x1, x2, x3, ...) ↦ (-x1, x0, -x3, x2, ...).

This corresponds to multiplying each 2-vector (x_even, x_odd) by the matrix: [[0, -1], [1, 0]].

Design note:

• Standard RoPE assumes headDim is even.
• This spec function is total: if headDim is odd, the last (unpaired) entry is left unchanged.
def Spec.ropeCosLastdimSpec {α : Type} [Context α] (pos headDim : ℕ) :
    Tensor α (Shape.dim headDim Shape.scalar)

Broadcast RoPE cos(θ) factors to a full (headDim) vector for one position.

def Spec.ropeSinLastdimSpec {α : Type} [Context α] (pos headDim : ℕ) :
    Tensor α (Shape.dim headDim Shape.scalar)

Broadcast RoPE sin(θ) factors to a full (headDim) vector for one position.

def Spec.ropeApplySpec {α : Type} [Context α] {seqLen headDim : ℕ} (x : Tensor α (Shape.dim seqLen (Shape.dim headDim Shape.scalar))) (startPos : ℕ := 0) :
    Tensor α (Shape.dim seqLen (Shape.dim headDim Shape.scalar))

Apply RoPE to a single head matrix x : (seqLen, headDim).

Implementation matches the standard identity:

rope(x) = x * cos + rotatePairs(x) * sin

where cos and sin are position-dependent vectors broadcast across the last dimension.

startPos is an absolute-position offset (useful for KV-cache decoding).
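Why the identity holds, spelled out for one feature pair (standard RoPE algebra, not specific to this file):

$$
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
=
\cos\theta \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
+ \sin\theta \begin{pmatrix} -x_{2i+1} \\ x_{2i} \end{pmatrix}
$$

The second summand is exactly rotatePairs applied to the pair, so rotating every pair by its angle θ(pos, i; headDim) is the same as x * cos + rotatePairs(x) * sin.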

def Spec.ropeApplyHeadsSpec {α : Type} [Context α] {numHeads seqLen headDim : ℕ} (x : Tensor α (Shape.dim numHeads (Shape.dim seqLen (Shape.dim headDim Shape.scalar)))) (startPos : ℕ := 0) :
    Tensor α (Shape.dim numHeads (Shape.dim seqLen (Shape.dim headDim Shape.scalar)))

Apply RoPE to (numHeads, seqLen, headDim) by applying rope_apply_spec independently per head.
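A usage sketch of the typical attention wiring (q and k are assumed inputs; this file does not itself integrate RoPE into attention):

```lean
-- Rotate query and key heads at the same absolute offset; values untouched.
example {α : Type} [Context α] {numHeads seqLen headDim : ℕ}
    (q k : Tensor α (Shape.dim numHeads (Shape.dim seqLen (Shape.dim headDim Shape.scalar))))
    (startPos : ℕ) :
    Tensor α (Shape.dim numHeads (Shape.dim seqLen (Shape.dim headDim Shape.scalar)))
      × Tensor α (Shape.dim numHeads (Shape.dim seqLen (Shape.dim headDim Shape.scalar))) :=
  (Spec.ropeApplyHeadsSpec q startPos, Spec.ropeApplyHeadsSpec k startPos)
```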
