Positional encodings (spec layer) #
This file provides the simplest positional encoding definition: a learnable per-position embedding that is added to token embeddings.
PyTorch analogy:
- This is the same idea as having an nn.Embedding(max_len, d_model) (or a parameter tensor of shape [max_len, d_model]) and computing x + pos[:seqLen], as sketched below.
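For concreteness, here is a minimal PyTorch-style sketch of that analogy; the module name and the max_len parameter are illustrative, not part of this spec:

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Illustrative PyTorch analogue: one trainable (max_len, d_model) table."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # Trainable per-position embeddings; initialization is left to the caller,
        # mirroring how the spec leaves it to higher-level models.
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model) with seq_len <= max_len
        seq_len = x.shape[0]
        return x + self.pos[:seq_len]
```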
Why learnable positional encodings show up a lot in practice:
- they are easy to train and tend to work well for fixed-length settings (e.g. ViT with a chosen patch grid, or language models trained with a fixed max sequence length),
- they keep the spec lightweight: there is no trigonometry, complex numbers, or special casing for even/odd dimensions.
If you want sinusoidal encodings (Transformer) or RoPE/rotary encodings, those can be defined as
pure functions that produce a tensor of shape (seqLen, embedDim) and then reused with the same
add_positional_encoding_spec below.
Reference (sinusoidal): "Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762
Learnable positional encoding parameters for a fixed (seqLen, embedDim).
This is intentionally just a tensor of trainable parameters. Higher-level models decide how to initialize it and whether to share/resize it across different sequence lengths.
- pos : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))
Add positional encodings: y = x + pos.
Both x and pos have the same shape, so this is just elementwise addition (no broadcasting).
Gradients #
Learnable positional encoding is just an elementwise addition:
y = x + pos
So the adjoint is just:
δx = δy
δpos = δy
This is trivial, but having it as a named spec makes higher-level models (e.g. ViT) easier to wire up without re-deriving the same one-liner everywhere.
Backward/VJP for add_positional_encoding_spec.
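In PyTorch-style pseudocode, the whole backward pass amounts to the following (the function name is illustrative):

```python
import torch

def add_positional_encoding_vjp(grad_y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Forward: y = x + pos. Addition routes the incoming cotangent unchanged
    # to both inputs, so δx = δy and δpos = δy.
    return grad_y, grad_y
```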
Sinusoidal positional encodings (pure functions) #
These are the classic Transformer sinusoidal encodings from:
Vaswani et al. (2017), "Attention Is All You Need".
We implement them as pure tensor generators so they can be reused in multiple model specs without adding new trainable parameters.
Pure sinusoidal positional encoding tensor with shape (seqLen, embedDim).
Definition (Transformer):
PE[pos, 2i]   = sin(θ(pos, i; embedDim))
PE[pos, 2i+1] = cos(θ(pos, i; embedDim))
where θ(pos, i; d) = pos / 10000^(2i / d) is the standard frequency schedule from Vaswani et al.
startPos is an offset for the absolute positions; use it when generating a chunk of positions
for cached decoding (e.g. tokens startPos .. startPos+seqLen-1).
This definition is total for all seqLen/embedDim:
- if embedDim = 0, the inner dimension is empty, so no scalar computations are observed,
- if embedDim is odd, the last column uses the same i = floor(j/2) convention as usual.
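A self-contained PyTorch-style sketch of the generator, assuming the θ(pos, i; d) schedule above; the function and argument names are illustrative:

```python
import torch

def sinusoidal_pe(start_pos: int, seq_len: int, embed_dim: int) -> torch.Tensor:
    # Absolute positions start_pos .. start_pos + seq_len - 1 (cached-decoding offset).
    positions = torch.arange(start_pos, start_pos + seq_len, dtype=torch.float32)
    # Column j uses pair index i = floor(j / 2); an odd embed_dim just means the
    # last column reuses the frequency of its missing partner.
    j = torch.arange(embed_dim, dtype=torch.float32)
    i = torch.floor(j / 2)
    theta = positions[:, None] / torch.pow(torch.tensor(10000.0), 2 * i / embed_dim)
    # Even columns take sin, odd columns take cos.
    even_mask = (torch.arange(embed_dim) % 2) == 0
    return torch.where(even_mask, torch.sin(theta), torch.cos(theta))
```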
Add sinusoidal positional encodings: y = x + sinusoidal(startPos, seqLen, embedDim).
Rotary positional embeddings (RoPE) utilities (pure functions) #
RoPE (Su et al., 2021) encodes position by applying a 2D rotation to each pair of features in the last dimension.
In most transformer implementations, RoPE is applied to query/key head vectors:
- per head: (seqLen, headDim)
- all heads: (numHeads, seqLen, headDim)
This file intentionally provides only pure RoPE helpers; we do not integrate them into attention yet.
References:
- Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding".
- Many implementations use the same θ(pos, i; d) frequency schedule as the sinusoidal PE.
Rotate pairs on the last dimension:
(x0, x1, x2, x3, ...) ↦ (-x1, x0, -x3, x2, ...).
This corresponds to multiplying each 2-vector (x_even, x_odd) by the matrix:
[[0, -1], [1, 0]].
Design note:
- Standard RoPE assumes headDim is even.
- This spec function is total: if headDim is odd, the last (unpaired) entry is left unchanged.
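A PyTorch-style sketch of this rotation (the name rotate_pairs is illustrative):

```python
import torch

def rotate_pairs(x: torch.Tensor) -> torch.Tensor:
    # x: (..., head_dim); maps (x0, x1, x2, x3, ...) to (-x1, x0, -x3, x2, ...).
    out = x.clone()
    paired = (x.shape[-1] // 2) * 2              # entries that form complete pairs
    out[..., 0:paired:2] = -x[..., 1:paired:2]   # even slots receive -x_odd
    out[..., 1:paired:2] = x[..., 0:paired:2]    # odd slots receive  x_even
    # With an odd head_dim, the final unpaired entry stays as it was.
    return out
```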
Broadcast RoPE cos(θ) factors to a full (headDim) vector for one position.
Broadcast RoPE sin(θ) factors to a full (headDim) vector for one position.
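Under the same assumed θ(pos, i; d) schedule, the cos helper for a single position might look like the sketch below (the sin helper is identical with torch.sin); the function name and the base argument are illustrative:

```python
import torch

def rope_cos_vector(pos: int, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # One angle per pair index i = floor(j / 2); repeating it for both members of
    # each pair broadcasts cos(θ) to a full (head_dim,) vector.
    j = torch.arange(head_dim, dtype=torch.float32)
    i = torch.floor(j / 2)
    theta = pos / torch.pow(torch.tensor(base), 2 * i / head_dim)
    return torch.cos(theta)
```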
Apply RoPE to a single head matrix x : (seqLen, headDim).
Implementation matches the standard identity:
rope(x) = x * cos + rotatePairs(x) * sin
where cos and sin are position-dependent vectors broadcast across the last dimension.
startPos is an absolute-position offset (useful for KV-cache decoding).
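Putting the pieces together, a single-head sketch that reuses rotate_pairs from above (again assuming the standard frequency schedule):

```python
import torch

def rope_apply(x: torch.Tensor, start_pos: int = 0, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, head_dim); rope(x) = x * cos + rotate_pairs(x) * sin,
    # with angles taken at absolute positions start_pos .. start_pos + seq_len - 1.
    seq_len, head_dim = x.shape
    positions = torch.arange(start_pos, start_pos + seq_len, dtype=torch.float32)
    j = torch.arange(head_dim, dtype=torch.float32)
    i = torch.floor(j / 2)
    theta = positions[:, None] / torch.pow(torch.tensor(base), 2 * i / head_dim)
    return x * torch.cos(theta) + rotate_pairs(x) * torch.sin(theta)
```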
Apply RoPE to (numHeads, seqLen, headDim) by applying rope_apply_spec independently per head.
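Continuing the sketch, the all-heads version is just a map over the leading dimension (in practice the cos/sin factors broadcast across heads, but an explicit loop matches the per-head spec most directly):

```python
import torch

def rope_apply_all_heads(x: torch.Tensor, start_pos: int = 0) -> torch.Tensor:
    # x: (num_heads, seq_len, head_dim); apply the single-head sketch per head.
    return torch.stack([rope_apply(x[h], start_pos) for h in range(x.shape[0])], dim=0)
```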