Positional encodings (spec layer) #
This file provides the simplest positional encoding definition: a learnable per-position embedding that is added to token embeddings.
PyTorch analogy:
- This is the same idea as having an nn.Embedding(max_len, d_model) (or a parameter tensor of shape [max_len, d_model]) and computing x + pos[:seqLen], as sketched below.
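For concreteness, here is a minimal PyTorch-style sketch of that analogy; the module name and the max_len parameter are illustrative, not part of this spec:

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Illustrative PyTorch analogue: one trainable (max_len, d_model) table."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # Trainable per-position embeddings; initialization is left to the caller,
        # mirroring how the spec leaves it to higher-level models.
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model) with seq_len <= max_len
        seq_len = x.shape[0]
        return x + self.pos[:seq_len]
```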
Why learnable positional encodings show up a lot in practice:
- they are easy to train and tend to work well for fixed-length settings (e.g. ViT with a chosen patch grid, or language models trained with a fixed max sequence length),
- they keep the spec lightweight: there is no trigonometry, complex numbers, or special casing for even/odd dimensions.
If you want sinusoidal encodings (Transformer) or RoPE/rotary encodings, those can be defined as
pure functions that produce a tensor of shape (seqLen, embedDim) and then reused with the same
add_positional_encoding_spec below.
Reference (sinusoidal): "Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762
Learnable positional encoding parameters for a fixed (seqLen, embedDim).
This is intentionally just a tensor of trainable parameters. Higher-level models decide how to initialize it and whether to share/resize it across different sequence lengths.
- pos : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))
Add positional encodings: y = x + pos.
Both x and pos have the same shape, so this is just elementwise addition (no broadcasting).
Gradients #
Learnable positional encoding is just an elementwise addition:
y = x + pos
So the adjoint is just:
δx = δy
δpos = δy
This is trivial, but having it as a named spec makes higher-level models (e.g. ViT) easier to wire up without re-deriving the same one-liner everywhere.
Backward/VJP for add_positional_encoding_spec.
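In PyTorch-style pseudocode, the whole backward pass amounts to the following (the function name is illustrative):

```python
import torch

def add_positional_encoding_vjp(grad_y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Forward: y = x + pos. Addition routes the incoming cotangent unchanged
    # to both inputs, so δx = δy and δpos = δy.
    return grad_y, grad_y
```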
Sinusoidal positional encodings (pure functions) #
These are the classic Transformer sinusoidal encodings from:
Vaswani et al. (2017), "Attention Is All You Need".
We implement them as pure tensor generators so they can be reused in multiple model specs without adding new trainable parameters.
Pure sinusoidal positional encoding tensor with shape (seqLen, embedDim).
Definition (Transformer):
PE[pos, 2i]   = sin(θ(pos, i; embedDim))
PE[pos, 2i+1] = cos(θ(pos, i; embedDim))
where θ(pos, i; d) = pos / 10000^(2i / d) is the standard frequency schedule from Vaswani et al.
startPos is an offset for the absolute positions; use it when generating a chunk of positions
for cached decoding (e.g. tokens startPos .. startPos+seqLen-1).
This definition is total for all seqLen/embedDim:
- if embedDim = 0, the inner dimension is empty, so no scalar computations are observed,
- if embedDim is odd, the last column uses the same i = floor(j/2) convention as usual.
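A self-contained PyTorch-style sketch of the generator, assuming the θ(pos, i; d) schedule above; the function and argument names are illustrative:

```python
import torch

def sinusoidal_pe(start_pos: int, seq_len: int, embed_dim: int) -> torch.Tensor:
    # Absolute positions start_pos .. start_pos + seq_len - 1 (cached-decoding offset).
    positions = torch.arange(start_pos, start_pos + seq_len, dtype=torch.float32)
    # Column j uses pair index i = floor(j / 2); an odd embed_dim just means the
    # last column reuses the frequency of its missing partner.
    j = torch.arange(embed_dim, dtype=torch.float32)
    i = torch.floor(j / 2)
    theta = positions[:, None] / torch.pow(torch.tensor(10000.0), 2 * i / embed_dim)
    # Even columns take sin, odd columns take cos.
    even_mask = (torch.arange(embed_dim) % 2) == 0
    return torch.where(even_mask, torch.sin(theta), torch.cos(theta))
```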
Add sinusoidal positional encodings: y = x + sinusoidal(startPos, seqLen, embedDim).
Rotary positional embeddings (RoPE) utilities (pure functions) #
RoPE (Su et al., 2021) encodes position by applying a 2D rotation to each pair of features in the last dimension.
In most transformer implementations, RoPE is applied to query/key head vectors:
- per head: (seqLen, headDim)
- all heads: (numHeads, seqLen, headDim)
This file intentionally provides only pure RoPE helpers; we do not integrate them into attention yet.
References:
- Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding".
- Many implementations use the same θ(pos, i; d) frequency schedule as the sinusoidal PE.
Rotate pairs on the last dimension:
(x0, x1, x2, x3, ...) ↦ (-x1, x0, -x3, x2, ...).
This corresponds to multiplying each 2-vector (x_even, x_odd) by the matrix:
[[0, -1], [1, 0]].
Design note:
- Standard RoPE assumes headDim is even.
- This spec function is total: if headDim is odd, the last (unpaired) entry is left unchanged.
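A PyTorch-style sketch of this rotation (the name rotate_pairs is illustrative):

```python
import torch

def rotate_pairs(x: torch.Tensor) -> torch.Tensor:
    # x: (..., head_dim); maps (x0, x1, x2, x3, ...) to (-x1, x0, -x3, x2, ...).
    out = x.clone()
    paired = (x.shape[-1] // 2) * 2              # entries that form complete pairs
    out[..., 0:paired:2] = -x[..., 1:paired:2]   # even slots receive -x_odd
    out[..., 1:paired:2] = x[..., 0:paired:2]    # odd slots receive  x_even
    # With an odd head_dim, the final unpaired entry stays as it was.
    return out
```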
Broadcast RoPE cos(θ) factors to a full (headDim) vector for one position.
Broadcast RoPE sin(θ) factors to a full (headDim) vector for one position.
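Under the same assumed θ(pos, i; d) schedule, the cos helper for a single position might look like the sketch below (the sin helper is identical with torch.sin); the function name and the base argument are illustrative:

```python
import torch

def rope_cos_vector(pos: int, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # One angle per pair index i = floor(j / 2); repeating it for both members of
    # each pair broadcasts cos(θ) to a full (head_dim,) vector.
    j = torch.arange(head_dim, dtype=torch.float32)
    i = torch.floor(j / 2)
    theta = pos / torch.pow(torch.tensor(base), 2 * i / head_dim)
    return torch.cos(theta)
```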
Apply RoPE to a single head matrix x : (seqLen, headDim).
Implementation matches the standard identity:
rope(x) = x * cos + rotatePairs(x) * sin
where cos and sin are position-dependent vectors broadcast across the last dimension.
startPos is an absolute-position offset (useful for KV-cache decoding).
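Putting the pieces together, a single-head sketch that reuses rotate_pairs from above (again assuming the standard frequency schedule):

```python
import torch

def rope_apply(x: torch.Tensor, start_pos: int = 0, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, head_dim); rope(x) = x * cos + rotate_pairs(x) * sin,
    # with angles taken at absolute positions start_pos .. start_pos + seq_len - 1.
    seq_len, head_dim = x.shape
    positions = torch.arange(start_pos, start_pos + seq_len, dtype=torch.float32)
    j = torch.arange(head_dim, dtype=torch.float32)
    i = torch.floor(j / 2)
    theta = positions[:, None] / torch.pow(torch.tensor(base), 2 * i / head_dim)
    return x * torch.cos(theta) + rotate_pairs(x) * torch.sin(theta)
```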
Apply RoPE to (numHeads, seqLen, headDim) by applying rope_apply_spec independently per head.
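Continuing the sketch, the all-heads version is just a map over the leading dimension (in practice the cos/sin factors broadcast across heads, but an explicit loop matches the per-head spec most directly):

```python
import torch

def rope_apply_all_heads(x: torch.Tensor, start_pos: int = 0) -> torch.Tensor:
    # x: (num_heads, seq_len, head_dim); apply the single-head sketch per head.
    return torch.stack([rope_apply(x[h], start_pos) for h in range(x.shape[0])], dim=0)
```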