Embedding #
Spec-layer embedding primitives.
We model embeddings through one-hot tensors over a single scalar type: inputs share the scalar type α
with the embedding matrix, so they compose cleanly with the rest of the tensor language.
If you want index-based embeddings (integer token ids) in runtime graphs, that lives at the TorchLean/session layer via Nat channels; the spec layer stays purely numeric by default.
References / analogies:
- In most ML frameworks, an embedding table is a matrix `W : (vocab × embedDim)` and an index-based lookup returns `W[token_id]`. One-hot embeddings are the equivalent linear map `oneHot @ W` (this file); a concrete sketch follows this list.
- Bengio et al., "A Neural Probabilistic Language Model" (2003) for the classic embedding-table framing in neural language models.
- Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013) for the modern word-embedding perspective.
- PyTorch API docs:
  - `torch.nn.Embedding`: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
  - `torch.nn.functional.one_hot`: https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html
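To make the lookup/linear-map equivalence concrete, here is a minimal self-contained sketch in plain Lean. It deliberately avoids this library's `Tensor`/`Shape` API: `Mat`, `oneHotVec`, and `vecMatMul` are names invented for the example, not definitions from this file.

```lean
-- Toy matrices as plain functions, only to illustrate the linear-map view.
abbrev Mat (m n : Nat) : Type := Fin m → Fin n → Float

-- One-hot vector for token `tok`: 1.0 at index `tok`, 0.0 elsewhere.
def oneHotVec {vocab : Nat} (tok : Fin vocab) : Fin vocab → Float :=
  fun i => if i = tok then 1.0 else 0.0

-- Row vector times matrix: `(x @ W) j = Σᵢ x i * W i j`.
def vecMatMul {vocab embedDim : Nat}
    (x : Fin vocab → Float) (W : Mat vocab embedDim) : Fin embedDim → Float :=
  fun j => Fin.foldl vocab (fun acc i => acc + x i * W i j) 0.0

-- Only the `i = tok` summand survives, so `oneHotVec tok @ W = W[tok]`:
#eval
  let W : Mat 3 2 := fun i j => Float.ofNat (10 * i.val + j.val)
  vecMatMul (oneHotVec (1 : Fin 3)) W (0 : Fin 2)  -- 10.0, i.e. W[1][0]
```

Every summand except `i = tok` is multiplied by zero, which is exactly why the one-hot formulation and the index-based lookup agree.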
Standard embedding weight matrix: vocab × embedDim.
- `W : Tensor α (Shape.dim vocab (Shape.dim embedDim Shape.scalar))`
Embed a batch/sequence of one-hot vectors:
multiplying `oneHot : (seqLen × vocab)` by `W : (vocab × embedDim)` gives `(seqLen × embedDim)`.
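Continuing the toy sketch above (same caveat: illustrative names, not this file's API), the batched case is just an ordinary matrix product over stacked one-hot rows:

```lean
-- Matrix product: `(A @ B) i j = Σₜ A i t * B t j` (reuses `Mat` from above).
def matMul {m k n : Nat} (A : Mat m k) (B : Mat k n) : Mat m n :=
  fun i j => Fin.foldl k (fun acc t => acc + A i t * B t j) 0.0

-- Two positions with token ids [1, 2]: each output row is the matching row of W.
#eval
  let oneHots : Mat 2 3 := fun i j => if j.val = i.val + 1 then 1.0 else 0.0
  let W : Mat 3 2 := fun i j => Float.ofNat (10 * i.val + j.val)
  matMul oneHots W (0 : Fin 2) (0 : Fin 2)  -- 10.0 = W[1][0]
```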
Gradients #
`embedding_onehot_spec` is matrix multiplication: `Y = oneHot @ W`.
So the reverse-mode derivatives are the standard ones:
- `dOneHot = dY @ Wᵀ`
- `dW = oneHotᵀ @ dY`
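For completeness, here is the index-level derivation (standard matrix calculus, nothing specific to this file). Since `Y[i,j] = Σₖ oneHot[i,k] * W[k,j]`:
- `dOneHot[i,k] = Σⱼ dY[i,j] * W[k,j] = (dY @ Wᵀ)[i,k]`
- `dW[k,j] = Σᵢ oneHot[i,k] * dY[i,j] = (oneHotᵀ @ dY)[k,j]`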
Even though "true" one-hot tensors are often treated as non-differentiable in practice, having a named VJP is useful for:
- treating embeddings as a pure linear map in proofs,
- debugging equivalences (one-hot vs index-based embeddings),
- and keeping this layer consistent with the rest of the spec library.
Backward/VJP for `embedding_onehot_spec`: returns `(dOneHot, dW)`.
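In the toy `Mat` encoding from the sketches above, the returned pair would look like this; `embeddingOneHotVJP` is an illustrative name, and the actual definition here works over `Tensor α` rather than `Mat`:

```lean
-- Toy VJP: given upstream gradient `dY`, return `(dOneHot, dW)` with
-- `dOneHot = dY @ Wᵀ` and `dW = oneHotᵀ @ dY`.
def embeddingOneHotVJP {s v d : Nat}
    (oneHots : Mat s v) (W : Mat v d) (dY : Mat s d) : Mat s v × Mat v d :=
  ( fun i k => Fin.foldl d (fun acc j => acc + dY i j * W k j) 0.0        -- dY @ Wᵀ
  , fun k j => Fin.foldl s (fun acc i => acc + oneHots i k * dY i j) 0.0 )  -- oneHotᵀ @ dY
```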