Seq2Seq (spec model) #
Encoder-decoder models for sequence generation.
This file supports both:
- discrete token id inputs (non‑differentiable lookup, good for runtime demos), and
- one‑hot / token‑distribution inputs (differentiable embedding via a matrix multiply).
PyTorch mental model:
- encoder: nn.RNN/nn.LSTM (or nn.TransformerEncoder) over source token embeddings
- decoder: nn.RNN over target embeddings (teacher forcing in training), then a final nn.Linear to vocabulary logits
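For intuition only, a minimal PyTorch sketch of that mental model (the class name, dimensions, and wiring here are illustrative, not part of the Lean spec):

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal RNN encoder-decoder matching the mental model above (illustrative only)."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sequence; keep only the final hidden state.
        _, h_n = self.encoder(self.src_embed(src_ids))
        # Teacher forcing: feed ground-truth target embeddings to the decoder.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h_n)
        # Time-distributed projection to vocabulary logits.
        return self.out(dec_out)

model = TinySeq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (1, 7)), torch.randint(0, 120, (1, 5)))
print(logits.shape)  # torch.Size([1, 5, 120])
```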
What we deliberately keep simple:
- the optional attention in Seq2SeqDecoderSpec is self-attention over the decoder inputs (a small variant you can toggle on/off); this file does not model encoder-decoder cross-attention in the main baseline.
- for cross-attention-style mechanisms, we include a small additive/Bahdanau-style attention at the bottom of the file (compute_attention_weights_spec / apply_attention_spec).
The transformer encoder blocks used by the transformer variant come from
NN/Spec/Models/Transformer.lean.
References:
- Sutskever et al., "Sequence to Sequence Learning with Neural Networks" (NeurIPS 2014).
- Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" (2015).
- Hochreiter and Schmidhuber, "Long Short-Term Memory" (1997).
- Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (2014).
- Vaswani et al., "Attention Is All You Need" (2017) for the transformer encoder variant.
- Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR 2014).
PyTorch docs (for API intuition, not semantics):
- torch.nn.Embedding: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
- torch.nn.RNN: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
- torch.nn.LSTM: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- torch.nn.Linear: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
- torch.nn.MultiheadAttention: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
- torch.nn.TransformerEncoderLayer: https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html
Training + gradients (one-hot inputs) #
Most of this file focuses on architecture variants and forward passes (teacher-forcing, inference-time decoding, optional self-attention in the decoder, etc.).
To make Seq2Seq usable as a first-class baseline, we also provide an explicit training objective and reverse-mode gradients for the differentiable path:
- inputs are one-hot / token distributions (so embedding lookup is a matrix multiply),
- teacher forcing is used in the decoder,
- the loss is per-timestep cross-entropy between softmax(logits) and the target distribution,
- gradients flow through embeddings, encoder RNN, decoder RNN, output projection, and (optionally) the decoder self-attention block.
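Written out in the file's notation (the exact normalization over timesteps, e.g. sum vs. mean, is fixed by the spec itself; a plain sum is shown here for concreteness), the objective is:

loss = Σ_t ( − Σ_v tgt_{t,v} · log(softmax(logits_t)_v) )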
Token-id based training (Tensor Nat) is still useful for demos, but it is intentionally treated as
non-differentiable.
Small gradient records #
Gradients for a time-distributed affine map y = x·Wᵀ + b.
This mirrors the parameters in LinearSpec and is used for the decoder output projection.
PyTorch analogue: the gradient pair for nn.Linear.
- dW : Tensor α (Shape.dim outDim (Shape.dim inDim Shape.scalar))
Gradient of the weight matrix W.
- db : Tensor α (Shape.dim outDim Shape.scalar)
Gradient of the bias vector b.
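For reference, the standard gradients of y = x·Wᵀ + b accumulated over timesteps t (with upstream gradient dy_t) are:

dW = Σ_t dy_t ⊗ x_t
db = Σ_t dy_t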
Instances For
Gradients for an RNNSpec cell.
PyTorch analogue: the gradients for nn.RNN parameters (weight and bias).
- dW : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient of the concatenated input+hidden weight matrix.
- db : HiddenVector α hiddenSize
Gradient of the bias term.
Instances For
Gradients for a token embedding table E : (vocabSize × embedDim).
PyTorch analogue: nn.Embedding.weight.grad.
- d_embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))
Gradient of the embedding matrix.
Instances For
End-to-end gradient record for the differentiable Seq2Seq baseline.
This bundles gradients for:
- source/target embeddings,
- encoder RNN,
- decoder RNN,
- decoder output projection,
- optional decoder self-attention (if enabled in the decoder spec).
- d_src_embedding : Seq2SeqEmbeddingGrads α srcVocabSize embedDim
Gradients for the source embedding table.
- d_tgt_embedding : Seq2SeqEmbeddingGrads α tgtVocabSize embedDim
Gradients for the target embedding table.
- d_encoder : Seq2SeqRNNGrads α embedDim hiddenDim
Gradients for the encoder RNN parameters.
- d_decoder_rnn : Seq2SeqRNNGrads α embedDim hiddenDim
Gradients for the decoder RNN parameters.
- d_output_projection : Seq2SeqLinearGrads α hiddenDim tgtVocabSize
Gradients for the decoder output projection (hiddenDim -> tgtVocabSize).
- d_decoder_attention : Option ((numHeads : ℕ) × MultiHeadAttentionGrads numHeads embedDim (embedDim / numHeads) α)
Gradients for optional decoder self-attention parameters.
Instances For
Seq2Seq token embedding specification.
Parameters:
- embedding: a lookup table E : (vocabSize × embedDim),
- dropout_rate: a scalar p used by dropout_inference_spec (deterministic scaling).
PyTorch analogue: nn.Embedding(vocabSize, embedDim) plus a nn.Dropout(p) applied to the
sequence of embeddings.
- embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))
Embedding table E : (vocabSize × embedDim).
- dropout_rate : α
Dropout probability p (used in a deterministic inference-style way).
Instances For
Embedding forward pass for discrete token ids.
Inputs:
token_ids : (seqLen), a tensor of natural-number token ids.
Output:
y : (seqLen × embedDim), where each timestep selects a row of the embedding table.
Out-of-range token ids map to a zero vector in this spec. We then apply
dropout_inference_spec deterministically (no RNG), which is useful for runtime demos.
PyTorch analogue: nn.Embedding on an integer tensor, followed by nn.Dropout(p) (but with
randomness replaced by deterministic scaling; see NN.Spec.Layers.Dropout).
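A rough PyTorch-style sketch of this behaviour. The (1 - p) factor is an assumption standing in for the deterministic dropout scaling; the actual scaling is whatever NN.Spec.Layers.Dropout defines:

```python
import torch

def embed_ids(E: torch.Tensor, token_ids: torch.Tensor, p: float) -> torch.Tensor:
    """Lookup rows of E for each id; out-of-range ids become zero vectors.

    E: (vocab_size, embed_dim); token_ids: (seq_len,) integer tensor of natural numbers.
    The (1 - p) factor is an assumed stand-in for dropout_inference_spec."""
    vocab_size, embed_dim = E.shape
    out = torch.zeros(token_ids.shape[0], embed_dim, dtype=E.dtype)
    valid = token_ids < vocab_size          # out-of-range ids keep the zero row
    out[valid] = E[token_ids[valid]]
    return out * (1.0 - p)
```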
Instances For
Seq2Seq embedding forward pass for one-hot / token distributions.
This is the usual "embedding lookup as a matrix multiply":
- if E : (vocabSize × embedDim) is the embedding table,
- and x_t : (vocabSize) is a one-hot / probability vector for time step t,
- then the embedded vector is y_t = x_tᵀ · E : (embedDim).
PyTorch analogy: y = x @ E where x is one-hot / a distribution; this matches nn.Embedding
when the input is exactly one-hot.
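A quick PyTorch check of that claim (the table and ids below are made up for illustration):

```python
import torch
import torch.nn.functional as F

E = torch.randn(10, 4)                           # embedding table (vocab_size=10, embed_dim=4)
ids = torch.tensor([3, 7, 0])                    # a length-3 token sequence
onehot = F.one_hot(ids, num_classes=10).float()  # (3, 10)

# "Embedding lookup as a matrix multiply": x @ E equals row selection when x is one-hot.
assert torch.allclose(onehot @ E, E[ids])
```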
Instances For
Backward pass for Seq2SeqEmbeddingSpec.forwardOnehot.
This is just a time-distributed linear layer:
y_t = token_tᵀ · E
So:
- dE = Σ_t token_t ⊗ dY_t
- dToken_t = E · dY_t (not usually needed, but included for completeness)
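A small autograd sanity check of the dE formula (illustrative shapes; the sum of per-timestep outer products is just Xᵀ · dY for the whole sequence):

```python
import torch

E = torch.randn(10, 4, requires_grad=True)    # embedding table
X = torch.zeros(3, 10)                        # one-hot inputs for 3 timesteps
X[torch.arange(3), torch.tensor([3, 7, 0])] = 1.0

Y = X @ E                                     # forward: y_t = x_tᵀ · E
dY = torch.randn_like(Y)                      # pretend upstream gradients
Y.backward(dY)

# dE = Σ_t token_t ⊗ dY_t, i.e. Xᵀ @ dY over the whole sequence.
assert torch.allclose(E.grad, X.t() @ dY, atol=1e-6)
```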
Instances For
RNN-based encoder specification for Seq2Seq.
This models an nn.RNN-style encoder over embedded tokens:
- input is a sequence of embeddings (seqLen × embedDim),
- output is the full hidden-state sequence plus the final hidden state.
PyTorch analogue: nn.RNN(..., batch_first=True) (ignoring the batch axis), returning (output, h_n).
- rnn : RNNSpec α embedDim hiddenDim
RNN cell parameters.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to the input sequence).
Instances For
Forward pass for Seq2SeqRNNEncoderSpec.
Inputs:
- x : (seqLen × embedDim), embedded source tokens,
- h0, optional initial hidden state (hiddenDim).
Returns:
(outputs, final_h) where outputs : (seqLen × hiddenDim) is the per-timestep hidden sequence.
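The PyTorch analogue of this forward pass, ignoring the batch axis and the dropout scaling (shapes are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)                 # (batch=1, seqLen=5, embedDim=8)
outputs, h_n = rnn(x)                    # outputs: (1, 5, 16), h_n: (1, 1, 16)
final_h = h_n.squeeze(0).squeeze(0)      # the spec's final_h : (hiddenDim)
```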
Instances For
LSTM-based encoder specification for Seq2Seq.
This models an nn.LSTM-style encoder over embedded tokens, returning the full hidden sequence,
final hidden state, and final cell state.
PyTorch analogue: nn.LSTM(..., batch_first=True) (ignoring the batch axis), returning
(output, (h_n, c_n)).
- lstm : LSTMSpec α embedDim hiddenDim
LSTM cell parameters.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to the input sequence).
Instances For
Forward pass for Seq2SeqLSTMEncoderSpec.
Inputs:
- x : (seqLen × embedDim), embedded source tokens,
- h0, optional initial hidden state (hiddenDim),
- c0, optional initial cell state (hiddenDim).
Returns:
(outputs, final_h, final_c) where outputs : (seqLen × hiddenDim) is the per-timestep hidden sequence.
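And the LSTM analogue, again ignoring the batch axis and dropout (illustrative shapes):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
outputs, (h_n, c_n) = lstm(torch.randn(1, 5, 8))   # outputs: (1, 5, 16)
final_h, final_c = h_n[0, 0], c_n[0, 0]            # the spec's (final_h, final_c)
```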
Instances For
Transformer-based encoder specification for Seq2Seq.
This is a lightweight wrapper around a list of TransformerEncoderLayers from
NN.Spec.Models.Transformer, applied as a left-fold.
PyTorch analogue: nn.TransformerEncoder(nn.TransformerEncoderLayer(...), num_layers=...)
(ignoring dropout and most configuration knobs).
- layers : List (TransformerEncoderLayer numHeads embedDim (embedDim * 4) α)
Encoder layer stack; typically length numLayers, but not enforced by the spec.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to the input sequence).
Instances For
Forward pass for Seq2SeqTransformerEncoderSpec.
Input/output shape: (seqLen × embedDim).
This uses post-norm transformer layers from NN.Spec.Models.Transformer and does not model
dropout; it is meant as a clean semantic reference rather than a full training-ready implementation.
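"Applied as a left-fold" just means each layer's output feeds the next. A hedged PyTorch-flavoured sketch (the layer construction below is illustrative; the real layers come from NN.Spec.Models.Transformer and are post-norm with no dropout):

```python
import torch
import torch.nn as nn
from functools import reduce

embed_dim = 8
layers = [nn.TransformerEncoderLayer(d_model=embed_dim, nhead=2,
                                     dim_feedforward=embed_dim * 4,
                                     dropout=0.0, batch_first=True)
          for _ in range(2)]

x = torch.randn(1, 5, embed_dim)   # (batch=1, seqLen=5, embedDim=8)
# Left-fold over the layer stack: y = layerN(... layer2(layer1(x)) ...)
y = reduce(lambda acc, layer: layer(acc), layers, x)
```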
Instances For
RNN decoder specification for Seq2Seq.
This decoder consumes a sequence of target-side embeddings and produces vocabulary logits:
- an RNNSpec cell updates the hidden state per timestep,
- a time-distributed LinearSpec maps hidden states to logits,
- optionally, a self-attention block can be applied over the decoder input embeddings before the RNN.
PyTorch analogue: a hand-rolled decoder using nn.RNN and nn.Linear, optionally preceded by
nn.MultiheadAttention over the target embeddings (note: this is not encoder-decoder
cross-attention).
- rnn : RNNSpec α embedDim hiddenDim
Decoder RNN cell parameters.
- attention : Option ((numHeads : ℕ) × MultiHeadAttention α numHeads embedDim (embedDim / numHeads))
Optional self-attention block over decoder input embeddings.
- output_projection : LinearSpec α hiddenDim vocabSize
Output projection (hiddenDim -> vocabSize) producing per-timestep logits.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to decoder input embeddings).
Instances For
Seq2Seq decoder forward pass (teacher forcing).
- target_embeddings: Tensor of shape (tgtSeqLen × embedDim)
- h0: initial hidden state (hiddenDim)
- Returns: Tensor of shape (tgtSeqLen × vocabSize)
- If decoder.attention is some, this runs self-attention over target_embeddings and feeds the attended embedding at each timestep.
- Note: this spec does not model cross-attention over encoder outputs.
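A hand-rolled PyTorch analogue of this teacher-forcing pass, without the optional self-attention (names, shapes, and the helper function are illustrative only):

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 8, 16, 50
cell = nn.RNNCell(embed_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)

def decode_teacher_forcing(target_embeddings, h0):
    """target_embeddings: (tgtSeqLen, embedDim), h0: (hiddenDim) -> (tgtSeqLen, vocabSize)."""
    h = h0.unsqueeze(0)                                  # RNNCell expects a batch axis
    logits = []
    for t in range(target_embeddings.shape[0]):
        h = cell(target_embeddings[t].unsqueeze(0), h)   # one RNN step on the t-th embedding
        logits.append(proj(h))                           # time-distributed projection to logits
    return torch.cat(logits, dim=0)

out = decode_teacher_forcing(torch.randn(5, embed_dim), torch.zeros(hidden_dim))
print(out.shape)  # torch.Size([5, 50])
```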
Instances For
Decoder backward (teacher forcing) #
The decoder is: (optional self-attention) → RNN → time-distributed linear projection.
We compute gradients by:
- recomputing the attended embeddings (if any),
- recomputing the decoder hidden sequence,
- backpropagating through the output projection per timestep,
- backpropagating through the RNN sequence,
- optionally backpropagating through self-attention.
Backward pass for a time-distributed LinearSpec.
Given a hidden-state sequence hiddens : (tgtSeqLen × hiddenDim) and upstream gradients
grad_logits : (tgtSeqLen × vocabSize), computes:
- accumulated parameter gradients for the shared LinearSpec,
- gradients w.r.t. each hidden state (tgtSeqLen × hiddenDim).
PyTorch analogue: backprop through nn.Linear applied at each timestep.
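A compact sketch of that accumulation, with the per-timestep sums collapsed into whole-sequence matrix products (purely illustrative; the function name is made up):

```python
import torch

def time_distributed_linear_backward(W, hiddens, grad_logits):
    """W: (vocabSize, hiddenDim); hiddens: (T, hiddenDim); grad_logits: (T, vocabSize)."""
    dW = grad_logits.t() @ hiddens       # Σ_t grad_logits_t ⊗ hidden_t
    db = grad_logits.sum(dim=0)          # Σ_t grad_logits_t
    grad_hiddens = grad_logits @ W       # dh_t = Wᵀ · grad_logits_t for each t
    return dW, db, grad_hiddens
```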
Instances For
Backward pass for Seq2SeqDecoderSpec.forwardTeacherForcing.
Returns:
- RNN parameter gradients,
- output projection gradients,
- optional self-attention parameter gradients,
- gradient w.r.t. the target embeddings sequence,
- gradient w.r.t. the initial hidden state h0.
Implementation note: this spec recomputes the attended embeddings and hidden sequence to keep the backward pass self-contained (no mutable tape).
Instances For
Seq2Seq decoder forward pass (inference-time autoregressive decoding).
This runs a greedy decoding loop for maxLen steps, starting from:
- an initial hidden state h0,
- a starting input embedding vector (start_token already embedded),
- and a target embedding table tgt_embedding used to embed the predicted next token id.
Returns:
- the per-step logits (maxLen × vocabSize),
- the greedy-decoded token ids (maxLen).
PyTorch analogue: a manual decoding loop using nn.RNNCell/nn.RNN + nn.Linear, with
argmax sampling and embedding lookup each step.
Note: decoder.attention is only modeled in the teacher-forcing forward/backward in this file; the
greedy decoding loop below does not implement autoregressive self-attention.
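A manual greedy decoding loop in the same spirit (no self-attention, argmax sampling; the names and shapes are illustrative rather than the spec itself):

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size, max_len = 8, 16, 50, 6
cell = nn.RNNCell(embed_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)
tgt_embedding = torch.randn(vocab_size, embed_dim)    # target embedding table

def greedy_decode(start_embedding, h0):
    """start_embedding: (embedDim), h0: (hiddenDim) -> (maxLen, vocabSize) logits + token ids."""
    h, x = h0.unsqueeze(0), start_embedding.unsqueeze(0)
    all_logits, token_ids = [], []
    for _ in range(max_len):
        h = cell(x, h)
        logits = proj(h)                               # (1, vocabSize)
        next_id = int(logits.argmax(dim=-1))           # greedy choice
        all_logits.append(logits)
        token_ids.append(next_id)
        x = tgt_embedding[next_id].unsqueeze(0)        # embed the prediction for the next step
    return torch.cat(all_logits, dim=0), token_ids
```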
Instances For
Complete Seq2Seq model specification (baseline).
This bundles:
- source and target embedding tables,
- an RNN encoder,
- an RNN decoder with output projection (and optional decoder self-attention).
PyTorch analogue: a small encoder-decoder model built from nn.Embedding, nn.RNN, and
nn.Linear.
- src_embedding : Seq2SeqEmbeddingSpec α srcVocabSize embedDim
Source embedding table + dropout configuration.
- tgt_embedding : Seq2SeqEmbeddingSpec α tgtVocabSize embedDim
Target embedding table + dropout configuration.
- encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim
Encoder RNN parameters.
- decoder : Seq2SeqDecoderSpec α embedDim hiddenDim tgtVocabSize
Decoder parameters (RNN + output projection + optional self-attention).
Instances For
Seq2Seq forward pass for training (teacher forcing) using discrete token ids.
Inputs:
src_tokens : (srcSeqLen) and tgt_tokens : (tgtSeqLen) are token id tensors.
Output:
- logits of shape (tgtSeqLen × tgtVocabSize).
This path is convenient for runtime demos but intentionally treated as non-differentiable (token-id embedding lookup is not modeled as a differentiable op).
Instances For
Seq2Seq forward pass for inference-time decoding using discrete token ids.
This embeds the source token ids, encodes them to get an initial decoder hidden state, then runs
greedy decoding for maxTgtLen steps starting from the given start_token.
Returns:
- logits (maxTgtLen × tgtVocabSize),
- greedy-decoded token ids (maxTgtLen).
Instances For
Differentiable training + backward (one-hot inputs) #
This is the “full” training interface for the Seq2Seq baseline.
Differentiable forward pass for training (teacher forcing) using one-hot/token-distribution inputs.
This is the same computation as Seq2SeqSpec.forwardTraining, except that embedding lookup is
expressed as a matrix multiplication (forwardOnehot), so gradients can flow into the embedding
tables and back into upstream token distributions (if desired).
Instances For
Per-timestep cross-entropy loss for the differentiable Seq2Seq baseline.
Computes:
- logits via Seq2SeqSpec.forwardTrainingOnehot,
- probabilities via softmax,
- cross-entropy against the target token distribution at each timestep.
PyTorch analogue: nn.CrossEntropyLoss applied per timestep (with targets represented as one-hot / probability distributions).
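Because the targets are distributions rather than integer labels, the loss can be written directly with log_softmax; a short sketch (averaging over timesteps is an assumption here, the spec fixes its own normalization):

```python
import torch
import torch.nn.functional as F

def per_timestep_cross_entropy(logits, target_dist):
    """logits, target_dist: (tgtSeqLen, tgtVocabSize); each target row sums to 1."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()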
Instances For
Compute (loss, grads) for the Seq2Seq baseline under per-timestep cross-entropy.
This returns gradients for:
- both embedding tables,
- the encoder RNN,
- the decoder RNN,
- the decoder output projection,
- and decoder self-attention (if present).
Instances For
Attention-augmented Seq2Seq specification (simple encoder-output attention).
This record extends the baseline with an additional projection matrix used by the helper
attention functions below (compute_attention_weights_spec / apply_attention_spec).
Note: this file includes these attention helpers as a building block; the main baseline forward passes above do not integrate encoder-decoder cross-attention by default.
- src_embedding : Seq2SeqEmbeddingSpec α srcVocabSize embedDim
Source embedding table + dropout configuration.
- tgt_embedding : Seq2SeqEmbeddingSpec α tgtVocabSize embedDim
Target embedding table + dropout configuration.
- encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim
Encoder RNN parameters.
- decoder : Seq2SeqDecoderSpec α embedDim hiddenDim tgtVocabSize
Decoder parameters (RNN + output projection + optional self-attention).
- attention_weights : Tensor α (Shape.dim hiddenDim (Shape.dim hiddenDim Shape.scalar))
Attention projection matrix used to score encoder outputs against the decoder hidden state.
Instances For
Compute attention weights over encoder outputs for a single decoder hidden state.
This is a simple dot-product style attention:
- project the decoder hidden state (attention_weights · decoder_hidden),
- score each encoder hidden vector by an elementwise product + sum,
- normalize scores with softmax over the sequence axis.
It is inspired by classic encoder-decoder attention mechanisms (Bahdanau-style), but is intentionally kept compact in this spec.
Instances For
Apply attention weights to encoder outputs (weighted sum / context vector).
Given attention weights a : (seqLen) and encoder outputs H : (seqLen × hiddenDim), returns the
context vector c = Σ_i a_i · H_i : (hiddenDim).
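Putting the two helpers together, a rough PyTorch-style sketch (the projection and scoring details are a reading of the "elementwise product + sum" description above, not a verbatim port of the Lean functions):

```python
import torch
import torch.nn.functional as F

def compute_attention_weights(W_attn, decoder_hidden, encoder_outputs):
    """W_attn: (hiddenDim, hiddenDim); decoder_hidden: (hiddenDim); encoder_outputs: (seqLen, hiddenDim)."""
    projected = W_attn @ decoder_hidden                  # project the decoder hidden state
    scores = (encoder_outputs * projected).sum(dim=-1)   # elementwise product + sum per position
    return F.softmax(scores, dim=0)                      # normalize over the sequence axis

def apply_attention(weights, encoder_outputs):
    """weights: (seqLen); encoder_outputs: (seqLen, hiddenDim) -> context: (hiddenDim)."""
    return weights @ encoder_outputs                     # c = Σ_i a_i · H_i
```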