Seq2Seq (spec model) #
Encoder-decoder models for sequence generation.
This file supports both:
- discrete token id inputs (non‑differentiable lookup, good for runtime demos), and
- one‑hot / token‑distribution inputs (differentiable embedding via a matrix multiply).
PyTorch mental model:
- encoder: nn.RNN/nn.LSTM (or nn.TransformerEncoder) over source token embeddings
- decoder: nn.RNN over target embeddings (teacher forcing in training), then a final nn.Linear to vocabulary logits
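For intuition only, a minimal PyTorch sketch of that mental model (the class name, dimensions, and wiring here are illustrative, not part of the Lean spec):

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal RNN encoder-decoder matching the mental model above (illustrative only)."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sequence; keep only the final hidden state.
        _, h_n = self.encoder(self.src_embed(src_ids))
        # Teacher forcing: feed ground-truth target embeddings to the decoder.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h_n)
        # Time-distributed projection to vocabulary logits.
        return self.out(dec_out)

model = TinySeq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (1, 7)), torch.randint(0, 120, (1, 5)))
print(logits.shape)  # torch.Size([1, 5, 120])
```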
What we deliberately keep simple:
- the optional attention in Seq2SeqDecoderSpec is self-attention over the decoder inputs (a small variant you can toggle on/off); this file does not model encoder-decoder cross-attention in the main baseline.
- for cross-attention-style mechanisms, we include a small additive/Bahdanau-style attention at the bottom of the file (compute_attention_weights_spec / apply_attention_spec).
The transformer encoder blocks used by the transformer variant come from
NN/Spec/Models/Transformer.lean.
References:
- Sutskever et al., "Sequence to Sequence Learning with Neural Networks" (NeurIPS 2014).
- Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" (2015).
- Hochreiter and Schmidhuber, "Long Short-Term Memory" (1997).
- Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (2014).
- Vaswani et al., "Attention Is All You Need" (2017) for the transformer encoder variant.
- Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (JMLR 2014).
PyTorch docs (for API intuition, not semantics):
- torch.nn.Embedding: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
- torch.nn.RNN: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
- torch.nn.LSTM: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- torch.nn.Linear: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
- torch.nn.MultiheadAttention: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
- torch.nn.TransformerEncoderLayer: https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html
Training + gradients (one-hot inputs) #
Most of this file focuses on architecture variants and forward passes (teacher-forcing, inference-time decoding, optional self-attention in the decoder, etc.).
To make Seq2Seq usable as a first-class baseline, we also provide an explicit training objective and reverse-mode gradients for the differentiable path:
- inputs are one-hot / token distributions (so embedding lookup is a matrix multiply),
- teacher forcing is used in the decoder,
- the loss is per-timestep cross-entropy between softmax(logits) and the target distribution,
- gradients flow through embeddings, encoder RNN, decoder RNN, output projection, and (optionally) the decoder self-attention block.
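Written out in the file's notation (the exact normalization over timesteps, e.g. sum vs. mean, is fixed by the spec itself; a plain sum is shown here for concreteness), the objective is:

loss = Σ_t ( − Σ_v tgt_{t,v} · log(softmax(logits_t)_v) )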
Token-id based training (Tensor Nat) is still useful for demos, but it is intentionally treated as
non-differentiable.
Small gradient records #
Gradients for a time-distributed affine map y = x·Wᵀ + b.
This mirrors the parameters in LinearSpec and is used for the decoder output projection.
PyTorch analogue: the gradient pair for nn.Linear.
- dW : Tensor α (Shape.dim outDim (Shape.dim inDim Shape.scalar))
Gradient of the weight matrix W.
- db : Tensor α (Shape.dim outDim Shape.scalar)
Gradient of the bias vector b.
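For reference, the standard gradients of y = x·Wᵀ + b accumulated over timesteps t (with upstream gradient dy_t) are:

dW = Σ_t dy_t ⊗ x_t
db = Σ_t dy_t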
Instances For
Gradients for an RNNSpec cell.
PyTorch analogue: the gradients for nn.RNN parameters (weight and bias).
- dW : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient of the concatenated input+hidden weight matrix.
- db : HiddenVector α hiddenSize
Gradient of the bias term.
Instances For
Gradients for a token embedding table E : (vocabSize × embedDim).
PyTorch analogue: nn.Embedding.weight.grad.
- d_embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))
Gradient of the embedding matrix.
Instances For
End-to-end gradient record for the differentiable Seq2Seq baseline.
This bundles gradients for:
- source/target embeddings,
- encoder RNN,
- decoder RNN,
- decoder output projection,
- optional decoder self-attention (if enabled in the decoder spec).
- d_src_embedding : Seq2SeqEmbeddingGrads α srcVocabSize embedDim
Gradients for the source embedding table.
- d_tgt_embedding : Seq2SeqEmbeddingGrads α tgtVocabSize embedDim
Gradients for the target embedding table.
- d_encoder : Seq2SeqRNNGrads α embedDim hiddenDim
Gradients for the encoder RNN parameters.
- d_decoder_rnn : Seq2SeqRNNGrads α embedDim hiddenDim
Gradients for the decoder RNN parameters.
- d_output_projection : Seq2SeqLinearGrads α hiddenDim tgtVocabSize
Gradients for the decoder output projection (hiddenDim -> tgtVocabSize).
- d_decoder_attention : Option ((numHeads : ℕ) × MultiHeadAttentionGrads numHeads embedDim (embedDim / numHeads) α)
Gradients for optional decoder self-attention parameters.
Instances For
Seq2Seq token embedding specification.
Parameters:
- embedding: a lookup table E : (vocabSize × embedDim),
- dropout_rate: a scalar p used by dropout_inference_spec (deterministic scaling).
PyTorch analogue: nn.Embedding(vocabSize, embedDim) plus a nn.Dropout(p) applied to the
sequence of embeddings.
- embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))
Embedding table E : (vocabSize × embedDim).
- dropout_rate : α
Dropout probability p (used in a deterministic inference-style way).
Instances For
Embedding forward pass for discrete token ids.
Inputs:
token_ids : (seqLen), a tensor of natural-number token ids.
Output:
y : (seqLen × embedDim), where each timestep selects a row of the embedding table.
Out-of-range token ids map to a zero vector in this spec. We then apply
dropout_inference_spec deterministically (no RNG), which is useful for runtime demos.
PyTorch analogue: nn.Embedding on an integer tensor, followed by nn.Dropout(p) (but with
randomness replaced by deterministic scaling; see NN.Spec.Layers.Dropout).
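A rough PyTorch-style sketch of this behaviour. The (1 - p) factor is an assumption standing in for the deterministic dropout scaling; the actual scaling is whatever NN.Spec.Layers.Dropout defines:

```python
import torch

def embed_ids(E: torch.Tensor, token_ids: torch.Tensor, p: float) -> torch.Tensor:
    """Lookup rows of E for each id; out-of-range ids become zero vectors.

    E: (vocab_size, embed_dim); token_ids: (seq_len,) integer tensor of natural numbers.
    The (1 - p) factor is an assumed stand-in for dropout_inference_spec."""
    vocab_size, embed_dim = E.shape
    out = torch.zeros(token_ids.shape[0], embed_dim, dtype=E.dtype)
    valid = token_ids < vocab_size          # out-of-range ids keep the zero row
    out[valid] = E[token_ids[valid]]
    return out * (1.0 - p)
```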
Instances For
Seq2Seq embedding forward pass for one-hot / token distributions.
This is the usual "embedding lookup as a matrix multiply":
- if E : (vocabSize × embedDim) is the embedding table,
- and x_t : (vocabSize) is a one-hot / probability vector for time step t,
- then the embedded vector is y_t = x_tᵀ · E : (embedDim).
PyTorch analogy: y = x @ E where x is one-hot / a distribution; this matches nn.Embedding
when the input is exactly one-hot.
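A quick PyTorch check of that claim (the table and ids below are made up for illustration):

```python
import torch
import torch.nn.functional as F

E = torch.randn(10, 4)                           # embedding table (vocab_size=10, embed_dim=4)
ids = torch.tensor([3, 7, 0])                    # a length-3 token sequence
onehot = F.one_hot(ids, num_classes=10).float()  # (3, 10)

# "Embedding lookup as a matrix multiply": x @ E equals row selection when x is one-hot.
assert torch.allclose(onehot @ E, E[ids])
```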
Instances For
Backward pass for Seq2SeqEmbeddingSpec.forwardOnehot.
This is just a time-distributed linear layer:
y_t = token_tᵀ · E
So:
- dE = Σ_t token_t ⊗ dY_t
- dToken_t = E · dY_t (not usually needed, but included for completeness)
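A small autograd sanity check of the dE formula (illustrative shapes; the sum of per-timestep outer products is just Xᵀ · dY for the whole sequence):

```python
import torch

E = torch.randn(10, 4, requires_grad=True)    # embedding table
X = torch.zeros(3, 10)                        # one-hot inputs for 3 timesteps
X[torch.arange(3), torch.tensor([3, 7, 0])] = 1.0

Y = X @ E                                     # forward: y_t = x_tᵀ · E
dY = torch.randn_like(Y)                      # pretend upstream gradients
Y.backward(dY)

# dE = Σ_t token_t ⊗ dY_t, i.e. Xᵀ @ dY over the whole sequence.
assert torch.allclose(E.grad, X.t() @ dY, atol=1e-6)
```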
Instances For
RNN-based encoder specification for Seq2Seq.
This models an nn.RNN-style encoder over embedded tokens:
- input is a sequence of embeddings (seqLen × embedDim),
- output is the full hidden-state sequence plus the final hidden state.
PyTorch analogue: nn.RNN(..., batch_first=True) (ignoring the batch axis), returning (output, h_n).
- rnn : RNNSpec α embedDim hiddenDim
RNN cell parameters.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to the input sequence).
Instances For
Forward pass for Seq2SeqRNNEncoderSpec.
Inputs:
- x : (seqLen × embedDim), embedded source tokens,
- h0, optional initial hidden state (hiddenDim).
Returns:
(outputs, final_h) where outputs : (seqLen × hiddenDim) is the per-timestep hidden sequence.
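The PyTorch analogue of this forward pass, ignoring the batch axis and the dropout scaling (shapes are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)                 # (batch=1, seqLen=5, embedDim=8)
outputs, h_n = rnn(x)                    # outputs: (1, 5, 16), h_n: (1, 1, 16)
final_h = h_n.squeeze(0).squeeze(0)      # the spec's final_h : (hiddenDim)
```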
Instances For
LSTM-based encoder specification for Seq2Seq.
This models an nn.LSTM-style encoder over embedded tokens, returning the full hidden sequence,
final hidden state, and final cell state.
PyTorch analogue: nn.LSTM(..., batch_first=True) (ignoring the batch axis), returning
(output, (h_n, c_n)).
- lstm : LSTMSpec α embedDim hiddenDim
LSTM cell parameters.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to the input sequence).
Instances For
Forward pass for Seq2SeqLSTMEncoderSpec.
Inputs:
- x : (seqLen × embedDim), embedded source tokens,
- h0, optional initial hidden state (hiddenDim),
- c0, optional initial cell state (hiddenDim).
Returns:
(outputs, final_h, final_c) where outputs : (seqLen × hiddenDim) is the per-timestep hidden sequence.
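And the LSTM analogue, again ignoring the batch axis and dropout (illustrative shapes):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
outputs, (h_n, c_n) = lstm(torch.randn(1, 5, 8))   # outputs: (1, 5, 16)
final_h, final_c = h_n[0, 0], c_n[0, 0]            # the spec's (final_h, final_c)
```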
Instances For
Transformer-based encoder specification for Seq2Seq.
This is a lightweight wrapper around a list of TransformerEncoderLayers from
NN.Spec.Models.Transformer, applied as a left-fold.
PyTorch analogue: nn.TransformerEncoder(nn.TransformerEncoderLayer(...), num_layers=...)
(ignoring dropout and most configuration knobs).
- layers : List (TransformerEncoderLayer numHeads embedDim (embedDim * 4) α)
Encoder layer stack; typically length numLayers, but not enforced by the spec.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to the input sequence).
Instances For
Forward pass for Seq2SeqTransformerEncoderSpec.
Input/output shape: (seqLen × embedDim).
This uses post-norm transformer layers from NN.Spec.Models.Transformer and does not model
dropout; it is meant as a clean semantic reference rather than a full training-ready implementation.
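"Applied as a left-fold" just means each layer's output feeds the next. A hedged PyTorch-flavoured sketch (the layer construction below is illustrative; the real layers come from NN.Spec.Models.Transformer and are post-norm with no dropout):

```python
import torch
import torch.nn as nn
from functools import reduce

embed_dim = 8
layers = [nn.TransformerEncoderLayer(d_model=embed_dim, nhead=2,
                                     dim_feedforward=embed_dim * 4,
                                     dropout=0.0, batch_first=True)
          for _ in range(2)]

x = torch.randn(1, 5, embed_dim)   # (batch=1, seqLen=5, embedDim=8)
# Left-fold over the layer stack: y = layerN(... layer2(layer1(x)) ...)
y = reduce(lambda acc, layer: layer(acc), layers, x)
```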
Instances For
RNN decoder specification for Seq2Seq.
This decoder consumes a sequence of target-side embeddings and produces vocabulary logits:
- an RNNSpec cell updates the hidden state per timestep,
- a time-distributed LinearSpec maps hidden states to logits,
- optionally, a self-attention block can be applied over the decoder input embeddings before the RNN.
PyTorch analogue: a hand-rolled decoder using nn.RNN and nn.Linear, optionally preceded by
nn.MultiheadAttention over the target embeddings (note: this is not encoder-decoder
cross-attention).
- rnn : RNNSpec α embedDim hiddenDim
Decoder RNN cell parameters.
- attention : Option ((numHeads : ℕ) × MultiHeadAttention α numHeads embedDim (embedDim / numHeads))
Optional self-attention block over decoder input embeddings.
- output_projection : LinearSpec α hiddenDim vocabSize
Output projection (hiddenDim -> vocabSize) producing per-timestep logits.
- dropout_rate : α
Dropout probability p (applied as dropout_inference_spec to decoder input embeddings).
Instances For
Seq2Seq decoder forward pass (teacher forcing).
- target_embeddings: Tensor of shape (tgtSeqLen × embedDim)
- h0: initial hidden state (hiddenDim)
- Returns: Tensor of shape (tgtSeqLen × vocabSize)
- If decoder.attention is some, this runs self-attention over target_embeddings and feeds the attended embedding at each timestep.
- Note: this spec does not model cross-attention over encoder outputs.
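A hand-rolled PyTorch analogue of this teacher-forcing pass, without the optional self-attention (names, shapes, and the helper function are illustrative only):

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 8, 16, 50
cell = nn.RNNCell(embed_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)

def decode_teacher_forcing(target_embeddings, h0):
    """target_embeddings: (tgtSeqLen, embedDim), h0: (hiddenDim) -> (tgtSeqLen, vocabSize)."""
    h = h0.unsqueeze(0)                                  # RNNCell expects a batch axis
    logits = []
    for t in range(target_embeddings.shape[0]):
        h = cell(target_embeddings[t].unsqueeze(0), h)   # one RNN step on the t-th embedding
        logits.append(proj(h))                           # time-distributed projection to logits
    return torch.cat(logits, dim=0)

out = decode_teacher_forcing(torch.randn(5, embed_dim), torch.zeros(hidden_dim))
print(out.shape)  # torch.Size([5, 50])
```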
Instances For
Decoder backward (teacher forcing) #
The decoder is: (optional self-attention) → RNN → time-distributed linear projection.
We compute gradients by:
- recomputing the attended embeddings (if any),
- recomputing the decoder hidden sequence,
- backpropagating through the output projection per timestep,
- backpropagating through the RNN sequence,
- optionally backpropagating through self-attention.
Backward pass for a time-distributed LinearSpec.
Given a hidden-state sequence hiddens : (tgtSeqLen × hiddenDim) and upstream gradients
grad_logits : (tgtSeqLen × vocabSize), computes:
- accumulated parameter gradients for the shared LinearSpec,
- gradients w.r.t. each hidden state (tgtSeqLen × hiddenDim).
PyTorch analogue: backprop through nn.Linear applied at each timestep.
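A compact sketch of that accumulation, with the per-timestep sums collapsed into whole-sequence matrix products (purely illustrative; the function name is made up):

```python
import torch

def time_distributed_linear_backward(W, hiddens, grad_logits):
    """W: (vocabSize, hiddenDim); hiddens: (T, hiddenDim); grad_logits: (T, vocabSize)."""
    dW = grad_logits.t() @ hiddens       # Σ_t grad_logits_t ⊗ hidden_t
    db = grad_logits.sum(dim=0)          # Σ_t grad_logits_t
    grad_hiddens = grad_logits @ W       # dh_t = Wᵀ · grad_logits_t for each t
    return dW, db, grad_hiddens
```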
Instances For
Backward pass for Seq2SeqDecoderSpec.forwardTeacherForcing.
Returns:
- RNN parameter gradients,
- output projection gradients,
- optional self-attention parameter gradients,
- gradient w.r.t. the target embeddings sequence,
- gradient w.r.t. the initial hidden state h0.
Implementation note: this spec recomputes the attended embeddings and hidden sequence to keep the backward pass self-contained (no mutable tape).
Instances For
Seq2Seq decoder forward pass (inference-time autoregressive decoding).
This runs a greedy decoding loop for maxLen steps, starting from:
- an initial hidden state h0,
- a starting input embedding vector (start_token already embedded),
- and a target embedding table tgt_embedding used to embed the predicted next token id.
Returns:
- the per-step logits (maxLen × vocabSize),
- the greedy-decoded token ids (maxLen).
PyTorch analogue: a manual decoding loop using nn.RNNCell/nn.RNN + nn.Linear, with
argmax sampling and embedding lookup each step.
Note: decoder.attention is only modeled in the teacher-forcing forward/backward in this file; the
greedy decoding loop below does not implement autoregressive self-attention.
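A manual greedy decoding loop in the same spirit (no self-attention, argmax sampling; the names and shapes are illustrative rather than the spec itself):

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size, max_len = 8, 16, 50, 6
cell = nn.RNNCell(embed_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)
tgt_embedding = torch.randn(vocab_size, embed_dim)    # target embedding table

def greedy_decode(start_embedding, h0):
    """start_embedding: (embedDim), h0: (hiddenDim) -> (maxLen, vocabSize) logits + token ids."""
    h, x = h0.unsqueeze(0), start_embedding.unsqueeze(0)
    all_logits, token_ids = [], []
    for _ in range(max_len):
        h = cell(x, h)
        logits = proj(h)                               # (1, vocabSize)
        next_id = int(logits.argmax(dim=-1))           # greedy choice
        all_logits.append(logits)
        token_ids.append(next_id)
        x = tgt_embedding[next_id].unsqueeze(0)        # embed the prediction for the next step
    return torch.cat(all_logits, dim=0), token_ids
```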
Instances For
Complete Seq2Seq model specification (baseline).
This bundles:
- source and target embedding tables,
- an RNN encoder,
- an RNN decoder with output projection (and optional decoder self-attention).
PyTorch analogue: a small encoder-decoder model built from nn.Embedding, nn.RNN, and
nn.Linear.
- src_embedding : Seq2SeqEmbeddingSpec α srcVocabSize embedDim
Source embedding table + dropout configuration.
- tgt_embedding : Seq2SeqEmbeddingSpec α tgtVocabSize embedDim
Target embedding table + dropout configuration.
- encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim
Encoder RNN parameters.
- decoder : Seq2SeqDecoderSpec α embedDim hiddenDim tgtVocabSize
Decoder parameters (RNN + output projection + optional self-attention).
Instances For
Seq2Seq forward pass for training (teacher forcing) using discrete token ids.
Inputs:
src_tokens : (srcSeqLen) and tgt_tokens : (tgtSeqLen) are token id tensors.
Output:
- logits of shape (tgtSeqLen × tgtVocabSize).
This path is convenient for runtime demos but intentionally treated as non-differentiable (token-id embedding lookup is not modeled as a differentiable op).
Instances For
Seq2Seq forward pass for inference-time decoding using discrete token ids.
This embeds the source token ids, encodes them to get an initial decoder hidden state, then runs
greedy decoding for maxTgtLen steps starting from the given start_token.
Returns:
- logits (maxTgtLen × tgtVocabSize),
- greedy-decoded token ids (maxTgtLen).
Instances For
Differentiable training + backward (one-hot inputs) #
This is the “full” training interface for the Seq2Seq baseline.
Differentiable forward pass for training (teacher forcing) using one-hot/token-distribution inputs.
This is the same computation as Seq2SeqSpec.forwardTraining, except that embedding lookup is
expressed as a matrix multiplication (forwardOnehot), so gradients can flow into the embedding
tables and back into upstream token distributions (if desired).
Instances For
Per-timestep cross-entropy loss for the differentiable Seq2Seq baseline.
Computes:
- logits via Seq2SeqSpec.forwardTrainingOnehot,
- probabilities via softmax,
- cross-entropy against the target token distribution at each timestep.
PyTorch analogue: nn.CrossEntropyLoss applied per timestep (with targets represented as one-hot / probability distributions).
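Because the targets are distributions rather than integer labels, the loss can be written directly with log_softmax; a short sketch (averaging over timesteps is an assumption here, the spec fixes its own normalization):

```python
import torch
import torch.nn.functional as F

def per_timestep_cross_entropy(logits, target_dist):
    """logits, target_dist: (tgtSeqLen, tgtVocabSize); each target row sums to 1."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()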
Instances For
Compute (loss, grads) for the Seq2Seq baseline under per-timestep cross-entropy.
This returns gradients for:
- both embedding tables,
- the encoder RNN,
- the decoder RNN,
- the decoder output projection,
- and decoder self-attention (if present).
Instances For
Attention-augmented Seq2Seq specification (simple encoder-output attention).
This record extends the baseline with an additional projection matrix used by the helper
attention functions below (compute_attention_weights_spec / apply_attention_spec).
Note: this file includes these attention helpers as a building block; the main baseline forward passes above do not integrate encoder-decoder cross-attention by default.
- src_embedding : Seq2SeqEmbeddingSpec α srcVocabSize embedDim
Source embedding table + dropout configuration.
- tgt_embedding : Seq2SeqEmbeddingSpec α tgtVocabSize embedDim
Target embedding table + dropout configuration.
- encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim
Encoder RNN parameters.
- decoder : Seq2SeqDecoderSpec α embedDim hiddenDim tgtVocabSize
Decoder parameters (RNN + output projection + optional self-attention).
- attention_weights : Tensor α (Shape.dim hiddenDim (Shape.dim hiddenDim Shape.scalar))
Attention projection matrix used to score encoder outputs against the decoder hidden state.
Instances For
Compute attention weights over encoder outputs for a single decoder hidden state.
This is a simple dot-product style attention:
- project the decoder hidden state (attention_weights · decoder_hidden),
- score each encoder hidden vector by an elementwise product + sum,
- normalize scores with softmax over the sequence axis.
It is inspired by classic encoder-decoder attention mechanisms (Bahdanau-style), but is intentionally kept compact in this spec.
Instances For
Apply attention weights to encoder outputs (weighted sum / context vector).
Given attention weights a : (seqLen) and encoder outputs H : (seqLen × hiddenDim), returns the
context vector c = Σ_i a_i · H_i : (hiddenDim).
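Putting the two helpers together, a rough PyTorch-style sketch (the projection and scoring details are a reading of the "elementwise product + sum" description above, not a verbatim port of the Lean functions):

```python
import torch
import torch.nn.functional as F

def compute_attention_weights(W_attn, decoder_hidden, encoder_outputs):
    """W_attn: (hiddenDim, hiddenDim); decoder_hidden: (hiddenDim); encoder_outputs: (seqLen, hiddenDim)."""
    projected = W_attn @ decoder_hidden                  # project the decoder hidden state
    scores = (encoder_outputs * projected).sum(dim=-1)   # elementwise product + sum per position
    return F.softmax(scores, dim=0)                      # normalize over the sequence axis

def apply_attention(weights, encoder_outputs):
    """weights: (seqLen); encoder_outputs: (seqLen, hiddenDim) -> context: (hiddenDim)."""
    return weights @ encoder_outputs                     # c = Σ_i a_i · H_i
```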