TorchLean API

NN.Spec.Models.Seq2seq

Seq2Seq (spec model) #

Encoder-decoder models for sequence generation.

This file supports both:

• token-id inputs (Tensor ℕ), used for runtime demos and intentionally non-differentiable, and
• one-hot / token-distribution inputs, used for the differentiable training path.

PyTorch mental model: nn.Embedding for token lookup, nn.RNN/nn.LSTM-style encoders, a hand-rolled RNN decoder with an nn.Linear output projection, and optional self-attention.

What we deliberately keep simple: there is no batch axis, dropout is modeled as deterministic inference-style scaling, and the baseline forward passes do not integrate encoder-decoder cross-attention.

The transformer encoder blocks used by the transformer variant come from NN/Spec/Models/Transformer.lean.

References:

PyTorch docs (for API intuition, not semantics):

Training + gradients (one-hot inputs) #

Most of this file focuses on architecture variants and forward passes (teacher-forcing, inference-time decoding, optional self-attention in the decoder, etc.).

To make Seq2Seq usable as a first-class baseline, we also provide an explicit training objective and reverse-mode gradients for the differentiable path:

Token-id based training (Tensor Nat) is still useful for demos, but it is intentionally treated as non-differentiable.

Small gradient records #

structure Spec.Seq2SeqLinearGrads (α : Type) (inDim outDim : ℕ) :

Gradients for a time-distributed affine map y = x·Wᵀ + b.

This mirrors the parameters in LinearSpec and is used for the decoder output projection. PyTorch analogue: the gradient pair for nn.Linear.

Instances For
    structure Spec.Seq2SeqRNNGrads (α : Type) (inputSize hiddenSize : ℕ) :

    Gradients for an RNNSpec cell.

    PyTorch analogue: the gradients for nn.RNN parameters (weight and bias).

    • dW : WeightMatrix α hiddenSize (inputSize + hiddenSize)

      Gradient of the concatenated input+hidden weight matrix.

    • db : HiddenVector α hiddenSize

      Gradient of the bias term.

    Instances For
      structure Spec.Seq2SeqEmbeddingGrads (α : Type) (vocabSize embedDim : ℕ) :

      Gradients for a token embedding table E : (vocabSize × embedDim).

      PyTorch analogue: nn.Embedding.weight.grad.

      Instances For
        structure Spec.Seq2SeqGrads (α : Type) (srcVocabSize tgtVocabSize embedDim hiddenDim : ℕ) :

        End-to-end gradient record for the differentiable Seq2Seq baseline.

        This bundles gradients for:

        • source/target embeddings,
        • encoder RNN,
        • decoder RNN,
        • decoder output projection,
        • optional decoder self-attention (if enabled in the decoder spec).

        • d_src_embedding : Seq2SeqEmbeddingGrads α srcVocabSize embedDim

          Gradients for the source embedding table.

        • d_tgt_embedding : Seq2SeqEmbeddingGrads α tgtVocabSize embedDim

          Gradients for the target embedding table.

        • d_encoder : Seq2SeqRNNGrads α embedDim hiddenDim

          Gradients for the encoder RNN parameters.

        • d_decoder_rnn : Seq2SeqRNNGrads α embedDim hiddenDim

          Gradients for the decoder RNN parameters.

        • d_output_projection : Seq2SeqLinearGrads α hiddenDim tgtVocabSize

          Gradients for the decoder output projection (hiddenDim -> tgtVocabSize).

        • d_decoder_attention : Option ((numHeads : ℕ) × MultiHeadAttentionGrads numHeads embedDim (embedDim / numHeads) α)

          Gradients for optional decoder self-attention parameters.

        Instances For
          structure Spec.Seq2SeqEmbeddingSpec (α : Type) [Numbers α] (vocabSize embedDim : ℕ) :

          Seq2Seq token embedding specification.

          Parameters:

          • embedding: a lookup table E : (vocabSize × embedDim),
          • dropout_rate: a scalar p used by dropout_inference_spec (deterministic scaling).

          PyTorch analogue: nn.Embedding(vocabSize, embedDim) plus a nn.Dropout(p) applied to the sequence of embeddings.

          • embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))

            Embedding table E : (vocabSize × embedDim).

          • dropout_rate : α

            Dropout probability p (used in a deterministic inference-style way).

          Instances For
            def Spec.Seq2SeqEmbeddingSpec.forward {α : Type} [Context α] {vocabSize embedDim seqLen : ℕ} (embedding : Seq2SeqEmbeddingSpec α vocabSize embedDim) (token_ids : Tensor ℕ (Shape.dim seqLen Shape.scalar)) :
            Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

            Embedding forward pass for discrete token ids.

            Inputs:

            • token_ids : (seqLen), a tensor of natural-number token ids.

            Output:

            • y : (seqLen × embedDim), where each timestep selects a row of the embedding table.

            Out-of-range token ids map to a zero vector in this spec. We then apply dropout_inference_spec deterministically (no RNG), which is useful for runtime demos.

            PyTorch analogue: nn.Embedding on an integer tensor, followed by nn.Dropout(p) (but with randomness replaced by deterministic scaling; see NN.Spec.Layers.Dropout).

            Instances For
              def Spec.Seq2SeqEmbeddingSpec.forwardOnehot {α : Type} [Context α] {vocabSize embedDim seqLen : ℕ} (embedding : Seq2SeqEmbeddingSpec α vocabSize embedDim) (token_onehot : Tensor α (Shape.dim seqLen (Shape.dim vocabSize Shape.scalar))) :
              Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

              Seq2Seq embedding forward pass for one-hot / token distributions.

              This is the usual "embedding lookup as a matrix multiply":

              • if E : (vocabSize × embedDim) is the embedding table,
              • and x_t : (vocabSize) is a one-hot / probability vector for time step t,
              • then the embedded vector is y_t = x_tᵀ · E : (embedDim).

              PyTorch analogue: y = x @ E where x is one-hot / a distribution; this matches nn.Embedding when the input is exactly one-hot.
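              As a tiny worked instance (hypothetical vocabSize = 3, embedDim = 2), the one-hot product recovers exactly one row of the table:

              ```latex
              % token id 1 as a one-hot vector:
              x_t = (0,\ 1,\ 0), \qquad
              E = \begin{pmatrix} e_{00} & e_{01} \\ e_{10} & e_{11} \\ e_{20} & e_{21} \end{pmatrix}
              \quad\Longrightarrow\quad
              y_t = x_t^{\top} E = (e_{10},\ e_{11}),
              ```

              i.e. row 1 of E, which is exactly what nn.Embedding returns for token id 1.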

              Instances For
                def Spec.Seq2SeqEmbeddingSpec.backwardOnehot {α : Type} [Context α] {vocabSize embedDim seqLen : ℕ} (embedding : Seq2SeqEmbeddingSpec α vocabSize embedDim) (token_onehot : Tensor α (Shape.dim seqLen (Shape.dim vocabSize Shape.scalar))) (grad_output : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) :
                Seq2SeqEmbeddingGrads α vocabSize embedDim × Tensor α (Shape.dim seqLen (Shape.dim vocabSize Shape.scalar))

                Backward pass for Seq2SeqEmbeddingSpec.forwardOnehot.

                This is just a time-distributed linear layer:

                y_t = token_tᵀ · E

                So:

                • dE = Σ_t token_t ⊗ dY_t
                • dToken_t = E · dY_t (not usually needed, but included for completeness)
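                Spelled out per component, both formulas follow directly from the scalar form of the forward pass:

                ```latex
                % Forward: y_{t,j} = \sum_{v} x_{t,v}\, E_{v,j}.
                % With upstream gradient g_t = \partial L / \partial y_t:
                \frac{\partial L}{\partial E_{v,j}} = \sum_{t} x_{t,v}\, g_{t,j}
                \;\;\Longleftrightarrow\;\; dE = \sum_{t} x_t \otimes g_t,
                \qquad
                \frac{\partial L}{\partial x_{t,v}} = \sum_{j} E_{v,j}\, g_{t,j}
                \;\;\Longleftrightarrow\;\; dx_t = E\, g_t.
                ```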
                Instances For
                  structure Spec.Seq2SeqRNNEncoderSpec (α : Type) [Numbers α] (embedDim hiddenDim : ℕ) :

                  RNN-based encoder specification for Seq2Seq.

                  This models an nn.RNN-style encoder over embedded tokens:

                  • input is a sequence of embeddings (seqLen × embedDim),
                  • output is the full hidden-state sequence plus the final hidden state.

                  PyTorch analogue: nn.RNN(..., batch_first=True) (ignoring the batch axis), returning (output, h_n).

                  • rnn : RNNSpec α embedDim hiddenDim

                    RNN cell parameters.

                  • dropout_rate : α

                    Dropout probability p (applied as dropout_inference_spec to the input sequence).

                  Instances For
                    def Spec.Seq2SeqRNNEncoderSpec.forward {α : Type} [Context α] {embedDim hiddenDim seqLen : ℕ} (encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim) (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) (h0 : Option (Tensor α (Shape.dim hiddenDim Shape.scalar))) :
                    Tensor α (Shape.dim seqLen (Shape.dim hiddenDim Shape.scalar)) × Tensor α (Shape.dim hiddenDim Shape.scalar)

                    Forward pass for Seq2SeqRNNEncoderSpec.

                    Inputs:

                    • x : (seqLen × embedDim), embedded source tokens,
                    • h0, optional initial hidden state (hiddenDim).

                    Returns:

                    • (outputs, final_h) where outputs : (seqLen × hiddenDim) is the per-timestep hidden sequence.
                    Instances For
                      @[irreducible]
                      def Spec.Seq2SeqRNNEncoderSpec.forward.process_sequence {α : Type} [Context α] {embedDim hiddenDim seqLen : ℕ} (encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim) (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) (i : ℕ) (h : Tensor α (Shape.dim hiddenDim Shape.scalar)) (acc : List (Tensor α (Shape.dim hiddenDim Shape.scalar))) (h1 : i ≤ seqLen) :
                      Instances For
                        structure Spec.Seq2SeqLSTMEncoderSpec (α : Type) [Numbers α] (embedDim hiddenDim : ℕ) :

                        LSTM-based encoder specification for Seq2Seq.

                        This models an nn.LSTM-style encoder over embedded tokens, returning the full hidden sequence, final hidden state, and final cell state.

                        PyTorch analogue: nn.LSTM(..., batch_first=True) (ignoring the batch axis), returning (output, (h_n, c_n)).

                        • lstm : LSTMSpec α embedDim hiddenDim

                          LSTM cell parameters.

                        • dropout_rate : α

                          Dropout probability p (applied as dropout_inference_spec to the input sequence).

                        Instances For
                          def Spec.Seq2SeqLSTMEncoderSpec.forward {α : Type} [Context α] {embedDim hiddenDim seqLen : ℕ} (encoder : Seq2SeqLSTMEncoderSpec α embedDim hiddenDim) (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) (h0 c0 : Option (Tensor α (Shape.dim hiddenDim Shape.scalar))) :
                          Tensor α (Shape.dim seqLen (Shape.dim hiddenDim Shape.scalar)) × Tensor α (Shape.dim hiddenDim Shape.scalar) × Tensor α (Shape.dim hiddenDim Shape.scalar)

                          Forward pass for Seq2SeqLSTMEncoderSpec.

                          Inputs:

                          • x : (seqLen × embedDim), embedded source tokens,
                          • h0, optional initial hidden state (hiddenDim),
                          • c0, optional initial cell state (hiddenDim).

                          Returns:

                          • (outputs, final_h, final_c) where outputs : (seqLen × hiddenDim) is the per-timestep hidden sequence.
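                          For intuition, the per-timestep update this models is the standard LSTM cell, shown here with a fused weight matrix over the concatenated input (LSTMSpec's exact parameter layout may differ; see its own file):

                          ```latex
                          \begin{pmatrix} i_t \\ f_t \\ g_t \\ o_t \end{pmatrix}
                          = \begin{pmatrix} \sigma \\ \sigma \\ \tanh \\ \sigma \end{pmatrix}
                          \!\bigl(W\,[x_t;\,h_{t-1}] + b\bigr),
                          \qquad
                          c_t = f_t \odot c_{t-1} + i_t \odot g_t,
                          \qquad
                          h_t = o_t \odot \tanh(c_t).
                          ```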
                          Instances For
                            @[irreducible]
                            def Spec.Seq2SeqLSTMEncoderSpec.forward.process_sequence {α : Type} [Context α] {embedDim hiddenDim seqLen : ℕ} (encoder : Seq2SeqLSTMEncoderSpec α embedDim hiddenDim) (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) (i : ℕ) (h c : Tensor α (Shape.dim hiddenDim Shape.scalar)) (acc : List (Tensor α (Shape.dim hiddenDim Shape.scalar))) (h1 : i ≤ seqLen) :
                            Instances For
                              structure Spec.Seq2SeqTransformerEncoderSpec (α : Type) [Context α] [Numbers α] (embedDim numHeads numLayers : ℕ) :

                              Transformer-based encoder specification for Seq2Seq.

                              This is a lightweight wrapper around a list of TransformerEncoderLayers from NN.Spec.Models.Transformer, applied as a left-fold.

                              PyTorch analogue: nn.TransformerEncoder(nn.TransformerEncoderLayer(...), num_layers=...) (ignoring dropout and most configuration knobs).

                              • layers : List (TransformerEncoderLayer numHeads embedDim (embedDim * 4) α)

                                Encoder layer stack; typically length numLayers, but not enforced by the spec.

                              • dropout_rate : α

                                Dropout probability p (applied as dropout_inference_spec to the input sequence).

                              Instances For
                                def Spec.Seq2SeqTransformerEncoderSpec.forward {α : Type} [Context α] {embedDim numHeads numLayers seqLen : ℕ} (encoder : Seq2SeqTransformerEncoderSpec α embedDim numHeads numLayers) (x : Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))) (h1 : seqLen > 0) (h2 : embedDim > 0) :
                                Tensor α (Shape.dim seqLen (Shape.dim embedDim Shape.scalar))

                                Forward pass for Seq2SeqTransformerEncoderSpec.

                                Input/output shape: (seqLen × embedDim).

                                This uses post-norm transformer layers from NN.Spec.Models.Transformer and does not model dropout; it is meant as a clean semantic reference rather than a full training-ready implementation.

                                Instances For
                                  structure Spec.Seq2SeqDecoderSpec (α : Type) [Numbers α] (embedDim hiddenDim vocabSize : ℕ) :

                                  RNN decoder specification for Seq2Seq.

                                  This decoder consumes a sequence of target-side embeddings and produces vocabulary logits:

                                  • an RNNSpec cell updates the hidden state per timestep,
                                  • a time-distributed LinearSpec maps hidden states to logits,
                                  • optionally, a self-attention block can be applied over the decoder input embeddings before the RNN.

                                  PyTorch analogue: a hand-rolled decoder using nn.RNN and nn.Linear, optionally preceded by nn.MultiheadAttention over the target embeddings (note: this is not encoder-decoder cross-attention).

                                  • rnn : RNNSpec α embedDim hiddenDim

                                    Decoder RNN cell parameters.

                                  • attention : Option ((numHeads : ℕ) × MultiHeadAttention α numHeads embedDim (embedDim / numHeads))

                                    Optional self-attention block over decoder input embeddings.

                                  • output_projection : LinearSpec α hiddenDim vocabSize

                                    Output projection (hiddenDim -> vocabSize) producing per-timestep logits.

                                  • dropout_rate : α

                                    Dropout probability p (applied as dropout_inference_spec to decoder input embeddings).

                                  Instances For
                                    def Spec.Seq2SeqDecoderSpec.forwardTeacherForcing {α : Type} [Context α] {embedDim hiddenDim vocabSize tgtSeqLen : ℕ} (decoder : Seq2SeqDecoderSpec α embedDim hiddenDim vocabSize) (target_embeddings : Tensor α (Shape.dim tgtSeqLen (Shape.dim embedDim Shape.scalar))) (h0 : Tensor α (Shape.dim hiddenDim Shape.scalar)) (h_len_nonzero : tgtSeqLen ≠ 0) :
                                    Tensor α (Shape.dim tgtSeqLen (Shape.dim vocabSize Shape.scalar))

                                    Seq2Seq decoder forward pass (teacher forcing)

                                    • target_embeddings : Tensor of shape (tgtSeqLen × embedDim)
                                    • h0 : initial hidden state (hiddenDim)
                                    • Returns: Tensor of shape (tgtSeqLen × vocabSize)
                                    • If decoder.attention is some, this runs self-attention over target_embeddings and feeds the attended embedding at each timestep.
                                    • Note: this spec does not model cross-attention over encoder outputs.
                                    Instances For
                                      @[irreducible]
                                      def Spec.Seq2SeqDecoderSpec.forwardTeacherForcing.process_sequence {α : Type} [Context α] {embedDim hiddenDim vocabSize tgtSeqLen : ℕ} (decoder : Seq2SeqDecoderSpec α embedDim hiddenDim vocabSize) (attended_embeddings : Tensor α (Shape.dim tgtSeqLen (Shape.dim embedDim Shape.scalar))) (i : ℕ) (h : Tensor α (Shape.dim hiddenDim Shape.scalar)) (acc : List (Tensor α (Shape.dim vocabSize Shape.scalar))) (h_bound : i ≤ tgtSeqLen) :
                                      Instances For

                                        Decoder backward (teacher forcing) #

                                        The decoder is: (optional self-attention) → RNN → time-distributed linear projection.

                                        We compute gradients by:

                                        1. recomputing the attended embeddings (if any),
                                        2. recomputing the decoder hidden sequence,
                                        3. backpropagating through the output projection per timestep,
                                        4. backpropagating through the RNN sequence,
                                        5. optionally backpropagating through self-attention.
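                                        As a sketch of steps 3-4, assuming a tanh cell h_t = tanh(W [x_t; h_{t-1}] + b) (the nn.RNN default; RNNSpec's actual nonlinearity is defined in its own file), the per-timestep backward recursion is:

                                        ```latex
                                        \bar h_t = W_{\mathrm{proj}}^{\top} g_t + \bar h_t^{\,(\text{from } t+1)},
                                        \qquad
                                        \bar z_t = (1 - h_t \odot h_t) \odot \bar h_t,
                                        \qquad
                                        dW \mathrel{+}= \bar z_t \otimes [x_t;\,h_{t-1}],
                                        \quad
                                        db \mathrel{+}= \bar z_t,
                                        \qquad
                                        [\bar x_t;\ \bar h_{t-1}] = W^{\top} \bar z_t,
                                        ```

                                        where g_t is the upstream logit gradient at time t and x̄_t feeds step 5 (or is returned as the embedding-sequence gradient).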
                                        def Spec.timeDistributedLinearBackward {α : Type} [Context α] {tgtSeqLen hiddenDim vocabSize : ℕ} (layer : LinearSpec α hiddenDim vocabSize) (hiddens : Tensor α (Shape.dim tgtSeqLen (Shape.dim hiddenDim Shape.scalar))) (grad_logits : Tensor α (Shape.dim tgtSeqLen (Shape.dim vocabSize Shape.scalar))) :
                                        Seq2SeqLinearGrads α hiddenDim vocabSize × Tensor α (Shape.dim tgtSeqLen (Shape.dim hiddenDim Shape.scalar))

                                        Backward pass for a time-distributed LinearSpec.

                                        Given a hidden-state sequence hiddens : (tgtSeqLen × hiddenDim) and upstream gradients grad_logits : (tgtSeqLen × vocabSize), computes:

                                        • accumulated parameter gradients for the shared LinearSpec,
                                        • gradients w.r.t. each hidden state (tgtSeqLen × hiddenDim).

                                        PyTorch analogue: backprop through nn.Linear applied at each timestep.
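                                        The accumulated gradients are the usual linear-layer formulas, summed over timesteps because the parameters are shared:

                                        ```latex
                                        % Forward per timestep: \ell_t = W h_t + b, \quad W : (\text{vocabSize} \times \text{hiddenDim}).
                                        dW = \sum_{t} g_t \otimes h_t,
                                        \qquad
                                        db = \sum_{t} g_t,
                                        \qquad
                                        \bar h_t = W^{\top} g_t,
                                        \qquad
                                        g_t = \partial L / \partial \ell_t .
                                        ```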

                                        Instances For
                                          def Spec.Seq2SeqDecoderSpec.backwardTeacherForcing {α : Type} [Context α] {embedDim hiddenDim vocabSize tgtSeqLen : ℕ} (decoder : Seq2SeqDecoderSpec α embedDim hiddenDim vocabSize) (target_embeddings : Tensor α (Shape.dim tgtSeqLen (Shape.dim embedDim Shape.scalar))) (h0 : Tensor α (Shape.dim hiddenDim Shape.scalar)) (h_len_nonzero : tgtSeqLen ≠ 0) (grad_logits : Tensor α (Shape.dim tgtSeqLen (Shape.dim vocabSize Shape.scalar))) :
                                          Seq2SeqRNNGrads α embedDim hiddenDim × Seq2SeqLinearGrads α hiddenDim vocabSize × Option ((numHeads : ℕ) × MultiHeadAttentionGrads numHeads embedDim (embedDim / numHeads) α) × Tensor α (Shape.dim tgtSeqLen (Shape.dim embedDim Shape.scalar)) × Tensor α (Shape.dim hiddenDim Shape.scalar)

                                          Backward pass for Seq2SeqDecoderSpec.forwardTeacherForcing.

                                          Returns:

                                          • RNN parameter gradients,
                                          • output projection gradients,
                                          • optional self-attention parameter gradients,
                                          • gradient w.r.t. the target embeddings sequence,
                                          • gradient w.r.t. the initial hidden state h0.

                                          Implementation note: this spec recomputes the attended embeddings and hidden sequence to keep the backward pass self-contained (no mutable tape).

                                          Instances For
                                            def Spec.Seq2SeqDecoderSpec.forwardInference {α : Type} [Context α] {embedDim hiddenDim vocabSize : ℕ} (decoder : Seq2SeqDecoderSpec α embedDim hiddenDim vocabSize) (h0 : Tensor α (Shape.dim hiddenDim Shape.scalar)) (start_token : Tensor α (Shape.dim embedDim Shape.scalar)) (tgt_embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))) (maxLen : ℕ) (h_len : maxLen ≠ 0) (h3 : vocabSize ≠ 0) :
                                            Tensor α (Shape.dim maxLen (Shape.dim vocabSize Shape.scalar)) × Tensor ℕ (Shape.dim maxLen Shape.scalar)

                                            Seq2Seq decoder forward pass (inference-time autoregressive decoding).

                                            This runs a greedy decoding loop for maxLen steps, starting from:

                                            • an initial hidden state h0,
                                            • a starting input embedding vector (start_token already embedded),
                                            • and a target embedding table tgt_embedding used to embed the predicted next token id.

                                            Returns:

                                            • the per-step logits (maxLen × vocabSize),
                                            • the greedy-decoded token ids (maxLen).

                                            PyTorch analogue: a manual decoding loop using nn.RNNCell/nn.RNN + nn.Linear, with argmax sampling and embedding lookup each step.

                                            Note: decoder.attention is only modeled in the teacher-forcing forward/backward in this file; the greedy decoding loop below does not implement autoregressive self-attention.
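                                            One greedy step can be summarized as (writing E^tgt for the target embedding table and W_proj, b for the output projection):

                                            ```latex
                                            h_i = \mathrm{rnn}(x_i,\ h_{i-1}),
                                            \qquad
                                            \ell_i = W_{\mathrm{proj}}\, h_i + b,
                                            \qquad
                                            t_i = \arg\max_{v < \text{vocabSize}} \ell_{i,v},
                                            \qquad
                                            x_{i+1} = E^{\mathrm{tgt}}_{t_i}.
                                            ```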

                                            Instances For
                                              @[irreducible]
                                              def Spec.Seq2SeqDecoderSpec.forwardInference.process_sequence {α : Type} [Context α] {embedDim hiddenDim vocabSize : ℕ} (decoder : Seq2SeqDecoderSpec α embedDim hiddenDim vocabSize) (tgt_embedding : Tensor α (Shape.dim vocabSize (Shape.dim embedDim Shape.scalar))) (maxLen i : ℕ) (h : Tensor α (Shape.dim hiddenDim Shape.scalar)) (current_input : Tensor α (Shape.dim embedDim Shape.scalar)) (acc_logits : List (Tensor α (Shape.dim vocabSize Shape.scalar))) (acc_tokens : List ℕ) (h_bound : i ≤ maxLen) (h_len : maxLen ≠ 0) (h_vocab : vocabSize > 0) :
                                              Instances For
                                                @[irreducible]
                                                def Spec.Seq2SeqDecoderSpec.forwardInference.process_sequence.find_max {α : Type} [Context α] {vocabSize : ℕ} (logits : Fin vocabSize → Tensor α Shape.scalar) (j : ℕ) (max_val : α) (max_idx : ℕ) :
                                                Instances For
                                                  structure Spec.Seq2SeqSpec (α : Type) [Numbers α] (srcVocabSize tgtVocabSize embedDim hiddenDim : ℕ) :

                                                  Complete Seq2Seq model specification (baseline).

                                                  This bundles:

                                                  • source and target embedding tables,
                                                  • an RNN encoder,
                                                  • an RNN decoder with output projection (and optional decoder self-attention).

                                                  PyTorch analogue: a small encoder-decoder model built from nn.Embedding, nn.RNN, and nn.Linear.

                                                  • src_embedding : Seq2SeqEmbeddingSpec α srcVocabSize embedDim

                                                    Source embedding table + dropout configuration.

                                                  • tgt_embedding : Seq2SeqEmbeddingSpec α tgtVocabSize embedDim

                                                    Target embedding table + dropout configuration.

                                                  • encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim

                                                    Encoder RNN parameters.

                                                  • decoder : Seq2SeqDecoderSpec α embedDim hiddenDim tgtVocabSize

                                                    Decoder parameters (RNN + output projection + optional self-attention).

                                                  Instances For
                                                    def Spec.Seq2SeqSpec.forwardTraining {α : Type} [Context α] {srcVocabSize tgtVocabSize embedDim hiddenDim srcSeqLen tgtSeqLen : ℕ} (model : Seq2SeqSpec α srcVocabSize tgtVocabSize embedDim hiddenDim) (src_tokens : Tensor ℕ (Shape.dim srcSeqLen Shape.scalar)) (tgt_tokens : Tensor ℕ (Shape.dim tgtSeqLen Shape.scalar)) (_h1 : srcVocabSize ≠ 0) (_h2 : tgtVocabSize ≠ 0) (_h3 : embedDim ≠ 0) (_h4 : hiddenDim ≠ 0) (_h5 : srcSeqLen ≠ 0) (h6 : tgtSeqLen ≠ 0) :
                                                    Tensor α (Shape.dim tgtSeqLen (Shape.dim tgtVocabSize Shape.scalar))

                                                    Seq2Seq forward pass for training (teacher forcing) using discrete token ids.

                                                    Inputs:

                                                    • src_tokens : (srcSeqLen) and tgt_tokens : (tgtSeqLen) are token id tensors.

                                                    Output:

                                                    • logits of shape (tgtSeqLen × tgtVocabSize).

                                                    This path is convenient for runtime demos but intentionally treated as non-differentiable (token-id embedding lookup is not modeled as a differentiable op).

                                                    Instances For
                                                      def Spec.Seq2SeqSpec.forwardInference {α : Type} [Context α] {srcVocabSize tgtVocabSize embedDim hiddenDim srcSeqLen : ℕ} (maxTgtLen : ℕ) (model : Seq2SeqSpec α srcVocabSize tgtVocabSize embedDim hiddenDim) (src_tokens : Tensor ℕ (Shape.dim srcSeqLen Shape.scalar)) (start_token : ℕ) (_h1 : srcVocabSize ≠ 0) (h2 : tgtVocabSize ≠ 0) (_h3 : embedDim ≠ 0) (_h4 : hiddenDim ≠ 0) (_h5 : srcSeqLen ≠ 0) (h6 : maxTgtLen ≠ 0) :
                                                      Tensor α (Shape.dim maxTgtLen (Shape.dim tgtVocabSize Shape.scalar)) × Tensor ℕ (Shape.dim maxTgtLen Shape.scalar)

                                                      Seq2Seq forward pass for inference-time decoding using discrete token ids.

                                                      This embeds the source token ids, encodes them to get an initial decoder hidden state, then runs greedy decoding for maxTgtLen steps starting from the given start_token.

                                                      Returns:

                                                      • logits (maxTgtLen × tgtVocabSize),
                                                      • greedy-decoded token ids (maxTgtLen).
                                                      Instances For

                                                        Differentiable training + backward (one-hot inputs) #

                                                        This is the “full” training interface for the Seq2Seq baseline.

                                                        def Spec.Seq2SeqSpec.forwardTrainingOnehot {α : Type} [Context α] {srcVocabSize tgtVocabSize embedDim hiddenDim srcSeqLen tgtSeqLen : ℕ} (model : Seq2SeqSpec α srcVocabSize tgtVocabSize embedDim hiddenDim) (src_onehot : Tensor α (Shape.dim srcSeqLen (Shape.dim srcVocabSize Shape.scalar))) (tgt_onehot : Tensor α (Shape.dim tgtSeqLen (Shape.dim tgtVocabSize Shape.scalar))) (_hSrc : srcSeqLen ≠ 0) (hTgt : tgtSeqLen ≠ 0) :
                                                        Tensor α (Shape.dim tgtSeqLen (Shape.dim tgtVocabSize Shape.scalar))

                                                        Differentiable forward pass for training (teacher forcing) using one-hot/token-distribution inputs.

                                                        This is the same computation as Seq2SeqSpec.forwardTraining, except that embedding lookup is expressed as a matrix multiplication (forwardOnehot), so gradients can flow into the embedding tables and back into upstream token distributions (if desired).

                                                        Instances For
                                                          def Spec.Seq2SeqSpec.crossEntropyLossOnehot {α : Type} [Context α] {srcVocabSize tgtVocabSize embedDim hiddenDim srcSeqLen tgtSeqLen : ℕ} (model : Seq2SeqSpec α srcVocabSize tgtVocabSize embedDim hiddenDim) (src_onehot : Tensor α (Shape.dim srcSeqLen (Shape.dim srcVocabSize Shape.scalar))) (tgt_onehot : Tensor α (Shape.dim tgtSeqLen (Shape.dim tgtVocabSize Shape.scalar))) (hSrc : srcSeqLen ≠ 0) (hTgt : tgtSeqLen ≠ 0) :
                                                          α

                                                          Per-timestep cross-entropy loss for the differentiable Seq2Seq baseline.

                                                          Computes:

                                                          1. logits via Seq2SeqSpec.forwardTrainingOnehot,
                                                          2. probabilities via softmax,
                                                          3. cross-entropy against the target token distribution at each timestep.

                                                          PyTorch analogue: nn.CrossEntropyLoss applied per timestep (with probabilities represented as one-hot).
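                                                          Per timestep, with p_t = softmax(ℓ_t) and target distribution y_t, the loss and its well-known logit gradient are (whether timestep losses are summed or averaged is a normalization detail this summary does not pin down):

                                                          ```latex
                                                          L = \sum_{t} \Bigl(-\sum_{v} y_{t,v} \log p_{t,v}\Bigr),
                                                          \qquad
                                                          \frac{\partial L}{\partial \ell_{t,v}} = p_{t,v} - y_{t,v}
                                                          \quad\text{(valid when } \textstyle\sum_{v} y_{t,v} = 1\text{, e.g. one-hot targets)}.
                                                          ```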

                                                          Instances For
                                                            def Spec.Seq2SeqSpec.crossEntropyGradOnehot {α : Type} [Context α] {srcVocabSize tgtVocabSize embedDim hiddenDim srcSeqLen tgtSeqLen : ℕ} (model : Seq2SeqSpec α srcVocabSize tgtVocabSize embedDim hiddenDim) (src_onehot : Tensor α (Shape.dim srcSeqLen (Shape.dim srcVocabSize Shape.scalar))) (tgt_onehot : Tensor α (Shape.dim tgtSeqLen (Shape.dim tgtVocabSize Shape.scalar))) (_hSrc : srcSeqLen ≠ 0) (hTgt : tgtSeqLen ≠ 0) :
                                                            α × Seq2SeqGrads α srcVocabSize tgtVocabSize embedDim hiddenDim

                                                            Compute (loss, grads) for the Seq2Seq baseline under per-timestep cross-entropy.

                                                            This returns gradients for:

                                                            • both embedding tables,
                                                            • the encoder RNN,
                                                            • the decoder RNN,
                                                            • the decoder output projection,
                                                            • and decoder self-attention (if present).
                                                            Instances For
                                                              structure Spec.AttentionSeq2SeqSpec (α : Type) [Numbers α] (srcVocabSize tgtVocabSize embedDim hiddenDim : ℕ) :

                                                              Attention-augmented Seq2Seq specification (simple encoder-output attention).

                                                              This record extends the baseline with an additional projection matrix used by the helper attention functions below (compute_attention_weights_spec / apply_attention_spec).

                                                              Note: this file includes these attention helpers as a building block; the main baseline forward passes above do not integrate encoder-decoder cross-attention by default.

                                                              • src_embedding : Seq2SeqEmbeddingSpec α srcVocabSize embedDim

                                                                Source embedding table + dropout configuration.

                                                              • tgt_embedding : Seq2SeqEmbeddingSpec α tgtVocabSize embedDim

                                                                Target embedding table + dropout configuration.

                                                              • encoder : Seq2SeqRNNEncoderSpec α embedDim hiddenDim

                                                                Encoder RNN parameters.

                                                              • decoder : Seq2SeqDecoderSpec α embedDim hiddenDim tgtVocabSize

                                                                Decoder parameters (RNN + output projection + optional self-attention).

                                                              • attention_weights : Tensor α (Shape.dim hiddenDim (Shape.dim hiddenDim Shape.scalar))

                                                                Attention projection matrix used to score encoder outputs against the decoder hidden state.

                                                              Instances For
                                                                def Spec.computeAttentionWeightsSpec {α : Type} [Context α] {hiddenDim seqLen : ℕ} (attention_weights : Tensor α (Shape.dim hiddenDim (Shape.dim hiddenDim Shape.scalar))) (decoder_hidden : Tensor α (Shape.dim hiddenDim Shape.scalar)) (encoder_outputs : Tensor α (Shape.dim seqLen (Shape.dim hiddenDim Shape.scalar))) (h1 : hiddenDim ≠ 0) (_h2 : seqLen ≠ 0) :

                                                                Compute attention weights over encoder outputs for a single decoder hidden state.

                                                                This is a simple dot-product style attention:

                                                                1. project the decoder hidden state (attention_weights · decoder_hidden),
                                                                2. score each encoder hidden vector by an elementwise product + sum,
                                                                3. normalize scores with softmax over the sequence axis.

                                                                It is inspired by classic encoder-decoder attention mechanisms (Bahdanau-style), but is intentionally kept compact in this spec.
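                                                                In symbols, with W_a the projection matrix, h^dec the decoder hidden state, and H_i the i-th encoder output vector:

                                                                ```latex
                                                                q = W_a\, h^{\mathrm{dec}},
                                                                \qquad
                                                                e_i = \sum_{k} q_k\, H_{i,k} = \langle q,\ H_i \rangle,
                                                                \qquad
                                                                a = \operatorname{softmax}(e_1, \dots, e_{\mathrm{seqLen}}).
                                                                ```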

                                                                Instances For
                                                                  def Spec.applyAttentionSpec {α : Type} [Context α] {hiddenDim seqLen : ℕ} (attention_weights : Tensor α (Shape.dim seqLen Shape.scalar)) (encoder_outputs : Tensor α (Shape.dim seqLen (Shape.dim hiddenDim Shape.scalar))) (h1 : seqLen ≠ 0) (_h2 : hiddenDim ≠ 0) :

                                                                  Apply attention weights to encoder outputs (weighted sum / context vector).

                                                                  Given attention weights a : (seqLen) and encoder outputs H : (seqLen × hiddenDim), returns the context vector c = Σ_i a_i · H_i : (hiddenDim).

                                                                  Instances For