LSTM models (spec) #
Higher‑level LSTM architectures built from module specs (SpecChain), including:
- sequence‑to‑sequence outputs,
- classifier heads (many‑to‑one),
- multi‑layer compositions.
Cell equations are in NN/Spec/Layers/Lstm.lean; this file focuses on composing modules.
References (math + PyTorch behavior):
- Hochreiter and Schmidhuber (1997), "Long Short-Term Memory" (original LSTM): https://www.bioinf.jku.at/publications/older/2604.pdf
- PyTorch nn.LSTM docs: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- PyTorch nn.LSTMCell docs: https://pytorch.org/docs/stable/generated/torch.nn.LSTMCell.html
“Fully implemented” LSTM models #
The LSTM model layer exposes first-class model objects with:
- a forward pass,
- a standard training objective, and
- an explicit reverse-mode / BPTT backward pass producing parameter gradients.
This section provides those “full” APIs for the simple LSTM models in this file by reusing the
gate-aware BPTT implementation in NN/Spec/Layers/Lstm.lean.
Gradient records #
Gradients for a linear layer y = W x + b.
This is the natural gradient bundle for Spec.LinearSpec (PyTorch analogue: torch.nn.Linear),
with dW matching the weight shape [outDim, inDim] and db matching [outDim].
- dW : Tensor α (Shape.dim outDim (Shape.dim inDim Shape.scalar))
Gradient of the loss with respect to the weight matrix W.
- db : Tensor α (Shape.dim outDim Shape.scalar)
Gradient of the loss with respect to the bias b.
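For reference, with a single input x and upstream gradient g = ∂L/∂y, these fields hold the standard reverse-mode gradients of a linear layer (when the head is applied per timestep, the backward pass accumulates them over t):

$$
\frac{\partial L}{\partial W} = g\, x^{\top}, \qquad
\frac{\partial L}{\partial b} = g, \qquad
\frac{\partial L}{\partial x} = W^{\top} g .
$$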
Gate-wise gradients for an LSTM cell.
This matches the parameterization used by Spec.LSTMSpec (see NN/Spec/Layers/Lstm.lean): each
gate has a weight matrix of shape [hiddenSize, inputSize + hiddenSize] applied to the concatenated
[input; hidden] vector, plus a bias of shape [hiddenSize].
- d_forget_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the forget-gate weight matrix.
- d_forget_bias : HiddenVector α hiddenSize
Gradient with respect to the forget-gate bias.
- d_input_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the input-gate weight matrix.
- d_input_bias : HiddenVector α hiddenSize
Gradient with respect to the input-gate bias.
- d_candidate_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the candidate (cell-update) weight matrix.
- d_candidate_bias : HiddenVector α hiddenSize
Gradient with respect to the candidate (cell-update) bias.
- d_output_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the output-gate weight matrix.
- d_output_bias : HiddenVector α hiddenSize
Gradient with respect to the output-gate bias.
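For orientation, writing z_t = [x_t; h_{t-1}] for the concatenated vector, this parameterization corresponds to the standard LSTM cell equations:

$$
\begin{aligned}
f_t &= \sigma(W_f z_t + b_f), & i_t &= \sigma(W_i z_t + b_i), \\
\tilde{c}_t &= \tanh(W_c z_t + b_c), & o_t &= \sigma(W_o z_t + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, & h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

so each d_*_weights field has exactly the [hiddenSize, inputSize + hiddenSize] shape of its gate's weight matrix, and each d_*_bias the [hiddenSize] shape of its bias.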
Parameter gradients for SimpleLSTMModel.
This bundles the LSTM cell gradients and the time-distributed linear head gradients.
- lstm : LSTMGrads α inputSize hiddenSize
Gradients for the LSTM cell parameters.
- output_layer : LinearGrads α hiddenSize outputSize
Gradients for the time-distributed linear output head.
Sequence-to-sequence LSTM model as a SpecChain: LSTM over time, then a per-timestep linear head.
PyTorch analogue: nn.LSTM producing an output sequence, followed by nn.Linear applied at each
time step.
Many-to-one LSTM classifier as a SpecChain.
This runs an LSTM over the sequence and applies a linear classifier head to the final hidden state.
PyTorch analogue: nn.LSTM + nn.Linear, taking the last output/hidden.
Two-layer LSTM stack (sequence-to-sequence), followed by a per-timestep linear head.
The second LSTM consumes the hidden stream produced by the first.
Simple LSTM language-model pipeline as a SpecChain: embedding, LSTM core, and output projection.
In this spec layer we represent the embedding/projection as LinearSpecs (often used with one-hot
token vectors). PyTorch analogue: nn.Embedding (conceptually) + nn.LSTM + nn.Linear.
Bundle of parameters for a single-layer LSTM model with a linear output head.
This is a direct record representation (as opposed to the SpecChain representation above).
- lstm : LSTMSpec α inputSize hiddenSize
The LSTM cell parameters.
- output_layer : LinearSpec α hiddenSize outputSize
The linear output head applied to each hidden state.
Bundle of parameters for a multi-layer LSTM model with a linear output head.
The first layer consumes inputSize, and all subsequent layers consume hiddenSize.
- first_layer : LSTMSpec α inputSize hiddenSize
The first LSTM layer, consuming inputSize features.
- output_layer : LinearSpec α hiddenSize outputSize
The linear output head applied to the top hidden stream.
Bundle of parameters for a many-to-one LSTM classifier.
The classifier head is applied to the final hidden state.
- lstm : LSTMSpec α inputSize hiddenSize
The LSTM cell parameters.
- classifier : LinearSpec α hiddenSize numClasses
The classifier head applied to the final hidden state.
Bundle of parameters for a many-to-many LSTM generator (language-model style).
This includes an (embedding) linear map, a recurrent core, and an output projection back to vocabulary space.
- embedding : LinearSpec α vocabSize hiddenSize
The embedding map from (one-hot) token vectors into hidden features.
- lstm : LSTMSpec α hiddenSize hiddenSize
The recurrent core.
- output_projection : LinearSpec α hiddenSize vocabSize
The projection from hidden states back to vocabulary logits.
Bundle of parameters for a bidirectional LSTM model with an output head.
The head consumes the concatenation of forward and backward hidden states.
PyTorch analogue: nn.LSTM(..., bidirectional=True) plus a linear projection.
- forward_lstm : LSTMSpec α inputSize hiddenSize
The LSTM run over the sequence in forward time order.
- backward_lstm : LSTMSpec α inputSize hiddenSize
The LSTM run over the reversed sequence.
- output_layer : LinearSpec α (hiddenSize + hiddenSize) outputSize
The output head applied to the concatenated forward/backward hidden states.
Bundle of parameters for a stacked LSTM language model with deterministic dropout.
This model uses a list of LSTM layers (all with hiddenSize input/output) and applies a
dropout_inference_spec scaling step between the recurrent stack and the output projection.
- embedding : LinearSpec α vocabSize hiddenSize
The embedding map from (one-hot) token vectors into hidden features.
- lstm_layers : List (LSTMSpec α hiddenSize hiddenSize)
The stacked LSTM layers; each consumes and produces hiddenSize features.
- output_projection : LinearSpec α hiddenSize vocabSize
The projection from hidden states back to vocabulary logits.
- dropout_rate : α
The rate used by the dropout_inference_spec scaling step.
One-step forward pass for SimpleLSTMModel.
Given an input vector and the previous LSTM state (hidden, cell), compute (output, new_state).
PyTorch analogue: nn.LSTMCell step followed by a nn.Linear head.
Sequence forward pass for SimpleLSTMModel.
Runs the LSTM over all timesteps (time-major), applies the output head to each hidden state, and
returns (outputs, final_state).
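Concretely, with initial state (h_0, c_0) and time-major inputs x_1, …, x_T:

$$
(h_t, c_t) = \mathrm{cell}(x_t, (h_{t-1}, c_{t-1})), \qquad
y_t = W_{\mathrm{out}}\, h_t + b_{\mathrm{out}}, \qquad t = 1, \dots, T,
$$

returning the sequence (y_1, …, y_T) together with the final state (h_T, c_T).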
Backward pass (BPTT) for the simple LSTM sequence model #
This is the model-level analogue of Spec.lstm_sequence_backward_spec. The only extra work we do
here is to backprop through the per-timestep output projection and feed its gradient into the LSTM
sequence backward pass.
Backprop through the time-distributed linear head and produce hidden-state gradients.
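Because the head's parameters are shared across timesteps, their gradients accumulate over time. Given upstream gradients dy_t and the forward hidden states h_t, the standard expressions are:

$$
\frac{\partial L}{\partial W_{\mathrm{out}}} = \sum_{t=1}^{T} dy_t\, h_t^{\top}, \qquad
\frac{\partial L}{\partial b_{\mathrm{out}}} = \sum_{t=1}^{T} dy_t, \qquad
dh_t = W_{\mathrm{out}}^{\top}\, dy_t,
$$

and the dh_t stream is what gets fed into the LSTM sequence backward pass.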
Backward pass for simple_lstm_sequence_forward.
Returns:
- parameter gradients (SimpleLSTMModelGrads),
- the gradient w.r.t. the input sequence (dInputs), and
- the gradient w.r.t. the initial recurrent state (dInitialState).
MSE loss for the simple LSTM sequence model.
This runs simple_lstm_sequence_forward and compares the predicted output sequence against
targets using mse_spec.
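With predictions y_t and targets y*_t this is the usual squared-error objective; whether mse_spec averages or sums is that spec's convention, so the normalization Z below is left abstract:

$$
L = \frac{1}{Z} \sum_{t=1}^{T} \lVert y_t - y^{*}_t \rVert^2, \qquad
\frac{\partial L}{\partial y_t} = \frac{2}{Z}\,(y_t - y^{*}_t).
$$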
Compute (loss, grads) for the simple LSTM sequence model under MSE.
This is the “full training API” building block: an optimizer (SGD/Adam) can consume these grads.
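As a shape-of-the-API sketch only: Params, Grads, lossAndGrads, and sgdStep below are toy stand-ins for SimpleLSTMModel, SimpleLSTMModelGrads, simple_lstm_loss_and_grads, and an SGD update, not definitions from this file. The point is the (loss, grads) → optimizer-step pattern.

```lean
-- Toy stand-ins: a scalar "model" w * x + b under squared error.
structure Params where
  w : Float
  b : Float

structure Grads where
  dw : Float
  db : Float

-- Stand-in for simple_lstm_loss_and_grads: forward pass, loss, and
-- analytic gradients in one call.
def lossAndGrads (p : Params) (x target : Float) : Float × Grads :=
  let pred := p.w * x + p.b
  let err := pred - target
  (err * err, { dw := 2 * err * x, db := 2 * err })

-- One SGD step: an optimizer only needs the gradient bundle.
def sgdStep (p : Params) (g : Grads) (lr : Float) : Params :=
  { w := p.w - lr * g.dw, b := p.b - lr * g.db }

#eval
  let p : Params := { w := 0, b := 0 }
  let (loss, g) := lossAndGrads p 1.0 2.0
  let p' := sgdStep p g 0.1
  (loss, p'.w, p'.b)  -- ≈ (4.0, 0.4, 0.4)
```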
Forward pass for an LSTMClassifier (many-to-one).
This uses the final hidden state of the LSTM sequence as the classifier input.
Backward for the classifier head (many-to-one) #
The classifier only consumes the final hidden state. We express that by feeding a gradient sequence that is zero everywhere except the last timestep.
Backward pass for an LSTMClassifier (many-to-one).
This backprops through the classifier head, then runs an LSTM sequence backward pass where the hidden-state gradient is zero at all timesteps except the last.
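In symbols, with dlogits the upstream gradient at the classifier output:

$$
dh_t =
\begin{cases}
W_{\mathrm{cls}}^{\top}\, d\mathrm{logits} & t = T, \\
0 & t < T,
\end{cases}
\qquad
\frac{\partial L}{\partial W_{\mathrm{cls}}} = d\mathrm{logits}\; h_T^{\top}, \qquad
\frac{\partial L}{\partial b_{\mathrm{cls}}} = d\mathrm{logits}.
$$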
Forward pass for an LSTMGenerator (many-to-many).
This applies an embedding linear map to each token vector, runs the LSTM, and projects each hidden state back into vocabulary space.
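Per timestep, writing (E, b_E) for the embedding LinearSpec and (P, b_P) for output_projection (pure notation, not field names):

$$
e_t = E\, x_t + b_E, \qquad
(h_t, c_t) = \mathrm{cell}(e_t, (h_{t-1}, c_{t-1})), \qquad
\mathrm{logits}_t = P\, h_t + b_P .
$$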
Forward pass for a bidirectional LSTM model (time-major).
This runs a forward LSTM on the sequence, a backward LSTM on the reversed sequence, concatenates the two hidden streams per timestep, and applies an output head.
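Per timestep the head therefore sees (assuming the reversed stream is re-aligned so that $\overleftarrow{h}_t$ corresponds to input position t):

$$
y_t = W_{\mathrm{out}}\, [\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,] + b_{\mathrm{out}} .
$$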
Forward pass for a MultiLayerLSTMModel (stacked LSTM layers).
This runs the first layer on the input sequence, then threads the resulting hidden stream through each additional hidden layer, and finally applies the output head per timestep.
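In symbols, with $h^{(0)}_t = x_t$:

$$
h^{(\ell)}_t = \mathrm{LSTM}_\ell\big(h^{(\ell-1)}\big)_t \quad (\ell = 1, \dots, L), \qquad
y_t = W_{\mathrm{out}}\, h^{(L)}_t + b_{\mathrm{out}} .
$$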
Forward pass for LSTMLanguageModel (teacher forcing, time-major).
This runs the embedding, then a stack of LSTM layers with provided initial states, applies
deterministic dropout scaling (dropout_inference_spec), and projects to vocabulary logits.
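Assuming dropout_inference_spec uses the standard inference-time convention of scaling activations by the keep probability (1 − dropout_rate), the per-timestep pipeline is:

$$
e_t = E\, x_t + b_E, \qquad
h^{(L)}_t = \big(\mathrm{LSTM}_L \circ \cdots \circ \mathrm{LSTM}_1\big)(e)_t, \qquad
\mathrm{logits}_t = P\,\big((1 - p)\, h^{(L)}_t\big) + b_P,
$$

with p = dropout_rate.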
Attention-style LSTM model bundle.
This record defines the parameters for an encoder/decoder LSTM with learned attention scores. Forward passes can choose additive, dot-product, or domain-specific attention semantics while sharing this typed parameter bundle.
- encoder_lstm : LSTMSpec α inputSize hiddenSize
The encoder LSTM run over the input sequence.
- decoder_lstm
The decoder LSTM.
- attention_weights : LinearSpec α (hiddenSize + hiddenSize) 1
The learned map from a concatenated [decoder hidden; encoder hidden] pair to a scalar attention score.
- output_layer : LinearSpec α hiddenSize outputSize
The linear output head.
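One concrete reading of this parameterization (the forward pass fixes the actual semantics; this sketch assumes concat-scoring attention over encoder states): with (w_a, b_a) the attention_weights LinearSpec,

$$
s_t = w_a^{\top} [\, h^{\mathrm{dec}} \,;\, h^{\mathrm{enc}}_t \,] + b_a, \qquad
\alpha = \mathrm{softmax}(s), \qquad
\mathrm{context} = \sum_t \alpha_t\, h^{\mathrm{enc}}_t .
$$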
Package SimpleLSTMModel as an NNModuleSpec.
This is used to plug the spec model into the common module pipeline. The export_func.toPyTorch
field is documentation-oriented and indicates the intended PyTorch analogue.
Package LSTMClassifier as an NNModuleSpec.
PyTorch analogue: nn.LSTM feeding a nn.Linear classifier head.
Package BiLSTMModel as an NNModuleSpec.
PyTorch analogue: nn.LSTM(..., bidirectional=True) feeding a per-timestep linear head.