LSTM models (spec) #
Higher‑level LSTM architectures built from module specs (SpecChain), including:
- sequence‑to‑sequence outputs,
- classifier heads (many‑to‑one),
- multi‑layer compositions.
Cell equations are in NN/Spec/Layers/Lstm.lean; this file focuses on composing modules.
References (math + PyTorch behavior):
- Hochreiter and Schmidhuber (1997), "Long Short-Term Memory" (original LSTM): https://www.bioinf.jku.at/publications/older/2604.pdf
- PyTorch nn.LSTM docs: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- PyTorch nn.LSTMCell docs: https://pytorch.org/docs/stable/generated/torch.nn.LSTMCell.html
“Fully implemented” LSTM models #
The LSTM model layer exposes first-class model objects with:
- a forward pass,
- a standard training objective, and
- an explicit reverse-mode / BPTT backward pass producing parameter gradients.
This section provides those “full” APIs for the simple LSTM models in this file by reusing the
gate-aware BPTT implementation in NN/Spec/Layers/Lstm.lean.
Gradient records #
Gradients for a linear layer y = W x + b.
This is the natural gradient bundle for Spec.LinearSpec (PyTorch analogue: torch.nn.Linear),
with dW matching the weight shape [outDim, inDim] and db matching [outDim].
- dW : Tensor α (Shape.dim outDim (Shape.dim inDim Shape.scalar))
Gradient of the loss with respect to the weight matrix W.
- db : Tensor α (Shape.dim outDim Shape.scalar)
Gradient of the loss with respect to the bias b.
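For reference, with a single input x and upstream gradient g = ∂L/∂y, these fields hold the standard reverse-mode gradients of a linear layer (when the head is applied per timestep, the backward pass accumulates them over t):

$$
\frac{\partial L}{\partial W} = g\, x^{\top}, \qquad
\frac{\partial L}{\partial b} = g, \qquad
\frac{\partial L}{\partial x} = W^{\top} g .
$$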
Gate-wise gradients for an LSTM cell.
This matches the parameterization used by Spec.LSTMSpec (see NN/Spec/Layers/Lstm.lean): each
gate has a weight matrix of shape [hiddenSize, inputSize + hiddenSize] applied to the concatenated
[input; hidden] vector, plus a bias of shape [hiddenSize].
- d_forget_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the forget-gate weight matrix.
- d_forget_bias : HiddenVector α hiddenSize
Gradient with respect to the forget-gate bias.
- d_input_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the input-gate weight matrix.
- d_input_bias : HiddenVector α hiddenSize
Gradient with respect to the input-gate bias.
- d_candidate_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the candidate (cell-update) weight matrix.
- d_candidate_bias : HiddenVector α hiddenSize
Gradient with respect to the candidate (cell-update) bias.
- d_output_weights : WeightMatrix α hiddenSize (inputSize + hiddenSize)
Gradient with respect to the output-gate weight matrix.
- d_output_bias : HiddenVector α hiddenSize
Gradient with respect to the output-gate bias.
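For orientation, writing z_t = [x_t; h_{t-1}] for the concatenated vector, this parameterization corresponds to the standard LSTM cell equations:

$$
\begin{aligned}
f_t &= \sigma(W_f z_t + b_f), & i_t &= \sigma(W_i z_t + b_i), \\
\tilde{c}_t &= \tanh(W_c z_t + b_c), & o_t &= \sigma(W_o z_t + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, & h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

so each d_*_weights field has exactly the [hiddenSize, inputSize + hiddenSize] shape of its gate's weight matrix, and each d_*_bias the [hiddenSize] shape of its bias.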
Parameter gradients for SimpleLSTMModel.
This bundles the LSTM cell gradients and the time-distributed linear head gradients.
- lstm : LSTMGrads α inputSize hiddenSize
Gradients for the LSTM cell parameters.
- output_layer : LinearGrads α hiddenSize outputSize
Gradients for the time-distributed linear output head.
Sequence-to-sequence LSTM model as a SpecChain: LSTM over time, then a per-timestep linear head.
PyTorch analogue: nn.LSTM producing an output sequence, followed by nn.Linear applied at each
time step.
Many-to-one LSTM classifier as a SpecChain.
This runs an LSTM over the sequence and applies a linear classifier head to the final hidden state.
PyTorch analogue: nn.LSTM + nn.Linear, taking the last output/hidden.
Two-layer LSTM stack (sequence-to-sequence), followed by a per-timestep linear head.
The second LSTM consumes the hidden stream produced by the first.
Simple LSTM language-model pipeline as a SpecChain: embedding, LSTM core, and output projection.
In this spec layer we represent the embedding/projection as LinearSpecs (often used with one-hot
token vectors). PyTorch analogue: nn.Embedding (conceptually) + nn.LSTM + nn.Linear.
Bundle of parameters for a single-layer LSTM model with a linear output head.
This is a direct record representation (as opposed to the SpecChain representation above).
- lstm : LSTMSpec α inputSize hiddenSize
The LSTM cell parameters.
- output_layer : LinearSpec α hiddenSize outputSize
The linear output head applied to each hidden state.
Bundle of parameters for a multi-layer LSTM model with a linear output head.
The first layer consumes inputSize, and all subsequent layers consume hiddenSize.
- first_layer : LSTMSpec α inputSize hiddenSize
The first LSTM layer, consuming inputSize features.
- output_layer : LinearSpec α hiddenSize outputSize
The linear output head applied to the top hidden stream.
Bundle of parameters for a many-to-one LSTM classifier.
The classifier head is applied to the final hidden state.
- lstm : LSTMSpec α inputSize hiddenSize
The LSTM cell parameters.
- classifier : LinearSpec α hiddenSize numClasses
The classifier head applied to the final hidden state.
Bundle of parameters for a many-to-many LSTM generator (language-model style).
This includes an (embedding) linear map, a recurrent core, and an output projection back to vocabulary space.
- embedding : LinearSpec α vocabSize hiddenSize
The embedding map from (one-hot) token vectors into hidden features.
- lstm : LSTMSpec α hiddenSize hiddenSize
The recurrent core.
- output_projection : LinearSpec α hiddenSize vocabSize
The projection from hidden states back to vocabulary logits.
Bundle of parameters for a bidirectional LSTM model with an output head.
The head consumes the concatenation of forward and backward hidden states.
PyTorch analogue: nn.LSTM(..., bidirectional=True) plus a linear projection.
- forward_lstm : LSTMSpec α inputSize hiddenSize
The LSTM run over the sequence in forward time order.
- backward_lstm : LSTMSpec α inputSize hiddenSize
The LSTM run over the reversed sequence.
- output_layer : LinearSpec α (hiddenSize + hiddenSize) outputSize
The output head applied to the concatenated forward/backward hidden states.
Bundle of parameters for a stacked LSTM language model with deterministic dropout.
This model uses a list of LSTM layers (all with hiddenSize input/output) and applies a
dropout_inference_spec scaling step between the recurrent stack and the output projection.
- embedding : LinearSpec α vocabSize hiddenSize
The embedding map from (one-hot) token vectors into hidden features.
- lstm_layers : List (LSTMSpec α hiddenSize hiddenSize)
The stacked LSTM layers; each consumes and produces hiddenSize features.
- output_projection : LinearSpec α hiddenSize vocabSize
The projection from hidden states back to vocabulary logits.
- dropout_rate : α
The rate used by the dropout_inference_spec scaling step.
One-step forward pass for SimpleLSTMModel.
Given an input vector and the previous LSTM state (hidden, cell), compute (output, new_state).
PyTorch analogue: nn.LSTMCell step followed by a nn.Linear head.
Sequence forward pass for SimpleLSTMModel.
Runs the LSTM over all timesteps (time-major), applies the output head to each hidden state, and
returns (outputs, final_state).
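Concretely, with initial state (h_0, c_0) and time-major inputs x_1, …, x_T:

$$
(h_t, c_t) = \mathrm{cell}(x_t, (h_{t-1}, c_{t-1})), \qquad
y_t = W_{\mathrm{out}}\, h_t + b_{\mathrm{out}}, \qquad t = 1, \dots, T,
$$

returning the sequence (y_1, …, y_T) together with the final state (h_T, c_T).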
Backward pass (BPTT) for the simple LSTM sequence model #
This is the model-level analogue of Spec.lstm_sequence_backward_spec. The only extra work we do
here is to backprop through the per-timestep output projection and feed its gradient into the LSTM
sequence backward pass.
Backprop through the time-distributed linear head and produce hidden-state gradients.
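Because the head's parameters are shared across timesteps, their gradients accumulate over time. Given upstream gradients dy_t and the forward hidden states h_t, the standard expressions are:

$$
\frac{\partial L}{\partial W_{\mathrm{out}}} = \sum_{t=1}^{T} dy_t\, h_t^{\top}, \qquad
\frac{\partial L}{\partial b_{\mathrm{out}}} = \sum_{t=1}^{T} dy_t, \qquad
dh_t = W_{\mathrm{out}}^{\top}\, dy_t,
$$

and the dh_t stream is what gets fed into the LSTM sequence backward pass.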
Backward pass for simple_lstm_sequence_forward.
Returns:
- parameter gradients (SimpleLSTMModelGrads),
- the gradient w.r.t. the input sequence (dInputs), and
- the gradient w.r.t. the initial recurrent state (dInitialState).
MSE loss for the simple LSTM sequence model.
This runs simple_lstm_sequence_forward and compares the predicted output sequence against
targets using mse_spec.
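With predictions y_t and targets y*_t this is the usual squared-error objective; whether mse_spec averages or sums is that spec's convention, so the normalization Z below is left abstract:

$$
L = \frac{1}{Z} \sum_{t=1}^{T} \lVert y_t - y^{*}_t \rVert^2, \qquad
\frac{\partial L}{\partial y_t} = \frac{2}{Z}\,(y_t - y^{*}_t).
$$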
Compute (loss, grads) for the simple LSTM sequence model under MSE.
This is the “full training API” building block: an optimizer (SGD/Adam) can consume these grads.
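As a shape-of-the-API sketch only: Params, Grads, lossAndGrads, and sgdStep below are toy stand-ins for SimpleLSTMModel, SimpleLSTMModelGrads, simple_lstm_loss_and_grads, and an SGD update, not definitions from this file. The point is the (loss, grads) → optimizer-step pattern.

```lean
-- Toy stand-ins: a scalar "model" w * x + b under squared error.
structure Params where
  w : Float
  b : Float

structure Grads where
  dw : Float
  db : Float

-- Stand-in for simple_lstm_loss_and_grads: forward pass, loss, and
-- analytic gradients in one call.
def lossAndGrads (p : Params) (x target : Float) : Float × Grads :=
  let pred := p.w * x + p.b
  let err := pred - target
  (err * err, { dw := 2 * err * x, db := 2 * err })

-- One SGD step: an optimizer only needs the gradient bundle.
def sgdStep (p : Params) (g : Grads) (lr : Float) : Params :=
  { w := p.w - lr * g.dw, b := p.b - lr * g.db }

#eval
  let p : Params := { w := 0, b := 0 }
  let (loss, g) := lossAndGrads p 1.0 2.0
  let p' := sgdStep p g 0.1
  (loss, p'.w, p'.b)  -- ≈ (4.0, 0.4, 0.4)
```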
Forward pass for an LSTMClassifier (many-to-one).
This uses the final hidden state of the LSTM sequence as the classifier input.
Backward for the classifier head (many-to-one) #
The classifier only consumes the final hidden state. We express that by feeding a gradient sequence that is zero everywhere except the last timestep.
Backward pass for an LSTMClassifier (many-to-one).
This backprops through the classifier head, then runs an LSTM sequence backward pass where the hidden-state gradient is zero at all timesteps except the last.
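In symbols, with dlogits the upstream gradient at the classifier output:

$$
dh_t =
\begin{cases}
W_{\mathrm{cls}}^{\top}\, d\mathrm{logits} & t = T, \\
0 & t < T,
\end{cases}
\qquad
\frac{\partial L}{\partial W_{\mathrm{cls}}} = d\mathrm{logits}\; h_T^{\top}, \qquad
\frac{\partial L}{\partial b_{\mathrm{cls}}} = d\mathrm{logits}.
$$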
Forward pass for an LSTMGenerator (many-to-many).
This applies an embedding linear map to each token vector, runs the LSTM, and projects each hidden state back into vocabulary space.
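Per timestep, writing (E, b_E) for the embedding LinearSpec and (P, b_P) for output_projection (pure notation, not field names):

$$
e_t = E\, x_t + b_E, \qquad
(h_t, c_t) = \mathrm{cell}(e_t, (h_{t-1}, c_{t-1})), \qquad
\mathrm{logits}_t = P\, h_t + b_P .
$$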
Forward pass for a bidirectional LSTM model (time-major).
This runs a forward LSTM on the sequence, a backward LSTM on the reversed sequence, concatenates the two hidden streams per timestep, and applies an output head.
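Per timestep the head therefore sees (assuming the reversed stream is re-aligned so that $\overleftarrow{h}_t$ corresponds to input position t):

$$
y_t = W_{\mathrm{out}}\, [\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,] + b_{\mathrm{out}} .
$$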
Forward pass for a MultiLayerLSTMModel (stacked LSTM layers).
This runs the first layer on the input sequence, then threads the resulting hidden stream through each additional hidden layer, and finally applies the output head per timestep.
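In symbols, with $h^{(0)}_t = x_t$:

$$
h^{(\ell)}_t = \mathrm{LSTM}_\ell\big(h^{(\ell-1)}\big)_t \quad (\ell = 1, \dots, L), \qquad
y_t = W_{\mathrm{out}}\, h^{(L)}_t + b_{\mathrm{out}} .
$$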
Forward pass for LSTMLanguageModel (teacher forcing, time-major).
This runs the embedding, then a stack of LSTM layers with provided initial states, applies
deterministic dropout scaling (dropout_inference_spec), and projects to vocabulary logits.
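Assuming dropout_inference_spec uses the standard inference-time convention of scaling activations by the keep probability (1 − dropout_rate), the per-timestep pipeline is:

$$
e_t = E\, x_t + b_E, \qquad
h^{(L)}_t = \big(\mathrm{LSTM}_L \circ \cdots \circ \mathrm{LSTM}_1\big)(e)_t, \qquad
\mathrm{logits}_t = P\,\big((1 - p)\, h^{(L)}_t\big) + b_P,
$$

with p = dropout_rate.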
Attention-style LSTM model bundle.
This record defines the parameters for an encoder/decoder LSTM with learned attention scores. Forward passes can choose additive, dot-product, or domain-specific attention semantics while sharing this typed parameter bundle.
- encoder_lstm : LSTMSpec α inputSize hiddenSize
The encoder LSTM run over the input sequence.
- decoder_lstm
The decoder LSTM.
- attention_weights : LinearSpec α (hiddenSize + hiddenSize) 1
The learned map from a concatenated [decoder hidden; encoder hidden] pair to a scalar attention score.
- output_layer : LinearSpec α hiddenSize outputSize
The linear output head.
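One concrete reading of this parameterization (the forward pass fixes the actual semantics; this sketch assumes concat-scoring attention over encoder states): with (w_a, b_a) the attention_weights LinearSpec,

$$
s_t = w_a^{\top} [\, h^{\mathrm{dec}} \,;\, h^{\mathrm{enc}}_t \,] + b_a, \qquad
\alpha = \mathrm{softmax}(s), \qquad
\mathrm{context} = \sum_t \alpha_t\, h^{\mathrm{enc}}_t .
$$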
Package SimpleLSTMModel as an NNModuleSpec.
This is used to plug the spec model into the common module pipeline. The export_func.toPyTorch
field is documentation-oriented and indicates the intended PyTorch analogue.
Package LSTMClassifier as an NNModuleSpec.
PyTorch analogue: nn.LSTM feeding a nn.Linear classifier head.
Package BiLSTMModel as an NNModuleSpec.
PyTorch analogue: nn.LSTM(..., bidirectional=True) feeding a per-timestep linear head.