Elman RNN Cell VJP #
This file proves the VJP (vector-Jacobian product) theorem for the core differentiable cell used
by a vanilla tanh RNN: h' = tanh(W [x; h] + b).
The theorem is deliberately cell-level. Runtime sequence layers unroll this cell over time and
scatter the hidden states into an output sequence; the full BPTT theorem is the induction over that
unroll plus the existing gather/scatter adjoint facts. We prove the cell first because it is the
right reusable grain size: vector concatenation, affine maps, and smooth elementwise tanh.
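Concretely, writing z for the affine preactivation and g for the upstream cotangent on h', the
paper-level content of the VJP is the following (notation ours, not the library's):

$$
z = W \begin{bmatrix} x \\ h \end{bmatrix} + b, \qquad
h' = \tanh(z), \qquad
\bar z = g \odot (1 - h' \odot h'), \qquad
\begin{bmatrix} \bar x \\ \bar h \end{bmatrix} = W^\top \bar z .
$$

The three factors line up exactly with the ingredients named above: the adjoint of concatenation
splits $W^\top \bar z$ back into the x and h components, the affine map contributes $W^\top$, and
elementwise tanh contributes the $1 - h' \odot h'$ mask.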
References:
- Elman, "Finding Structure in Time", Cognitive Science 1990.
- PyTorch torch.nn.RNN: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
Input vector shape for one RNN step.
Hidden vector shape for one RNN step.
Context for a one-step Elman cell: current input and previous hidden state.
Saved tensors: concatenated [x; h], affine preactivation, and next hidden state.
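A brief note on why the next hidden state earns a slot here (an observation about the mathematics,
not about the library's storage policy): by the identity

$$
\tanh'(z) = 1 - \tanh^2(z) = 1 - h' \odot h',
$$

the backward pass can form the tanh adjoint from h' alone, with no re-evaluation of tanh at z.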
Current input index.
Previous hidden-state index.
Index helper for the most recently appended tensor.
Index of the concatenated [x; h] vector.
Index of the affine preactivation.
Instances For
Proof-carrying graph for one Elman RNN cell.
The affine map is represented by a fixed LinearSpec, so the VJP theorem below covers gradients
with respect to the cell inputs (x, h) only. Parameter-gradient theorems are a separate layer over
the trainable runtime parameter list.
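For orientation, the parameter cotangents such a separate layer would establish are the standard
ones from matrix calculus (stated here for contrast only; this file does not prove them):

$$
\bar W = \bar z \begin{bmatrix} x \\ h \end{bmatrix}^\top, \qquad
\bar b = \bar z, \qquad
\text{where } \bar z = g \odot (1 - h' \odot h').
$$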
End-to-end VJP theorem for one vanilla RNN cell.
This is the recurrent analogue of the attention block theorems: the graph-level reverse pass equals the adjoint of the Fréchet derivative of the cell evaluation function.
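Spelled out in inner-product form (notation ours), with f(x, h) = tanh(W [x; h] + b) the cell
evaluation and backward the graph-level reverse pass, the theorem asserts that for every cotangent
g and every perturbation (dx, dh):

$$
\langle \mathrm{backward}(g), (dx, dh) \rangle
  = \langle g, \, Df(x, h)(dx, dh) \rangle,
\qquad \text{equivalently} \qquad
\mathrm{backward}(g) = (Df(x, h))^{*}\, g .
$$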
Forward evaluation of one Elman cell is differentiable at every input context.
This is the recurrent analogue of the Transformer sublayer calculus bridges: it exposes the cell as a differentiable map that can be composed repeatedly when proving BPTT for an unrolled RNN.
Two-step recurrent composition bridge.
Suppose firstCtx builds the first cell context from some outer state E, and secondCtx builds
the next cell context from the evaluated first-cell graph (for example by selecting the next input
and the hidden state produced by the first step). If both context builders are differentiable, then
the two-cell unroll is differentiable.
The theorem is intentionally abstract over secondCtx: different sequence layouts store inputs,
hidden states, and caches differently, but every vanilla RNN BPTT proof follows this same chain-rule
shape.
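Schematically (symbols ours), writing eval for the cell evaluation map, the two-step unroll is the
four-fold composite below, and both its differentiability and its derivative fall out of the chain
rule:

$$
F = \mathrm{eval} \circ \mathrm{secondCtx} \circ \mathrm{eval} \circ \mathrm{firstCtx},
\qquad
DF(E) = D\mathrm{eval} \circ D\mathrm{secondCtx} \circ D\mathrm{eval} \circ D\mathrm{firstCtx},
$$

with each factor taken at the image of E under the preceding maps. A full BPTT proof for a
length-T unroll is the induction over this same shape.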