Core Reinforcement-Learning Definitions #
This module collects the small mathematical definitions that sit underneath TorchLean's RL development.
These definitions are intentionally spec-level rather than runtime-level:
- Bellman-style backups,
- discounted returns,
- generalized advantage estimation (GAE),
- and simple typed rollout records.
That keeps the actual RL mathematics in a proof-friendly namespace and avoids duplicating it inside runtime/trainer code.
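For orientation, these are the standard textbook quantities (notation as in Sutton and Barto, and in Schulman et al. for GAE; none of this is TorchLean-specific): the discounted return, the one-step TD error that Bellman-style backups drive toward zero, and the GAE advantage.

$$
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
A_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}.
$$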
Why Lists (Not Tensors)? #
Several helpers here deliberately operate on List α rather than Tensor α (.dim n .scalar).
- A trajectory length is usually data-dependent (episode termination, truncation, variable rollout horizon), so a dependent tensor length is often the wrong abstraction.
- TorchLean uses typed tensors heavily for fixed-shape objects (value tables, Q-tables, logits, etc.). For variable-length traces, List is the lightweight, proof-friendly choice.
When you do have a fixed horizon n, it is reasonable to use Fin n → α or a vector tensor and
define specialized “returns/GAE” helpers on top. We keep the core definitions here compact and
general, and add fixed-horizon variants where they meaningfully improve downstream code.
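As an illustration of that pattern, here is a minimal sketch of what a fixed-horizon variant could look like. Everything below (the name fixedReturns, the Float specialization) is hypothetical, not a definition from this module:

```lean
/-- Hypothetical fixed-horizon variant: discounted returns `G t` with the
    horizon `n` fixed in the type. Folds the recursion
    `G t = r t + γ * G (t + 1)` from the back, then re-indexes. -/
def fixedReturns (γ : Float) {n : Nat} (r : Fin n → Float) : Fin n → Float :=
  let gs := (List.ofFn r).foldr
    (fun rt acc => (rt + γ * acc.headD 0.0) :: acc) ([] : List Float)
  fun t => gs.getD t.val 0.0
```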
Primary references:
- Sutton, "Learning to Predict by the Methods of Temporal Differences" (1988): https://doi.org/10.1023/A:1022633531479
- Watkins and Dayan, "Q-learning" (1992): https://doi.org/10.1007/BF00992698
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.): http://incompleteideas.net/book/the-book-2nd.html
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- TorchRL documentation (rollouts, tensordicts, and GAE-style objectives): https://pytorch.org/rl/
Small record used by generalized-advantage-estimation helpers.
- reward : α (immediate reward r_t)
- value : α (baseline / critic value estimate V(s_t))
- nextValue : α (bootstrap value V(s_{t+1}))
- done : Bool (episode termination flag)
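Rendered back into Lean, the record described above looks roughly like this; the structure name GaeStep is a stand-in, since the actual name is not shown on this page:

```lean
/-- Stand-in sketch of the record described above (actual name may differ). -/
structure GaeStep (α : Type) where
  reward    : α     -- immediate reward r_t
  value     : α     -- baseline / critic value estimate V(s_t)
  nextValue : α     -- bootstrap value V(s_{t+1})
  done      : Bool  -- episode termination flag
```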
Discounted returns with explicit termination markers.
When done = true, the future return is reset to zero before the current reward is added, so no value is bootstrapped across an episode boundary.
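A minimal sketch of the recursion this describes, assuming each step carries its reward and done flag as a pair (the name and signature below are hypothetical):

```lean
/-- Sketch: `G t = r t + γ * (if done then 0 else G (t + 1))`, so nothing is
    bootstrapped across an episode boundary. Returns one value per step. -/
def returnsWithDones (γ : Float) : List (Float × Bool) → List Float
  | [] => []
  | (r, done) :: rest =>
    let tail := returnsWithDones γ rest
    let future := if done then 0.0 else tail.headD 0.0
    (r + γ * future) :: tail
```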
Generalized Advantage Estimation (GAE).
Each input step provides r_t, V(s_t), V(s_{t+1}), and done_t. The resulting list contains
advantages in forward time order.
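A sketch of the corresponding backward recursion, reusing the GaeStep stand-in from above and specialized to Float (the name gaeSketch is hypothetical): the TD error is δ_t = r_t + γ·V(s_{t+1})·(1 − done_t) − V(s_t) and the advantage is A_t = δ_t + γλ·(1 − done_t)·A_{t+1}.

```lean
/-- Sketch: advantages in forward time order via the backward GAE recursion.
    `done` zeroes both the bootstrap term and the advantage carried back
    across an episode boundary. -/
def gaeSketch (γ lam : Float) : List (GaeStep Float) → List Float
  | [] => []
  | s :: rest =>
    let tail := gaeSketch γ lam rest
    let mask := if s.done then 0.0 else 1.0
    let δ := s.reward + γ * s.nextValue * mask - s.value
    (δ + γ * lam * mask * tail.headD 0.0) :: tail
```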