Core Reinforcement-Learning Definitions #
This module collects the small mathematical definitions that sit underneath TorchLean's RL development.
These definitions are intentionally spec-level rather than runtime-level:
- Bellman-style backups,
- discounted returns,
- generalized advantage estimation (GAE),
- and simple typed rollout records.
That keeps the actual RL mathematics in a proof-friendly namespace and avoids duplicating it inside runtime/trainer code.
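For orientation, these are the standard textbook quantities (notation as in Sutton and Barto, and in Schulman et al. for GAE; none of this is TorchLean-specific): the discounted return, the one-step TD error that Bellman-style backups drive toward zero, and the GAE advantage.

$$
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
A_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}.
$$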
Why Lists (Not Tensors)? #
Several helpers here deliberately operate on List α rather than Tensor α (.dim n .scalar).
- A trajectory length is usually data-dependent (episode termination, truncation, variable rollout horizon), so a dependent tensor length is often the wrong abstraction.
- TorchLean uses typed tensors heavily for fixed-shape objects (value tables, Q-tables, logits, etc.). For variable-length traces, List is the lightweight, proof-friendly choice.
When you do have a fixed horizon n, it is reasonable to use Fin n → α or a vector tensor and
define specialized “returns/GAE” helpers on top. We keep the core definitions here compact and
general, and add fixed-horizon variants where they meaningfully improve downstream code.
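As an illustration of that pattern, here is a minimal sketch of what a fixed-horizon variant could look like. Everything below (the name fixedReturns, the Float specialization) is hypothetical, not a definition from this module:

```lean
/-- Hypothetical fixed-horizon variant: discounted returns `G t` with the
    horizon `n` fixed in the type. Folds the recursion
    `G t = r t + γ * G (t + 1)` from the back, then re-indexes. -/
def fixedReturns (γ : Float) {n : Nat} (r : Fin n → Float) : Fin n → Float :=
  let gs := (List.ofFn r).foldr
    (fun rt acc => (rt + γ * acc.headD 0.0) :: acc) ([] : List Float)
  fun t => gs.getD t.val 0.0
```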
Primary references:
- Sutton, "Learning to Predict by the Methods of Temporal Differences" (1988): https://doi.org/10.1023/A:1022633531479
- Watkins and Dayan, "Q-learning" (1992): https://doi.org/10.1007/BF00992698
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.): http://incompleteideas.net/book/the-book-2nd.html
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- TorchRL documentation (rollouts, tensordicts, and GAE-style objectives): https://pytorch.org/rl/
Small record used by generalized-advantage-estimation helpers.
- reward : α (immediate reward r_t)
- value : α (baseline / critic value estimate V(s_t))
- nextValue : α (bootstrap value V(s_{t+1}))
- done : Bool (episode termination flag)
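Rendered back into Lean, the record described above looks roughly like this; the structure name GaeStep is a stand-in, since the actual name is not shown on this page:

```lean
/-- Stand-in sketch of the record described above (actual name may differ). -/
structure GaeStep (α : Type) where
  reward    : α     -- immediate reward r_t
  value     : α     -- baseline / critic value estimate V(s_t)
  nextValue : α     -- bootstrap value V(s_{t+1})
  done      : Bool  -- episode termination flag
```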
Discounted returns with explicit termination markers.
When done = true, the future return is reset to zero before the current reward is added, so no value is bootstrapped across an episode boundary.
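A minimal sketch of the recursion this describes, assuming each step carries its reward and done flag as a pair (the name and signature below are hypothetical):

```lean
/-- Sketch: `G t = r t + γ * (if done then 0 else G (t + 1))`, so nothing is
    bootstrapped across an episode boundary. Returns one value per step. -/
def returnsWithDones (γ : Float) : List (Float × Bool) → List Float
  | [] => []
  | (r, done) :: rest =>
    let tail := returnsWithDones γ rest
    let future := if done then 0.0 else tail.headD 0.0
    (r + γ * future) :: tail
```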
Generalized Advantage Estimation (GAE).
Each input step provides r_t, V(s_t), V(s_{t+1}), and done_t. The resulting list contains
advantages in forward time order.
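A sketch of the corresponding backward recursion, reusing the GaeStep stand-in from above and specialized to Float (the name gaeSketch is hypothetical): the TD error is δ_t = r_t + γ·V(s_{t+1})·(1 − done_t) − V(s_t) and the advantage is A_t = δ_t + γλ·(1 − done_t)·A_{t+1}.

```lean
/-- Sketch: advantages in forward time order via the backward GAE recursion.
    `done` zeroes both the bootstrap term and the advantage carried back
    across an episode boundary. -/
def gaeSketch (γ lam : Float) : List (GaeStep Float) → List Float
  | [] => []
  | s :: rest =>
    let tail := gaeSketch γ lam rest
    let mask := if s.done then 0.0 else 1.0
    let δ := s.reward + γ * s.nextValue * mask - s.value
    (δ + γ * lam * mask * tail.headD 0.0) :: tail
```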