Core Reinforcement-Learning Runtime Helpers #
This module adds the tensor-shaped and runtime-facing pieces that sit on top of the mathematical
RL core in NN.Spec.RL.Core.
Keeping Bellman / return / GAE definitions in the spec layer avoids an awkward split where the same mathematics would otherwise live in both the runtime and proof namespaces. This file therefore keeps only:
- typed transition records for tensor-valued or indexed rollouts,
- small action-encoding helpers,
- and scalar losses commonly used by deep RL objectives.
A typed one-step transition for discrete-action RL over a tensor-valued state.
- state : Spec.Tensor α σ
  Current state s_t.
- action : Fin nActions
  Discrete action a_t.
- reward : α
  Reward r_t.
- nextState : Spec.Tensor α σ
  Next state s_{t+1}.
- done : Bool
  Episode termination flag.
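As a rough illustration, here is a self-contained Lean sketch of such a record. `SimpleTransition` is a hypothetical stand-in: it uses a plain `Array Float` where the real structure stores a typed `Spec.Tensor α σ`.

```lean
-- Hypothetical stand-in for the typed transition record: `Array Float`
-- replaces `Spec.Tensor α σ`; the fields mirror the list above.
structure SimpleTransition (nActions : Nat) where
  state     : Array Float   -- current state s_t
  action    : Fin nActions  -- discrete action a_t
  reward    : Float         -- reward r_t
  nextState : Array Float   -- next state s_{t+1}
  done      : Bool          -- episode termination flag

-- A single step in a 2-action environment.
def exampleStep : SimpleTransition 2 :=
  { state := #[0.0, 1.0], action := 1, reward := 0.5,
    nextState := #[1.0, 0.0], done := false }
```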
Build a typed one-hot action vector.
This is an alias for NN.Tensor.oneHot, kept under the RL namespace for ergonomics.
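A minimal sketch of the encoding, assuming a plain `Array Float` output instead of the typed vector tensor produced by `NN.Tensor.oneHot`:

```lean
-- One-hot encode a discrete action as a plain array (sketch only; the real
-- helper returns a typed vector tensor).
def oneHotSketch (nActions : Nat) (a : Fin nActions) : Array Float :=
  (Array.range nActions).map (fun i => if i = a.val then 1.0 else 0.0)

#eval oneHotSketch 4 2  -- #[0.0, 0.0, 1.0, 0.0]
```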
Fixed-Horizon Tensor Trajectory Helpers #
Many RL implementations store a rollout buffer in fixed-size tensors (e.g. T × ... for a time
window, or N × ... for a batch). The spec-layer definitions in NN.Spec.RL.Core are list-based
because episode length is data-dependent, but it is still useful to have “PyTorch-shaped”
fixed-horizon variants when:
- you have already committed to a horizon n, and
- you want to keep the data in typed tensors all the way through your update step.
The helpers below are the tensor analogues of:
- discountedReturnsFrom / discountedReturns / discountedReturnsDone,
- generalizedAdvantageEstimation,
- returnsFromAdvantages.

They are total and use an internal array-backed right-to-left scan (no list conversion), while the public API stays tensor-shaped.
References:
- Sutton and Barto, Reinforcement Learning: An Introduction, Section 3 (returns) and 12 (actor-critic).
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Fixed-horizon discounted returns with a bootstrap on the far right:
G_t = r_t + γ G_{t+1}.
This is the tensor-shaped sibling of Spec.RL.discountedReturnsFrom.
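The recursion is easiest to see over plain arrays. The sketch below is a hypothetical, array-only rendering of the same right-to-left scan; the actual helper takes and returns typed vector tensors.

```lean
-- Right-to-left scan computing G_t = r_t + γ·G_{t+1}, seeded with the
-- bootstrap value on the far right (sketch over plain arrays).
def discountedReturnsFromSketch (gamma bootstrap : Float)
    (rewards : Array Float) : Array Float := Id.run do
  let n := rewards.size
  let mut out := Array.mkArray n 0.0
  let mut g := bootstrap
  for i in [0:n] do
    let t := n - 1 - i        -- walk from the last step back to the first
    g := rewards[t]! + gamma * g
    out := out.set! t g
  return out

-- γ = 0.5, bootstrap 0, three unit rewards: #[1.75, 1.5, 1.0]
#eval discountedReturnsFromSketch 0.5 0.0 #[1.0, 1.0, 1.0]
```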
Fixed-horizon discounted returns for a terminal trajectory (bootstrap defaults to 0).
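In terms of the plain-array sketch above, this terminal-trajectory variant is simply the bootstrap := 0.0 case, which is what the #eval line already demonstrates.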
Fixed-horizon discounted returns with explicit termination markers.
When done_t = true, the future return is zeroed before it is discounted and added to the current reward, i.e. G_t = r_t + γ (1 - done_t) G_{t+1}.
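Again as an array-only sketch (the real helper is tensor-shaped), with the zeroing of the future return made explicit:

```lean
-- Done-aware scan: when done_t is true the future return is zeroed before
-- being discounted, i.e. G_t = r_t + γ·(1 - done_t)·G_{t+1}.
def discountedReturnsDoneSketch (gamma bootstrap : Float)
    (rewards : Array Float) (dones : Array Bool) : Array Float := Id.run do
  let n := rewards.size
  let mut out := Array.mkArray n 0.0
  let mut g := bootstrap
  for i in [0:n] do
    let t := n - 1 - i
    let future := if dones[t]! then 0.0 else g
    g := rewards[t]! + gamma * future
    out := out.set! t g
  return out

-- The episode ends at step 1, so step 0 never sees step 2's reward:
-- #[1.5, 1.0, 1.0] with γ = 0.5.
#eval discountedReturnsDoneSketch 0.5 0.0 #[1.0, 1.0, 1.0] #[false, true, false]
```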
Fixed-horizon Generalized Advantage Estimation (GAE) as a vector tensor.
This is the tensor-shaped sibling of Spec.RL.generalizedAdvantageEstimation.
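A hypothetical array-only sketch of the same computation (the library version works on typed tensors and may order its arguments differently); here `values` carries one extra trailing entry, the bootstrap value V_T:

```lean
-- GAE over plain arrays (Schulman et al. 2015):
--   δ_t = r_t + γ·(1 - done_t)·V_{t+1} - V_t
--   A_t = δ_t + γ·λ·(1 - done_t)·A_{t+1}
def gaeSketch (gamma lam : Float) (rewards values : Array Float)
    (dones : Array Bool) : Array Float := Id.run do
  let n := rewards.size
  let mut adv := Array.mkArray n 0.0
  let mut a := 0.0
  for i in [0:n] do
    let t := n - 1 - i
    let notDone := if dones[t]! then 0.0 else 1.0
    let delta := rewards[t]! + gamma * notDone * values[t + 1]! - values[t]!
    a := delta + gamma * lam * notDone * a
    adv := adv.set! t a
  return adv
```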
Fixed-horizon lambda-returns from advantages and baseline values via R_t = A_t + V_t.
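As a plain-array sketch, this is nothing more than an elementwise sum of the GAE output and the value baseline; combined with gaeSketch above, it yields the usual critic regression targets.

```lean
-- λ-returns recovered from advantages and baseline values: R_t = A_t + V_t.
def returnsFromAdvantagesSketch (advantages values : Array Float) : Array Float :=
  (Array.range advantages.size).map (fun t => advantages[t]! + values[t]!)
```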
Squared-error helper used by critic / TD objectives.
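A scalar sketch, assuming Float values (the library helper is generic over the element type):

```lean
-- Squared error between a prediction and its critic / TD target.
def squaredErrorSketch (pred target : Float) : Float :=
  (pred - target) * (pred - target)
```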
Scalar Huber-style loss used by robust TD objectives.
We use the standard piecewise form:
- quadratic region (|pred - target| ≤ delta): (1 / (2 * delta)) * (pred - target)^2
- linear region (|pred - target| > delta): |pred - target| - delta / 2
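A scalar Float sketch of the piecewise form above (the real helper is generic over the element type, and the name here is hypothetical):

```lean
-- Huber-style loss: quadratic inside the delta band, linear outside,
-- continuous at |pred - target| = delta.
def huberSketch (delta pred target : Float) : Float :=
  let diff := pred - target
  if Float.abs diff ≤ delta then
    (1.0 / (2.0 * delta)) * diff * diff
  else
    Float.abs diff - delta / 2.0

#eval huberSketch 1.0 0.5 0.0  -- quadratic region: 0.125
#eval huberSketch 1.0 3.0 0.0  -- linear region: 2.5
```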