Core Reinforcement-Learning Runtime Helpers #
This module adds the tensor-shaped and runtime-facing pieces that sit on top of the mathematical
RL core in NN.Spec.RL.Core.
Keeping Bellman / return / GAE definitions in the spec layer avoids an awkward split where the same mathematics would otherwise live in both the runtime and proof namespaces. This file therefore keeps only:
- typed transition records for tensor-valued or indexed rollouts,
- small action-encoding helpers,
- and scalar losses commonly used by deep RL objectives.
A typed one-step transition for discrete-action RL over a tensor-valued state.
- state : Spec.Tensor α σ
  Current state s_t.
- action : Fin nActions
  Discrete action a_t.
- reward : α
  Reward r_t.
- nextState : Spec.Tensor α σ
  Next state s_{t+1}.
- done : Bool
  Episode termination flag.
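As a rough illustration, here is a self-contained Lean sketch of such a record. `SimpleTransition` is a hypothetical stand-in: it uses a plain `Array Float` where the real structure stores a typed `Spec.Tensor α σ`.

```lean
-- Hypothetical stand-in for the typed transition record: `Array Float`
-- replaces `Spec.Tensor α σ`; the fields mirror the list above.
structure SimpleTransition (nActions : Nat) where
  state     : Array Float   -- current state s_t
  action    : Fin nActions  -- discrete action a_t
  reward    : Float         -- reward r_t
  nextState : Array Float   -- next state s_{t+1}
  done      : Bool          -- episode termination flag

-- A single step in a 2-action environment.
def exampleStep : SimpleTransition 2 :=
  { state := #[0.0, 1.0], action := 1, reward := 0.5,
    nextState := #[1.0, 0.0], done := false }
```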
Build a typed one-hot action vector.
This is an alias for NN.Tensor.oneHot, kept under the RL namespace for ergonomics.
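A minimal sketch of the encoding, assuming a plain `Array Float` output instead of the typed vector tensor produced by `NN.Tensor.oneHot`:

```lean
-- One-hot encode a discrete action as a plain array (sketch only; the real
-- helper returns a typed vector tensor).
def oneHotSketch (nActions : Nat) (a : Fin nActions) : Array Float :=
  (Array.range nActions).map (fun i => if i = a.val then 1.0 else 0.0)

#eval oneHotSketch 4 2  -- #[0.0, 0.0, 1.0, 0.0]
```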
Fixed-Horizon Tensor Trajectory Helpers #
Many RL implementations store a rollout buffer in fixed-size tensors (e.g. T × ... for a time
window, or N × ... for a batch). The spec-layer definitions in NN.Spec.RL.Core are list-based
because episode length is data-dependent, but it is still useful to have “PyTorch-shaped”
fixed-horizon variants when:
- you have already committed to a horizon n, and
- you want to keep the data in typed tensors all the way through your update step.
The helpers below are the tensor analogues of:
- discountedReturnsFrom / discountedReturns / discountedReturnsDone,
- generalizedAdvantageEstimation,
- returnsFromAdvantages.

They are total and use an internal array-backed right-to-left scan (no list conversion), while the public API stays tensor-shaped.
References:
- Sutton and Barto, Reinforcement Learning: An Introduction, Section 3 (returns) and 12 (actor-critic).
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Fixed-horizon discounted returns with a bootstrap on the far right:
G_t = r_t + γ G_{t+1}.
This is the tensor-shaped sibling of Spec.RL.discountedReturnsFrom.
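The recursion is easiest to see over plain arrays. The sketch below is a hypothetical, array-only rendering of the same right-to-left scan; the actual helper takes and returns typed vector tensors.

```lean
-- Right-to-left scan computing G_t = r_t + γ·G_{t+1}, seeded with the
-- bootstrap value on the far right (sketch over plain arrays).
def discountedReturnsFromSketch (gamma bootstrap : Float)
    (rewards : Array Float) : Array Float := Id.run do
  let n := rewards.size
  let mut out := Array.mkArray n 0.0
  let mut g := bootstrap
  for i in [0:n] do
    let t := n - 1 - i        -- walk from the last step back to the first
    g := rewards[t]! + gamma * g
    out := out.set! t g
  return out

-- γ = 0.5, bootstrap 0, three unit rewards: #[1.75, 1.5, 1.0]
#eval discountedReturnsFromSketch 0.5 0.0 #[1.0, 1.0, 1.0]
```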
Fixed-horizon discounted returns for a terminal trajectory (bootstrap defaults to 0).
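In terms of the plain-array sketch above, this terminal-trajectory variant is simply the bootstrap := 0.0 case, which is what the #eval line already demonstrates.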
Fixed-horizon discounted returns with explicit termination markers.
When done_t = true, the future return is zeroed before it is discounted and added to the current reward, i.e. G_t = r_t + γ (1 - done_t) G_{t+1}.
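Again as an array-only sketch (the real helper is tensor-shaped), with the zeroing of the future return made explicit:

```lean
-- Done-aware scan: when done_t is true the future return is zeroed before
-- being discounted, i.e. G_t = r_t + γ·(1 - done_t)·G_{t+1}.
def discountedReturnsDoneSketch (gamma bootstrap : Float)
    (rewards : Array Float) (dones : Array Bool) : Array Float := Id.run do
  let n := rewards.size
  let mut out := Array.mkArray n 0.0
  let mut g := bootstrap
  for i in [0:n] do
    let t := n - 1 - i
    let future := if dones[t]! then 0.0 else g
    g := rewards[t]! + gamma * future
    out := out.set! t g
  return out

-- The episode ends at step 1, so step 0 never sees step 2's reward:
-- #[1.5, 1.0, 1.0] with γ = 0.5.
#eval discountedReturnsDoneSketch 0.5 0.0 #[1.0, 1.0, 1.0] #[false, true, false]
```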
Fixed-horizon Generalized Advantage Estimation (GAE) as a vector tensor.
This is the tensor-shaped sibling of Spec.RL.generalizedAdvantageEstimation.
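A hypothetical array-only sketch of the same computation (the library version works on typed tensors and may order its arguments differently); here `values` carries one extra trailing entry, the bootstrap value V_T:

```lean
-- GAE over plain arrays (Schulman et al. 2015):
--   δ_t = r_t + γ·(1 - done_t)·V_{t+1} - V_t
--   A_t = δ_t + γ·λ·(1 - done_t)·A_{t+1}
def gaeSketch (gamma lam : Float) (rewards values : Array Float)
    (dones : Array Bool) : Array Float := Id.run do
  let n := rewards.size
  let mut adv := Array.mkArray n 0.0
  let mut a := 0.0
  for i in [0:n] do
    let t := n - 1 - i
    let notDone := if dones[t]! then 0.0 else 1.0
    let delta := rewards[t]! + gamma * notDone * values[t + 1]! - values[t]!
    a := delta + gamma * lam * notDone * a
    adv := adv.set! t a
  return adv
```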
Fixed-horizon lambda-returns from advantages and baseline values via R_t = A_t + V_t.
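As a plain-array sketch, this is nothing more than an elementwise sum of the GAE output and the value baseline; combined with gaeSketch above, it yields the usual critic regression targets.

```lean
-- λ-returns recovered from advantages and baseline values: R_t = A_t + V_t.
def returnsFromAdvantagesSketch (advantages values : Array Float) : Array Float :=
  (Array.range advantages.size).map (fun t => advantages[t]! + values[t]!)
```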
Squared-error helper used by critic / TD objectives.
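A scalar sketch, assuming Float values (the library helper is generic over the element type):

```lean
-- Squared error between a prediction and its critic / TD target.
def squaredErrorSketch (pred target : Float) : Float :=
  (pred - target) * (pred - target)
```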
Scalar Huber-style loss used by robust TD objectives.
We use the standard piecewise form:
- quadratic region (|pred - target| ≤ delta): (1 / (2 * delta)) * (pred - target)^2
- linear region (|pred - target| > delta): |pred - target| - delta / 2
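A scalar Float sketch of the piecewise form above (the real helper is generic over the element type, and the name here is hypothetical):

```lean
-- Huber-style loss: quadratic inside the delta band, linear outside,
-- continuous at |pred - target| = delta.
def huberSketch (delta pred target : Float) : Float :=
  let diff := pred - target
  if Float.abs diff ≤ delta then
    (1.0 / (2.0 * delta)) * diff * diff
  else
    Float.abs diff - delta / 2.0

#eval huberSketch 1.0 0.5 0.0  -- quadratic region: 0.125
#eval huberSketch 1.0 3.0 0.0  -- linear region: 2.5
```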