PPO Rollouts (Discrete Actions) #
This file defines:
- fixed-horizon PPO rollout records stored as typed tensors / arrays, and
- a conversion to the minibatch format expected by the PPO autograd loss module (Runtime.RL.PolicyGradient.Autograd.ppoActorCriticScalarModuleDef).
The goal is not to hide PPO’s math: the GAE/return definitions live in NN.Spec.RL.Core and the
tensor-shaped analogues live in NN.Runtime.RL.Core. This file is the typed rollout layer for
PPO training loops.
References:
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Shapes #
For a fixed horizon T, PPO minibatches are typically stored in "PyTorch-shaped" tensors:
- states : (T × obsShape)
- actionsOneHot : (T × nActions)
- oldLogProb : (T)
- advantages : (T)
- valueTarget : (T × 1)
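As a concrete, non-Lean illustration of these shapes, the following Python/NumPy sketch allocates one minibatch and one-hot encodes sampled actions. The sizes (T = 128, obs_dim = 4, n_actions = 2) and variable names are hypothetical; only the shape layout mirrors the list above.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
T, obs_dim, n_actions = 128, 4, 2

# "PyTorch-shaped" PPO minibatch: one leading time axis of length T.
states         = np.zeros((T, obs_dim), dtype=np.float32)    # (T × obsShape)
actions_onehot = np.zeros((T, n_actions), dtype=np.float32)  # (T × nActions)
old_log_prob   = np.zeros((T,), dtype=np.float32)            # (T)
advantages     = np.zeros((T,), dtype=np.float32)            # (T)
value_target   = np.zeros((T, 1), dtype=np.float32)          # (T × 1)

# One-hot encoding of sampled discrete actions a_t in {0, ..., n_actions - 1}:
# set exactly one entry per row, at column a_t.
actions = np.random.default_rng(0).integers(0, n_actions, size=T)
actions_onehot[np.arange(T), actions] = 1.0
```

Storing actions one-hot (rather than as indices) lets the loss module compute log-probabilities with a plain tensor contraction against the logits.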
Batch shape for a fixed-horizon sequence of observations: horizon × obsShape.
Batch shape for a fixed-horizon sequence of action logits: horizon × nActions.
Batch shape for a fixed-horizon sequence of scalars: horizon.
Batch shape for a fixed-horizon sequence of scalar values stored as a column: horizon × 1.
Rollouts #
One fixed-horizon PPO step record.
This is the “typed parallel arrays” data layout commonly used in PPO implementations, but kept as a single record so downstream code cannot accidentally desynchronize fields.
- state : Spec.Tensor α obsShape
  Observation s_t (already cast into the training scalar backend).
- action : Fin nActions
  Sampled action a_t.
- oldLogProb : α
  Log-probability log π_old(a_t | s_t) under the behavior policy.
- reward : α
  Reward r_t.
- done : Bool
  Episode boundary marker (Gym-style terminated || truncated).
- value : α
  Baseline value prediction V(s_t).
- nextValue : α
  Bootstrap value prediction V(s_{t+1}) (before any auto-reset).
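The GAE/return definitions themselves live in NN.Spec.RL.Core; as a hedged illustration of how these step fields feed the standard GAE(γ, λ) recursion, here is a Python sketch. The PPOStep class and gae_advantages function are illustrative names, not part of this library, and the done-masking convention shown is the usual one, not necessarily this file's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class PPOStep:
    # Mirrors the field list above (illustrative Python, not the Lean record).
    state: list[float]    # s_t
    action: int           # a_t
    old_log_prob: float   # log pi_old(a_t | s_t)
    reward: float         # r_t
    done: bool            # terminated || truncated
    value: float          # V(s_t)
    next_value: float     # V(s_{t+1}), before any auto-reset

def gae_advantages(steps, gamma=0.99, lam=0.95):
    """Standard GAE(gamma, lambda) over a fixed-horizon rollout:

        delta_t = r_t + gamma * (1 - done_t) * V(s_{t+1}) - V(s_t)
        A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}

    computed right-to-left; `done` zeroes both the bootstrap and the carry.
    """
    adv = [0.0] * len(steps)
    running = 0.0
    for t in reversed(range(len(steps))):
        s = steps[t]
        mask = 0.0 if s.done else 1.0
        delta = s.reward + gamma * mask * s.next_value - s.value
        running = delta + gamma * lam * mask * running
        adv[t] = running
    return adv
```

Keeping nextValue in the step record (rather than peeking at the next step's value) is what makes the recursion correct across auto-reset boundaries.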
Fixed-horizon rollout buffer for PPO.
The steps_size_eq_horizon field records the invariant that the buffer has exactly horizon
steps; this lets downstream tensor conversion be total without runtime bounds checks.
Invariant: fixed-horizon rollouts always have exactly horizon steps.
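A minimal Python analogue of this invariant, assuming construction is the only place lengths can go wrong; the class name PPORollout is hypothetical. Where the Lean record carries a proof steps_size_eq_horizon : steps.size = horizon, Python can only approximate it with a constructor-time check, after which consumers may index steps[t] for t < horizon without further bounds checks.

```python
class PPORollout:
    """Fixed-horizon rollout buffer (sketch of the Lean invariant in Python)."""

    def __init__(self, horizon: int, steps: list):
        # Runtime stand-in for the static proof `steps.size = horizon`.
        if len(steps) != horizon:
            raise ValueError(f"expected {horizon} steps, got {len(steps)}")
        self.horizon = horizon
        self.steps = steps
```

In the Lean version this check disappears: the proof term makes the tensor conversion total by construction.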
Convert a fixed-horizon rollout into the PPO minibatch expected by
Autograd.ppoActorCriticScalarModuleDef.
Notes:
- Advantages are normalized (z-score) for the policy-gradient term, a common PPO variance-reduction practice. Value targets (lambda-returns) are computed from the unnormalized advantages.
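A sketch of this convention in Python, using the standard GAE identity that the λ-return satisfies R_t^λ = V(s_t) + A_t with the unnormalized advantage A_t. The function name and signature are illustrative, not the library's actual conversion.

```python
import numpy as np

def to_minibatch_targets(advantages, values, eps=1e-8):
    """Split normalized vs. unnormalized use of advantages, as described above.

    - The policy-gradient term receives z-scored advantages:
        (A - mean(A)) / (std(A) + eps)
    - Value targets are lambda-returns, recovered from the *unnormalized*
      advantages as V(s_t) + A_t, so value regression is unaffected
      by the z-score.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    val = np.asarray(values, dtype=np.float64)
    norm_adv = (adv - adv.mean()) / (adv.std() + eps)
    value_target = val + adv  # lambda-return_t = V(s_t) + A_t
    return norm_adv, value_target
```

Normalizing per minibatch (rather than per rollout) is another common variant; the choice only affects the scale of the policy-gradient term, not the value targets.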