PPO Rollouts (Discrete Actions) #
This file defines:
- fixed-horizon PPO rollout records stored as typed tensors / arrays, and
- a conversion to the minibatch format expected by the PPO autograd loss module (Runtime.RL.PolicyGradient.Autograd.ppoActorCriticScalarModuleDef).
The goal is not to hide PPO’s math: the GAE/return definitions live in NN.Spec.RL.Core and the
tensor-shaped analogues live in NN.Runtime.RL.Core. This file is the typed rollout layer for
PPO training loops.
References:
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Shapes #
For a fixed horizon T, PPO minibatches are typically stored in "PyTorch-shaped" tensors:
- states : (T × obsShape)
- actionsOneHot : (T × nActions)
- oldLogProb : (T)
- advantages : (T)
- valueTarget : (T × 1)
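As a concrete, non-Lean illustration of these shapes, the following Python/NumPy sketch allocates one minibatch and one-hot encodes sampled actions. The sizes (T = 128, obs_dim = 4, n_actions = 2) and variable names are hypothetical; only the shape layout mirrors the list above.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
T, obs_dim, n_actions = 128, 4, 2

# "PyTorch-shaped" PPO minibatch: one leading time axis of length T.
states         = np.zeros((T, obs_dim), dtype=np.float32)    # (T × obsShape)
actions_onehot = np.zeros((T, n_actions), dtype=np.float32)  # (T × nActions)
old_log_prob   = np.zeros((T,), dtype=np.float32)            # (T)
advantages     = np.zeros((T,), dtype=np.float32)            # (T)
value_target   = np.zeros((T, 1), dtype=np.float32)          # (T × 1)

# One-hot encoding of sampled discrete actions a_t in {0, ..., n_actions - 1}:
# set exactly one entry per row, at column a_t.
actions = np.random.default_rng(0).integers(0, n_actions, size=T)
actions_onehot[np.arange(T), actions] = 1.0
```

Storing actions one-hot (rather than as indices) lets the loss module compute log-probabilities with a plain tensor contraction against the logits.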
Batch shape for a fixed-horizon sequence of observations: horizon × obsShape.
Batch shape for a fixed-horizon sequence of action logits: horizon × nActions.
Batch shape for a fixed-horizon sequence of scalars: horizon.
Batch shape for a fixed-horizon sequence of scalar values stored as a column: horizon × 1.
Rollouts #
One fixed-horizon PPO step record.
This is the “typed parallel arrays” data layout commonly used in PPO implementations, but kept as a single record so downstream code cannot accidentally desynchronize fields.
- state : Spec.Tensor α obsShape
  Observation s_t (already cast into the training scalar backend).
- action : Fin nActions
  Sampled action a_t.
- oldLogProb : α
  Log-probability log π_old(a_t | s_t) under the behavior policy.
- reward : α
  Reward r_t.
- done : Bool
  Episode boundary marker (Gym-style terminated || truncated).
- value : α
  Baseline value prediction V(s_t).
- nextValue : α
  Bootstrap value prediction V(s_{t+1}) (before any auto-reset).
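The GAE/return definitions themselves live in NN.Spec.RL.Core; as a hedged illustration of how these step fields feed the standard GAE(γ, λ) recursion, here is a Python sketch. The PPOStep class and gae_advantages function are illustrative names, not part of this library, and the done-masking convention shown is the usual one, not necessarily this file's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class PPOStep:
    # Mirrors the field list above (illustrative Python, not the Lean record).
    state: list[float]    # s_t
    action: int           # a_t
    old_log_prob: float   # log pi_old(a_t | s_t)
    reward: float         # r_t
    done: bool            # terminated || truncated
    value: float          # V(s_t)
    next_value: float     # V(s_{t+1}), before any auto-reset

def gae_advantages(steps, gamma=0.99, lam=0.95):
    """Standard GAE(gamma, lambda) over a fixed-horizon rollout:

        delta_t = r_t + gamma * (1 - done_t) * V(s_{t+1}) - V(s_t)
        A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}

    computed right-to-left; `done` zeroes both the bootstrap and the carry.
    """
    adv = [0.0] * len(steps)
    running = 0.0
    for t in reversed(range(len(steps))):
        s = steps[t]
        mask = 0.0 if s.done else 1.0
        delta = s.reward + gamma * mask * s.next_value - s.value
        running = delta + gamma * lam * mask * running
        adv[t] = running
    return adv
```

Keeping nextValue in the step record (rather than peeking at the next step's value) is what makes the recursion correct across auto-reset boundaries.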
Fixed-horizon rollout buffer for PPO.
The steps_size_eq_horizon field records the invariant that the buffer has exactly horizon
steps; this lets downstream tensor conversion be total without runtime bounds checks.
Invariant: fixed-horizon rollouts always have exactly horizon steps.
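A minimal Python analogue of this invariant, assuming construction is the only place lengths can go wrong; the class name PPORollout is hypothetical. Where the Lean record carries a proof steps_size_eq_horizon : steps.size = horizon, Python can only approximate it with a constructor-time check, after which consumers may index steps[t] for t < horizon without further bounds checks.

```python
class PPORollout:
    """Fixed-horizon rollout buffer (sketch of the Lean invariant in Python)."""

    def __init__(self, horizon: int, steps: list):
        # Runtime stand-in for the static proof `steps.size = horizon`.
        if len(steps) != horizon:
            raise ValueError(f"expected {horizon} steps, got {len(steps)}")
        self.horizon = horizon
        self.steps = steps
```

In the Lean version this check disappears: the proof term makes the tensor conversion total by construction.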
Convert a fixed-horizon rollout into the PPO minibatch expected by
Autograd.ppoActorCriticScalarModuleDef.
Notes:
- Advantages are normalized (z-score) for the policy-gradient term, a common PPO variance-reduction practice. Value targets (lambda-returns) are computed from the unnormalized advantages.
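A sketch of this convention in Python, using the standard GAE identity that the λ-return satisfies R_t^λ = V(s_t) + A_t with the unnormalized advantage A_t. The function name and signature are illustrative, not the library's actual conversion.

```python
import numpy as np

def to_minibatch_targets(advantages, values, eps=1e-8):
    """Split normalized vs. unnormalized use of advantages, as described above.

    - The policy-gradient term receives z-scored advantages:
        (A - mean(A)) / (std(A) + eps)
    - Value targets are lambda-returns, recovered from the *unnormalized*
      advantages as V(s_t) + A_t, so value regression is unaffected
      by the z-score.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    val = np.asarray(values, dtype=np.float64)
    norm_adv = (adv - adv.mean()) / (adv.std() + eps)
    value_target = val + adv  # lambda-return_t = V(s_t) + A_t
    return norm_adv, value_target
```

Normalizing per minibatch (rather than per rollout) is another common variant; the choice only affects the scale of the policy-gradient term, not the value targets.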