TorchLean API

NN.Runtime.RL.PPO.Rollout

PPO Rollouts (Discrete Actions) #

This file defines:

• Fixed-horizon batch-shape abbreviations for PPO minibatches (StateBatchShape, LogitsBatchShape, and their scalar/column analogues).
• Step and Rollout, a typed per-step record and a fixed-horizon rollout buffer.
• Rollout.toActorCriticSample, which converts a rollout into a PPO actor-critic minibatch.

The goal is not to hide PPO’s math: the GAE/return definitions live in NN.Spec.RL.Core and the tensor-shaped analogues live in NN.Runtime.RL.Core. This file is the typed rollout layer for PPO training loops.
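For orientation, the GAE(γ, λ) recurrence that those files formalize is the standard one. A minimal standalone sketch over plain Float arrays (names and layout here are illustrative, not the library's API):

```lean
-- Illustrative sketch only, not the library's API.
--   δ_t = r_t + γ·(1 - done_t)·V(s_{t+1}) - V(s_t)
--   A_t = δ_t + γ·λ·(1 - done_t)·A_{t+1}
def gaeSketch (gamma lam : Float)
    (rewards values nextValues : Array Float) (dones : Array Bool) :
    Array Float := Id.run do
  let T := rewards.size
  let mut adv : Array Float := Array.mkArray T 0.0
  let mut lastAdv := 0.0
  -- sweep backwards so each A_t can use the already-computed A_{t+1}
  for t in [0:T] do
    let i := T - 1 - t
    let mask := if dones[i]! then 0.0 else 1.0  -- cut bootstrap at episode ends
    let delta := rewards[i]! + gamma * mask * nextValues[i]! - values[i]!
    lastAdv := delta + gamma * lam * mask * lastAdv
    adv := adv.set! i lastAdv
  return adv
```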

References:

Shapes #

For a fixed horizon T, PPO minibatches are typically stored in "PyTorch-shaped" tensors:

@[reducible, inline]
abbrev Runtime.RL.PPO.StateBatchShape (horizon : ℕ) (obsShape : Spec.Shape) : Spec.Shape

Batch shape for a fixed-horizon sequence of observations: horizon × obsShape.

Instances For
    @[reducible, inline]
    abbrev Runtime.RL.PPO.LogitsBatchShape (horizon nActions : ℕ) : Spec.Shape

    Batch shape for a fixed-horizon sequence of action logits: horizon × nActions.

    Instances For
      @[reducible, inline]

      Batch shape for a fixed-horizon sequence of scalars: horizon.

      Instances For
        @[reducible, inline]

        Batch shape for a fixed-horizon sequence of scalar values stored as a column: horizon × 1.

        Instances For
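To make the four batch shapes concrete, a hedged illustration (the numeric values are made up, and it is assumed that a Spec.Shape is a list of dimension sizes):

```lean
-- Illustrative only: with horizon = 128, obsShape = [4] (a CartPole-style
-- observation), and nActions = 2, the four batch shapes would be:
--   StateBatchShape  128 [4] = [128, 4]   -- observations
--   LogitsBatchShape 128 2   = [128, 2]   -- action logits
--   scalar batch shape       = [128]      -- e.g. rewards, log-probs
--   column batch shape       = [128, 1]   -- e.g. value predictions
```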

          Rollouts #

          structure Runtime.RL.PPO.Step (α : Type) (obsShape : Spec.Shape) (nActions : ℕ) : Type

          One fixed-horizon PPO step record.

          This is the “typed parallel arrays” data layout commonly used in PPO implementations, but kept as a single record so downstream code cannot accidentally desynchronize fields.

          • state : Spec.Tensor α obsShape

            Observation s_t (already cast into the training scalar backend).

          • action : Fin nActions

            Sampled action a_t.

          • oldLogProb : α

            Log-probability log π_old(a_t | s_t) under the behavior policy.

          • reward : α

            Reward r_t.

          • done : Bool

            Episode boundary marker (Gym-style terminated || truncated).

          • value : α

            Baseline value prediction V(s_t).

          • nextValue : α

            Bootstrap value prediction V(s_{t+1}) (before any auto-reset).

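As an illustration of how these fields fit together, a hedged sketch (the helper name is hypothetical, not part of the library) of the one-step TD residual that GAE is built from:

```lean
-- Hypothetical helper, not part of the library:
--   δ_t = r_t + γ·(1 - done_t)·V(s_{t+1}) - V(s_t)
-- The done flag cuts the bootstrap term at episode boundaries.
def Step.tdError {obsShape : Spec.Shape} {nActions : ℕ}
    (gamma : Float) (s : Step Float obsShape nActions) : Float :=
  let mask := if s.done then 0.0 else 1.0
  s.reward + gamma * mask * s.nextValue - s.value
```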
          Instances For
            structure Runtime.RL.PPO.Rollout (α : Type) (obsShape : Spec.Shape) (nActions horizon : ℕ) : Type

            Fixed-horizon rollout buffer for PPO.

            The steps_size_eq_horizon field records the invariant that the buffer has exactly horizon steps; this lets downstream tensor conversion be total without runtime bounds checks.

            • steps : Array (Step α obsShape nActions)
            • steps_size_eq_horizon : self.steps.size = horizon

              Invariant: fixed-horizon rollouts always have exactly horizon steps.

            Instances For
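A hedged sketch of why the invariant matters: it lets an index of type Fin horizon be transported to an index into steps with no runtime bounds check (the function name is hypothetical):

```lean
-- Hypothetical helper: index a rollout by Fin horizon without a runtime
-- bounds check, using steps_size_eq_horizon to justify the access.
def Rollout.get {α : Type} {obsShape : Spec.Shape} {nActions horizon : ℕ}
    (r : Rollout α obsShape nActions horizon) (i : Fin horizon) :
    Step α obsShape nActions :=
  r.steps[i.val]'(r.steps_size_eq_horizon.symm ▸ i.isLt)
```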
              def Runtime.RL.PPO.Rollout.toActorCriticSample {α : Type} [Context α] {obsShape : Spec.Shape} {nActions horizon : ℕ} [Fact (0 < horizon)] [Fact (0 < nActions)] (gamma lam : α) (r : Rollout α obsShape nActions horizon) :

              Convert a fixed-horizon rollout into the PPO minibatch expected by Autograd.ppoActorCriticScalarModuleDef.

              Notes:

              • Advantages are normalized (z-score) for the policy-gradient term, a common PPO variance-reduction practice. Value targets (lambda-returns) are computed from the unnormalized advantages.
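The z-score normalization mentioned above, as a hedged standalone sketch (the function name and the ε value are illustrative):

```lean
-- Illustrative sketch of z-score advantage normalization:
--   Â_t = (A_t - mean A) / (std A + ε)
-- ε guards against division by zero when all advantages are equal.
def normalizeAdvantages (adv : Array Float) (eps : Float := 1e-8) :
    Array Float :=
  let n := Float.ofNat adv.size
  let mean := (adv.foldl (· + ·) 0.0) / n
  let var := (adv.foldl (fun acc a => acc + (a - mean) * (a - mean)) 0.0) / n
  let std := Float.sqrt var
  adv.map (fun a => (a - mean) / (std + eps))
```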
              Instances For