PPO Rollout Collection (Checked Sessions)

This file provides the rollout-collection loop used by executable PPO workflows. The key goals are:

keep data collection typed and total (no “parallel arrays” that can desync),
enforce the trust-boundary contract on every step (external Gymnasium or Lean-native env), and
keep the API usable: callers should not need to thread a dozen actor/critic compilation details through every function call.
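The first goal can be illustrated with a minimal sketch. It assumes a simplified, hypothetical step record and rollout container (`StepRecord`, `ToyRollout`, and their fields are illustrative names, not the TorchLean `Rollout` definition) and uses only core Lean 4:

```lean
-- Hypothetical, simplified stand-ins; these are NOT the TorchLean definitions.

/-- One rollout step. Every per-step quantity lives in a single record,
so fields cannot drift out of sync the way parallel arrays can. -/
structure StepRecord (Obs : Type) (nActions : Nat) where
  obs        : Obs
  action     : Fin nActions
  value      : Float
  reward     : Float
  terminated : Bool
  truncated  : Bool

/-- A rollout that carries a proof that it holds exactly `horizon` steps. -/
structure ToyRollout (Obs : Type) (nActions horizon : Nat) where
  steps   : Array (StepRecord Obs nActions)
  size_eq : steps.size = horizon

/-- Total indexing: any `i : Fin horizon` is in bounds by construction,
so no `Option` or default value is needed. -/
def ToyRollout.get {Obs : Type} {nActions horizon : Nat}
    (r : ToyRollout Obs nActions horizon) (i : Fin horizon) :
    StepRecord Obs nActions :=
  r.steps[i.val]'(by rw [r.size_eq]; exact i.isLt)

/-- Sum of rewards over the whole rollout. -/
def ToyRollout.totalReward {Obs : Type} {nActions horizon : Nat}
    (r : ToyRollout Obs nActions horizon) : Float :=
  r.steps.foldl (fun acc s => acc + s.reward) 0.0
```

Because the horizon appears in the type, a consumer such as an advantage estimator can index every step without a bounds check or a fallback value.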

The unified session interface lives in NN.Runtime.RL.Session (Session.CheckedSession). The lower-level Gymnasium subprocess protocol is implemented in NN.Runtime.RL.Gymnasium.

References:

Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Gymnasium API docs (reset/step, terminated vs truncated): https://gymnasium.farama.org/

Rollout collection (ergonomic core API)

def Runtime.RL.PPO.collectRolloutSessionWith
    {α : Type} [Context α] {obsShape : Spec.Shape} {nActions horizon : ℕ} {Sess : Type}
    [Fact (0 < horizon)] [Fact (0 < nActions)]
    (start : IO Sess)
    (observe : Sess → Spec.Tensor Float obsShape)
    (stepChecked : Sess → Fin nActions → IO (Boundary.Transition obsShape nActions × Sess))
    (castObs castReward : Float → α)
    (predictLogits : Spec.Tensor α obsShape → Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar))
    (predictValue : Spec.Tensor α obsShape → α)
    (rngSeed rngCounter : ℕ) :
    IO (Rollout α obsShape nActions horizon × ℕ)

Collect a fixed-horizon rollout from any stateful environment session that can produce fully-observed, contract-checked transitions.

The caller provides:

start: how to initialize the session (often reset),
observe: how to read the current observation from the session,
stepChecked: one checked step returning an observed transition and the updated session,
castObs: how to inject host Float observations into the chosen scalar backend α,
castReward: how to inject host Float rewards into the chosen scalar backend α,
predictLogits: the current actor, mapping an observation tensor to a tensor of nActions logits,
predictValue: the current critic, mapping an observation tensor to a scalar α.

This keeps the PPO runtime API small while still supporting the “compiled model + parameters” calling convention used throughout TorchLean.
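The shape of such a loop can be sketched in core Lean 4. This is a simplified illustration, not the TorchLean implementation: `ToyTransition`, `ToyStep`, `collectToy`, and `chooseAction` are hypothetical names, observations are an abstract type rather than a tensor, and `chooseAction` stands in for sampling from `predictLogits`. Two behaviors are assumptions on my part: that the session is restarted when an episode ends, and that the returned natural number is the advanced RNG counter.

```lean
-- Hypothetical, simplified stand-ins; these are NOT the TorchLean definitions.

/-- A checked transition, reduced to the fields this sketch needs. -/
structure ToyTransition (Obs : Type) where
  nextObs    : Obs
  reward     : Float
  terminated : Bool
  truncated  : Bool

/-- One recorded rollout step. -/
structure ToyStep (Obs : Type) (nActions : Nat) where
  obs    : Obs
  action : Fin nActions
  value  : Float
  reward : Float
  done   : Bool

/-- Collect `horizon` steps from any session type `Sess`.
`chooseAction` receives the observation and the current RNG counter. -/
def collectToy {Obs Sess : Type} {nActions : Nat}
    (horizon : Nat)
    (start : IO Sess)
    (observe : Sess → Obs)
    (stepChecked : Sess → Fin nActions → IO (ToyTransition Obs × Sess))
    (chooseAction : Obs → Nat → Fin nActions)
    (predictValue : Obs → Float)
    (rngCounter : Nat) :
    IO (Array (ToyStep Obs nActions) × Nat) := do
  let mut sess ← start
  let mut steps : Array (ToyStep Obs nActions) := #[]
  let mut counter := rngCounter
  for _ in [0:horizon] do
    let obs := observe sess
    let action := chooseAction obs counter
    -- One counter tick per sampled action (assumed convention).
    counter := counter + 1
    let (tr, sess') ← stepChecked sess action
    let done := tr.terminated || tr.truncated
    let step : ToyStep Obs nActions :=
      { obs := obs, action := action, value := predictValue obs,
        reward := tr.reward, done := done }
    steps := steps.push step
    -- Assumed policy: restart the session when the episode ends.
    if done then
      sess ← start
    else
      sess := sess'
  return (steps, counter)
```

The loop never inspects `Sess`, so the same driver works for any backend that can supply the three session callbacks.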

Rollout collection from a checked session

def Runtime.RL.PPO.collectRolloutCheckedSessionWith
    {α : Type} [Context α] {obsShape : Spec.Shape} {nActions horizon : ℕ}
    [Fact (0 < horizon)] [Fact (0 < nActions)]
    (sess : Session.CheckedSession obsShape nActions)
    (castObs castReward : Float → α)
    (predictLogits : Spec.Tensor α obsShape → Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar))
    (predictValue : Spec.Tensor α obsShape → α)
    (rngSeed rngCounter : ℕ) :
    IO (Rollout α obsShape nActions horizon × ℕ)

Collect a fixed-horizon rollout from a unified Runtime.RL.Session.CheckedSession.

Rollout collection from Gymnasium (subprocess bridge)

def Runtime.RL.PPO.collectRolloutWith
    {α : Type} [Context α] {obsShape : Spec.Shape} {nActions horizon : ℕ}
    [Fact (0 < horizon)] [Fact (0 < nActions)]
    (castObs castReward : Float → α)
    (gym : Gymnasium.Client obsShape nActions)
    (predictLogits : Spec.Tensor α obsShape → Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar))
    (predictValue : Spec.Tensor α obsShape → α)
    (rngSeed rngCounter resetSeed : ℕ) :
    IO (Rollout α obsShape nActions horizon × ℕ)

Collect a fixed-horizon rollout from a Gymnasium subprocess environment.

This is a thin wrapper around collectRolloutSessionWith specialized to Gymnasium.Session.
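The "thin wrapper" pattern can be sketched as follows. Everything here is a hypothetical stand-in (`runSteps`, `ToyClient`, `ToySession`, and `runStepsToy` are illustrative names, not TorchLean or Gymnasium definitions). The sketch assumes that the wrapper's extra `resetSeed` argument is consumed by the `start` callback, which is one plausible reading of the signatures above rather than a confirmed detail:

```lean
-- Hypothetical stand-ins throughout; none of these are TorchLean definitions.

/-- Generic driver: run `n` steps against any session type and
return the total reward. -/
def runSteps {Sess : Type} (n : Nat)
    (start : IO Sess)
    (step : Sess → IO (Float × Sess)) : IO Float := do
  let mut sess ← start
  let mut total : Float := 0.0
  for _ in [0:n] do
    let (r, sess') ← step sess
    total := total + r
    sess := sess'
  return total

/-- A toy client handle, standing in for a subprocess-backed environment. -/
structure ToyClient where
  rewardPerStep : Float

/-- A toy session: the client plus a step counter. -/
structure ToySession where
  client : ToyClient
  t      : Nat

/-- Reset the environment. The seed is ignored in this toy version. -/
def ToySession.reset (c : ToyClient) (_seed : Nat) : IO ToySession :=
  pure { client := c, t := 0 }

/-- Advance one step and emit the client's fixed reward. -/
def ToySession.step (s : ToySession) : IO (Float × ToySession) :=
  pure (s.client.rewardPerStep, { s with t := s.t + 1 })

/-- Thin wrapper: fixes the session type and fills in the callbacks,
so callers pass a client and a reset seed instead of three functions. -/
def runStepsToy (n : Nat) (c : ToyClient) (resetSeed : Nat) : IO Float :=
  runSteps n (ToySession.reset c resetSeed) ToySession.step
```

The wrapper adds no logic of its own. All stepping behavior stays in the generic driver, so contract checks applied there cover every backend.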

TorchLean API

NN.Runtime.RL.PPO.Collect
