PPO on Gymnasium CartPole (Executable Demo) #
This example is intentionally “small but complete”:
- Environment: external Python Gymnasium (started as a subprocess).
- Trust boundary: every step is checked against a Lean-side contract (`Runtime.RL.Boundary.Contract`) before being used as training data.
- Algorithm: PPO with GAE (all update math is Lean definitions; the PPO loss is a TorchLean autograd program).
More concretely:
- The policy is a categorical distribution over discrete actions parameterized by logits: π_θ(a | s) = softmax(logits_θ(s)).
- Advantages are computed using Generalized Advantage Estimation (GAE(λ)).
- PPO uses the clipped surrogate objective (plus a value-loss and optional entropy bonus, depending on the runtime configuration).
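For reference, the standard definitions behind these bullets (Schulman et al., 2015; 2017) are:

```latex
% Policy: categorical distribution over discrete actions
\pi_\theta(a \mid s) = \mathrm{softmax}(\mathrm{logits}_\theta(s))_a

% One-step TD residual and GAE(\lambda) advantage
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}

% PPO clipped surrogate (maximized), with probability ratio r_t
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]
```

The value loss and optional entropy bonus mentioned above are added to this clipped term with configurable coefficients.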
CLI flags #
- `--cuda`: run the Torch backend on CUDA (requires building with `-K cuda=true`).
- `--seed <n>`: deterministic seed for TorchLean RNG streams (and evaluation seeding).
- `--updates <n>`: limit the number of PPO rollout/update cycles.
- `--log <path>`: write the widget log JSON to a custom path.
Run (from the repo root):
python3 -m pip install --user 'gymnasium>=1.0'
lake exe torchlean ppo_cartpole
lake build -R -K cuda=true && lake exe torchlean ppo_cartpole --cuda
Artifacts:
- The executable writes a widget-friendly training curve JSON to `data/rl/ppo_cartpole_trainlog.json` (override with `--log <path>`).
- Visualize it in the editor via `NN/Examples/RL/PPOCartPoleView.lean`.
What this run does (and does not) guarantee #
- The PPO/GAE math and the autograd loss program are Lean definitions, so they are suitable targets for formal reasoning.
- When Gymnasium is external, TorchLean cannot prove the environment satisfies Markov/measurability assumptions. The trust-boundary contract turns some common assumptions (finite tensors, reward bounds, done-flag semantics) into checked preconditions.
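To make the checked preconditions concrete, here is a minimal sketch of what such a boundary check might look like. This is illustrative Python, not the Lean contract; the specific `reward_bound` value and the exact set of checks are assumptions for the sake of the example.

```python
import math

def check_step(obs, reward, terminated, truncated, reward_bound=100.0):
    """Reject a transition that violates basic trust-boundary assumptions
    (finite observations, bounded reward, well-typed done flags)."""
    if any(not math.isfinite(v) for v in obs):
        raise ValueError("non-finite observation")
    if not math.isfinite(reward) or abs(reward) > reward_bound:
        raise ValueError("reward outside [-%g, %g]" % (reward_bound, reward_bound))
    if not isinstance(terminated, bool) or not isinstance(truncated, bool):
        raise TypeError("done flags must be booleans")
    return True

# A well-formed CartPole-like step passes the check.
check_step([0.01, -0.02, 0.03, 0.04], 1.0, False, False)
```

The Lean-side contract plays the same role, but as a typed precondition that every transition must satisfy before it enters the rollout buffer.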
- This is not a tuned “benchmark PPO” implementation. It is designed to be readable, typed, and easy to inspect with widgets.
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Brockman et al., "OpenAI Gym" (2016): https://arxiv.org/abs/1606.01540
- Gymnasium API docs (reset/step, `terminated` vs `truncated`): https://gymnasium.farama.org/
- CartPole environment docs: https://gymnasium.farama.org/environments/classic_control/cart_pole/
Name of this executable target (used in CLI error messages and banners).
Instances For
Configuration #
We keep this example discrete-action and small (CartPole) so it runs quickly in a native Lean executable.
Gymnasium environment id passed to the Python subprocess (see Gymnasium docs for supported ids).
Relative path to the Python Gymnasium bridge script (spawned as a subprocess).
Observation vector dimension for CartPole (Gymnasium reports 4 floats).
Number of discrete actions for CartPole (left/right).
PPO rollout horizon (also the training batch size for this run).
Discount factor used in returns / GAE.
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
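The backward recursion that the discount factor γ and the GAE λ parameter feed can be sketched as follows. This is illustrative Python, not the Lean definition; it is the standard GAE(λ) recursion from Schulman et al. (2015).

```python
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward GAE(lambda) recursion.

    `values` has len(rewards) + 1 entries: the last entry is the
    bootstrap value of the state after the final step.
    """
    adv = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Accumulate: A_t = delta_t + gamma * lambda * A_{t+1}
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv
```

Setting λ = 1 recovers Monte Carlo returns minus the baseline (high variance, low bias); λ = 0 recovers the one-step TD residual (low variance, higher bias).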
Number of PPO optimization epochs per collected rollout batch.
Maximum number of PPO updates (training stops early if the "solved" criterion triggers).
Evaluate the greedy policy every `evalEvery` PPO updates.
Number of evaluation episodes per checkpoint.
Stop early if average return meets/exceeds this threshold.
The observation tensor shape used by this run: `[..., stateDim]`.
Model (Actor + Critic) #
We use the public `API.nn` facade, which provides "prefix-shape preserving" layers:
if `x` has shape `[..., inDim]`, `nn.linear inDim outDim (pfx := ...)` maps it to `[..., outDim]`.
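The "prefix-shape preserving" behavior is the same broadcasting rule a batched matrix multiply gives you. A minimal NumPy sketch (illustrative, not TorchLean; the names `linear_prefix`, `stateDim = 4`, `actionDim = 2` are chosen for this example):

```python
import numpy as np

def linear_prefix(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a linear layer over the last axis only: [..., inDim] -> [..., outDim].

    w has shape [inDim, outDim], b has shape [outDim]; the matmul
    broadcasts over every leading (prefix) axis of x unchanged.
    """
    return x @ w + b

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 4))   # prefix (8, 16), stateDim = 4
w = rng.normal(size=(4, 2))       # actionDim = 2 logits
b = np.zeros(2)
y = linear_prefix(x, w, b)        # shape (8, 16, 2)
```

This is why the same actor/critic definitions work both on a single observation `[stateDim]` and on a whole rollout batch `[T, stateDim]`.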
Construct the actor network as an MLP mapping observations to action logits.
Construct the critic network as an MLP mapping observations to a scalar value estimate.
Gymnasium Bridge #
We talk to a small Python helper (scripts/rl/gymnasium_server.py) using the reusable runtime
bridge in Runtime.RL.Gymnasium (exposed as rl.gym.*).
The Lean-side trust-boundary contract (rl.boundary.Contract) is enforced on every step.
Evaluation #
Evaluation helpers live in NN.API.rl.eval (runtime module NN.Runtime.RL.Eval).
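A greedy evaluation loop of the kind these helpers implement can be sketched as follows. This is illustrative Python, not the `NN.Runtime.RL.Eval` code; `env_reset`/`env_step` stand in for the Gymnasium bridge and `policy_logits` for the trained actor.

```python
def evaluate_greedy(env_reset, env_step, policy_logits, episodes=3):
    """Average undiscounted return of the argmax (greedy) policy."""
    total = 0.0
    for _ in range(episodes):
        obs = env_reset()
        done = False
        while not done:
            logits = policy_logits(obs)
            # Greedy: pick the action with the largest logit, no sampling.
            action = max(range(len(logits)), key=lambda a: logits[a])
            obs, reward, terminated, truncated = env_step(action)
            total += reward
            done = terminated or truncated
    return total / episodes
```

The run uses this kind of checkpointed evaluation to decide whether the "solved" threshold has been met and training can stop early.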
Main Training Loop #
Entry point for lake exe torchlean ppo_cartpole.
This executable:
- launches a Python Gymnasium subprocess for `CartPole-v1`,
- collects checked rollouts under `rl.boundary.Contract`,
- performs PPO updates on the Torch backend (CPU or CUDA via `--cuda`),
- writes a widget-friendly training curve JSON (default: `data/rl/ppo_cartpole_trainlog.json`).