PPO on Gymnasium CartPole (Executable Demo) #
This example is intentionally “small but complete”:
- Environment: external Python Gymnasium (started as a subprocess).
- Trust boundary: every step is checked against a Lean-side contract (`Runtime.RL.Boundary.Contract`) before being used as training data.
- Algorithm: PPO with GAE (all update math is Lean definitions; the PPO loss is a TorchLean autograd program).
More concretely:
- The policy is a categorical distribution over discrete actions parameterized by logits: π_θ(a | s) = softmax(logits_θ(s)).
- Advantages are computed using Generalized Advantage Estimation (GAE(λ)).
- PPO uses the clipped surrogate objective (plus a value-loss and optional entropy bonus, depending on the runtime configuration).
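For reference, the standard definitions behind these bullets (Schulman et al., 2015; 2017) are:

```latex
% Policy: categorical distribution over discrete actions
\pi_\theta(a \mid s) = \mathrm{softmax}(\mathrm{logits}_\theta(s))_a

% One-step TD residual and GAE(\lambda) advantage
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}

% PPO clipped surrogate (maximized), with probability ratio r_t
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]
```

The value loss and optional entropy bonus mentioned above are added to this clipped term with configurable coefficients.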
CLI flags #
- `--cuda`: run the Torch backend on CUDA (requires building with `-K cuda=true`).
- `--seed <n>`: deterministic seed for TorchLean RNG streams (and evaluation seeding).
- `--updates <n>`: limit the number of PPO rollout/update cycles.
- `--log <path>`: write the widget log JSON to a custom path.
Run (from the repo root):
python3 -m pip install --user 'gymnasium>=1.0'
lake exe torchlean ppo_cartpole
lake build -R -K cuda=true && lake exe torchlean ppo_cartpole --cuda
Artifacts:
- The executable writes a widget-friendly training curve JSON to `data/rl/ppo_cartpole_trainlog.json` (override with `--log <path>`).
- Visualize it in the editor via `NN/Examples/RL/PPOCartPoleView.lean`.
What this run does (and does not) guarantee #
- The PPO/GAE math and the autograd loss program are Lean definitions, so they are suitable targets for formal reasoning.
- When Gymnasium is external, TorchLean cannot prove the environment satisfies Markov/measurability assumptions. The trust-boundary contract turns some common assumptions (finite tensors, reward bounds, done-flag semantics) into checked preconditions.
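To make the checked preconditions concrete, here is a minimal sketch of what such a boundary check might look like. This is illustrative Python, not the Lean contract; the specific `reward_bound` value and the exact set of checks are assumptions for the sake of the example.

```python
import math

def check_step(obs, reward, terminated, truncated, reward_bound=100.0):
    """Reject a transition that violates basic trust-boundary assumptions
    (finite observations, bounded reward, well-typed done flags)."""
    if any(not math.isfinite(v) for v in obs):
        raise ValueError("non-finite observation")
    if not math.isfinite(reward) or abs(reward) > reward_bound:
        raise ValueError("reward outside [-%g, %g]" % (reward_bound, reward_bound))
    if not isinstance(terminated, bool) or not isinstance(truncated, bool):
        raise TypeError("done flags must be booleans")
    return True

# A well-formed CartPole-like step passes the check.
check_step([0.01, -0.02, 0.03, 0.04], 1.0, False, False)
```

The Lean-side contract plays the same role, but as a typed precondition that every transition must satisfy before it enters the rollout buffer.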
- This is not a tuned “benchmark PPO” implementation. It is designed to be readable, typed, and easy to inspect with widgets.
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Brockman et al., "OpenAI Gym" (2016): https://arxiv.org/abs/1606.01540
- Gymnasium API docs (reset/step, `terminated` vs `truncated`): https://gymnasium.farama.org/
- CartPole environment docs: https://gymnasium.farama.org/environments/classic_control/cart_pole/
Name of this executable target (used in CLI error messages and banners).
Instances For
Configuration #
We keep this example discrete-action and small (CartPole) so it runs quickly in a native Lean executable.
Gymnasium environment id passed to the Python subprocess (see Gymnasium docs for supported ids).
Relative path to the Python Gymnasium bridge script (spawned as a subprocess).
Observation vector dimension for CartPole (Gymnasium reports 4 floats).
Number of discrete actions for CartPole (left/right).
PPO rollout horizon (also the training batch size for this run).
Discount factor used in returns / GAE.
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
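The backward recursion that the discount factor γ and the GAE λ parameter feed can be sketched as follows. This is illustrative Python, not the Lean definition; it is the standard GAE(λ) recursion from Schulman et al. (2015).

```python
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward GAE(lambda) recursion.

    `values` has len(rewards) + 1 entries: the last entry is the
    bootstrap value of the state after the final step.
    """
    adv = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Accumulate: A_t = delta_t + gamma * lambda * A_{t+1}
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv
```

Setting λ = 1 recovers Monte Carlo returns minus the baseline (high variance, low bias); λ = 0 recovers the one-step TD residual (low variance, higher bias).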
Number of PPO optimization epochs per collected rollout batch.
Maximum number of PPO updates (training stops early if the "solved" criterion triggers).
Evaluate the greedy policy every `evalEvery` PPO updates.
Number of evaluation episodes per checkpoint.
Stop early if average return meets/exceeds this threshold.
The observation tensor shape used by this run: `[..., stateDim]`.
Model (Actor + Critic) #
We use the public `API.nn` facade, which provides "prefix-shape preserving" layers:
if `x` has shape `[..., inDim]`, `nn.linear inDim outDim (pfx := ...)` maps it to `[..., outDim]`.
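The "prefix-shape preserving" behavior is the same broadcasting rule a batched matrix multiply gives you. A minimal NumPy sketch (illustrative, not TorchLean; the names `linear_prefix`, `stateDim = 4`, `actionDim = 2` are chosen for this example):

```python
import numpy as np

def linear_prefix(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a linear layer over the last axis only: [..., inDim] -> [..., outDim].

    w has shape [inDim, outDim], b has shape [outDim]; the matmul
    broadcasts over every leading (prefix) axis of x unchanged.
    """
    return x @ w + b

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 4))   # prefix (8, 16), stateDim = 4
w = rng.normal(size=(4, 2))       # actionDim = 2 logits
b = np.zeros(2)
y = linear_prefix(x, w, b)        # shape (8, 16, 2)
```

This is why the same actor/critic definitions work both on a single observation `[stateDim]` and on a whole rollout batch `[T, stateDim]`.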
Construct the actor network as an MLP mapping observations to action logits.
Construct the critic network as an MLP mapping observations to a scalar value estimate.
Gymnasium Bridge #
We talk to a small Python helper (scripts/rl/gymnasium_server.py) using the reusable runtime
bridge in Runtime.RL.Gymnasium (exposed as rl.gym.*).
The Lean-side trust-boundary contract (rl.boundary.Contract) is enforced on every step.
Evaluation #
Evaluation helpers live in NN.API.rl.eval (runtime module NN.Runtime.RL.Eval).
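A greedy evaluation loop of the kind these helpers implement can be sketched as follows. This is illustrative Python, not the `NN.Runtime.RL.Eval` code; `env_reset`/`env_step` stand in for the Gymnasium bridge and `policy_logits` for the trained actor.

```python
def evaluate_greedy(env_reset, env_step, policy_logits, episodes=3):
    """Average undiscounted return of the argmax (greedy) policy."""
    total = 0.0
    for _ in range(episodes):
        obs = env_reset()
        done = False
        while not done:
            logits = policy_logits(obs)
            # Greedy: pick the action with the largest logit, no sampling.
            action = max(range(len(logits)), key=lambda a: logits[a])
            obs, reward, terminated, truncated = env_step(action)
            total += reward
            done = terminated or truncated
    return total / episodes
```

The run uses this kind of checkpointed evaluation to decide whether the "solved" threshold has been met and training can stop early.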
Main Training Loop #
Entry point for lake exe torchlean ppo_cartpole.
This executable:
- launches a Python Gymnasium subprocess for `CartPole-v1`,
- collects checked rollouts under `rl.boundary.Contract`,
- performs PPO updates on the Torch backend (CPU or CUDA via `--cuda`),
- writes a widget-friendly training curve JSON (default: `data/rl/ppo_cartpole_trainlog.json`).