PPO on Gymnasium CartPole (Executable Example) #
This example is small but complete:
- Environment: external Python Gymnasium (started as a subprocess).
- Trust boundary: every step is checked against a Lean-side contract
(
Runtime.RL.Boundary.Contract) before being used for training data. - Algorithm: PPO with GAE (all update math is Lean definitions; the PPO loss is a TorchLean autograd program).
More concretely:
- The policy is a categorical distribution over discrete actions parameterized by logits
π_θ(a | s) = softmax(logits_θ(s)). - Advantages are computed using Generalized Advantage Estimation (GAE(λ)).
- PPO uses the clipped surrogate objective (plus a value-loss and optional entropy bonus, depending on the runtime configuration).
CLI flags #
--cuda: run the Torch backend on CUDA (requires building with-K cuda=true).--seed <n>: deterministic seed for TorchLean RNG streams (and evaluation seeding).--updates <n>: limit the number of PPO rollout/update cycles.--log <path>: write the widget log JSON to a custom path.
Run (from the repo root):
python3 -m pip install --user 'gymnasium>=1.0'
lake exe -K cuda=true torchlean ppo_cartpole --cuda --updates 1 --eval-every 1 --eval-episodes 1 --eval-max-steps 8
Artifacts:
- The executable writes a widget-friendly training curve JSON to
data/rl/ppo_cartpole_trainlog.json(override with--log <path>). - Visualize it in the editor via
NN/Examples/RL/PPOCartPoleView.lean.
What this run does (and does not) guarantee #
- The PPO/GAE math and the autograd loss program are Lean definitions, so they are suitable targets for formal reasoning.
- When Gymnasium is external, TorchLean cannot prove the environment satisfies Markov/measurability assumptions. The trust-boundary contract turns some common assumptions (finite tensors, reward bounds, done-flag semantics) into checked preconditions.
- The run favors readability, typed boundaries, and widget inspection over benchmark-specific PPO tuning.
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Brockman et al., "OpenAI Gym" (2016): https://arxiv.org/abs/1606.01540
- Gymnasium API docs (reset/step,
terminatedvstruncated): https://gymnasium.farama.org/ - CartPole environment docs: https://gymnasium.farama.org/environments/classic_control/cart_pole/
Name of this executable target (used in CLI error messages and banners).
Instances For
Configuration #
This stays with discrete-action CartPole so the native Lean executable remains easy to run and inspect.
Gymnasium environment id passed to the Python subprocess (see Gymnasium docs for supported ids).
Instances For
Relative path to the Python Gymnasium bridge script (spawned as a subprocess).
Instances For
Observation vector dimension for CartPole (Gymnasium reports 4 floats).
Instances For
Number of discrete actions for CartPole (left/right).
Instances For
PPO rollout horizon (also the training batch size for this run).
Instances For
Discount factor used in returns / GAE.
Instances For
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
Instances For
Adam learning rate used for the CartPole actor-critic update.
Instances For
Number of PPO optimization epochs per collected rollout batch.
Instances For
Maximum number of PPO updates (training stops early if the "solved" criterion triggers).
Instances For
Evaluate (greedy policy) every evalEvery PPO updates.
Instances For
Number of evaluation episodes per checkpoint.
Instances For
Stop early if average return meets/exceeds this threshold.
Instances For
The observation tensor shape used by this run: [..., stateDim].
Instances For
Model (Actor + Critic) #
We use the public TorchLean.nn surface, which provides prefix-shape preserving layers:
if x has shape [..., inDim], nn.Linear inDim outDim maps it to [..., outDim].
Construct the actor network as an MLP mapping observations to action logits.
Instances For
Construct the critic network as an MLP mapping observations to a scalar value estimate.
Instances For
Gymnasium Bridge #
We talk to a small Python service (scripts/rl/gymnasium_server.py) using the reusable runtime
bridge exposed as rl.gym.*.
The Lean-side trust-boundary contract (rl.boundary.Contract) is enforced on every step.
Evaluation #
Evaluation APIs live in rl.eval.
Main Training Loop #
Entry point for lake exe torchlean ppo_cartpole.
This executable:
- launches a Python Gymnasium subprocess for
CartPole-v1, - collects checked rollouts under
rl.boundary.Contract, - performs PPO updates on the Torch backend (CUDA is the practical validation path for this command),
- writes a widget-friendly training curve JSON (default:
data/rl/ppo_cartpole_trainlog.json).