PPO on Lean-native GridWorld (Executable Demo + Formal Model) #
This example complements `torchlean ppo_cartpole`:
- `torchlean ppo_cartpole` uses an external Python Gymnasium environment and checks every step against a Lean-side trust-boundary contract (`Runtime.RL.Boundary.Contract`).
- This example uses a Lean-native GridWorld and still runs the same PPO update in Lean.
Even though the environment is defined in Lean, we still validate every transition with the boundary checker. That keeps the data model unified: downstream training code consumes `Spec.RL.ObservedTransition` in a single format regardless of whether the source is a Lean-native environment or an external sampler.
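To make the unified format concrete, here is a minimal sketch of what such a transition record contains. The field names are illustrative only and do not reproduce the actual `Spec.RL.ObservedTransition` definition:

```lean
-- Hypothetical sketch only: illustrates the idea of one transition
-- record shared by Lean-native and external samplers. These field
-- names are NOT the real `Spec.RL.ObservedTransition` fields.
structure ObservedTransitionSketch (S A : Type) where
  state      : S       -- observation before the step
  action     : A       -- action taken by the policy
  reward     : Float   -- scalar reward returned by the environment
  nextState  : S       -- observation after the step
  terminated : Bool    -- whether the episode ended on this step
```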
Formal hooks #
- The environment has an induced finite stochastic MDP (`Spec.RL.FiniteStochastic.MDP`), and we import a proof that it is well-formed (row-stochastic transition rows, `0 ≤ γ < 1`).
- The boundary checker can be turned into a Prop-level hypothesis via `Proofs.RL.Boundary.contractHolds_of_checkTransitionFin_eq_ok` (see `NN/Proofs/RL/Boundary.lean`), or you can use the proof-layer Gymnasium wrapper `Runtime.RL.Gymnasium.Session.stepCheckedWithProof` (`NN/Proofs/RL/Gymnasium.lean`) for external environments. A toy sketch of this check-to-Prop pattern follows this list.
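The check-to-Prop pattern itself is simple. The self-contained toy below (its names are stand-ins, not the actual boundary-checker API) shows the shape: a runtime check returning `Except` is reflected into a `Prop` hypothesis that downstream theorems can consume.

```lean
-- Toy illustration of the check-to-Prop pattern: if the runtime check
-- succeeded, we may conclude the property it was checking.
def checkPos (n : Nat) : Except String Unit :=
  if n > 0 then .ok () else .error "expected a positive number"

theorem pos_of_checkPos {n : Nat} (h : checkPos n = .ok ()) : n > 0 := by
  by_cases hn : n > 0
  · exact hn
  · simp [checkPos, hn] at h  -- the check cannot return .ok when ¬ n > 0
```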
CLI flags #
- `--cuda`: run the Torch backend on CUDA (requires building with `-K cuda=true`).
- `--updates <n>`: number of PPO updates to run.
- `--eval-every <n>`: evaluate the greedy policy every `n` updates.
- `--eval-episodes <n>`: number of evaluation episodes per checkpoint.
- `--eval-max-steps <n>`: maximum steps per evaluation episode.
- `--log <path>`, `--policy <path>`, `--path <path>`: override artifact output paths.
Run (from the repo root):
- CPU (defaults): `lake exe torchlean ppo_gridworld`
- CUDA: `lake build -R -K cuda=true && lake exe torchlean ppo_gridworld --cuda`
- Longer run: `lake exe torchlean ppo_gridworld --updates 200`
Artifacts:
- The executable writes widget-friendly JSON snapshots to `data/rl/` by default: `ppo_gridworld_trainlog.json`, `ppo_gridworld_policy.json`, `ppo_gridworld_path.json` (override with `--log`, `--policy`, `--path`).
- You can also tune runtime cost with `--updates`, `--eval-every`, `--eval-episodes`, `--eval-max-steps`.
- Visualize them in the editor via `NN/Examples/RL/PPOGridWorldView.lean`.
What this example does (and does not) guarantee #
- Because the environment dynamics are Lean code, you can reason about its properties directly (e.g. determinism, Markov property w.r.t. the explicit state, bounded rewards); see the sketch after this list.
- The PPO/GAE update is implemented as Lean definitions and a TorchLean autograd program, so it is a natural target for formal proofs about the update equation.
- As in most practical PPO code, convergence and optimality are not guaranteed by this example; it is tuned for inspectability and type safety, not leaderboard performance.
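As an illustration of the first point, a determinism statement over a pure Lean transition function is essentially definitional. The snippet below is a self-contained toy; its `step` is a stand-in, not this example's real transition function:

```lean
-- Toy determinism statement: a pure Lean step function trivially
-- produces exactly one successor per (state, action) pair.
def step (s : Nat × Nat) (a : Fin 4) : Nat × Nat :=
  match a with
  | ⟨0, _⟩ => (s.1, s.2 + 1)   -- up
  | ⟨1, _⟩ => (s.1, s.2 - 1)   -- down (Nat subtraction clamps at 0)
  | ⟨2, _⟩ => (s.1 - 1, s.2)   -- left
  | _      => (s.1 + 1, s.2)   -- right

theorem step_deterministic (s : Nat × Nat) (a : Fin 4) {s₁ s₂ : Nat × Nat}
    (h₁ : s₁ = step s a) (h₂ : s₂ = step s a) : s₁ = s₂ :=
  h₁.trans h₂.symm
```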
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., GridWorld examples): http://incompleteideas.net/book/the-book-2nd.html
- Puterman, Markov Decision Processes (finite discounted MDPs): https://doi.org/10.1002/9780470316887
Name of this executable target (used in CLI error messages and banners).
Configuration #
Number of discrete actions (up/down/left/right).
PPO rollout horizon (also the training batch size for this example).
Discount factor used in returns / GAE.
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
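For reference, this is the GAE(λ) estimator from Schulman et al. (2015), combining exponentially weighted temporal-difference residuals (γ is the discount factor defined above):

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}
  = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}
```

Setting λ = 0 recovers the one-step TD advantage (lower variance, more bias); λ = 1 recovers the Monte Carlo advantage (less bias, higher variance).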
Number of PPO optimization epochs per collected rollout batch.
Default maximum number of PPO updates (can be overridden by --updates).
Default evaluation checkpoint interval (can be overridden by --eval-every).
Default evaluation episodes per checkpoint (can be overridden by --eval-episodes).
The observation tensor shape used by this example: [..., nStates] one-hot vectors.
Formal GridWorld model (spec/proof layer) #
We define a real-valued GridWorld model and record the proof that its stochastic-MDP view is valid.
This proof is not used by the executable training loop directly; it exists so that downstream theorems about the induced MDP can refer to a concrete environment used in an example.
Start position (top-left cell).
Goal position (bottom-right cell).
A discount factor in [0,1) at the proof layer (ℝ), used to build an MDP instance.
Proof-layer GridWorld instance over ℝ rewards/discounts.
The induced finite stochastic MDP for gwR is well-formed (0 ≤ γ < 1, row-stochastic transitions).
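For intuition, the well-formedness predicate amounts to something like the sketch below. The packaging and names are illustrative only; the real statement is the `Spec.RL.FiniteStochastic.MDP` well-formedness condition in the proof layer, not this structure:

```lean
import Mathlib

-- Illustrative only: NOT the actual well-formedness predicate, just the
-- mathematical content it encodes for an n-state transition kernel P.
structure WellFormedSketch {n : ℕ} (P : Fin n → Fin n → ℝ) (γ : ℝ) : Prop where
  nonneg        : ∀ s s', 0 ≤ P s s'       -- probabilities are nonnegative
  rowStochastic : ∀ s, (∑ s', P s s') = 1  -- each transition row sums to one
  discountRange : 0 ≤ γ ∧ γ < 1            -- discount factor in [0, 1)
```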
Lean-native runtime environment #
We implement a Gym-style environment (Spec.RL.Env) whose observations are one-hot vectors
over the flattened finite state space Fin nStates.
This keeps the PPO code identical to the Gymnasium example: the policy consumes tensors and produces logits over a finite action set.
Observation function: encode the discrete state as a one-hot vector of length nStates.
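A minimal sketch of this encoding, simplified to a plain `Array Float` rather than the example's tensor type:

```lean
-- Sketch of one-hot encoding over a flattened state space; the real
-- observation function produces a tensor, not an `Array Float`.
def oneHotSketch (nStates : Nat) (s : Fin nStates) : Array Float :=
  (Array.range nStates).map (fun i => if i = s.val then 1.0 else 0.0)

#eval oneHotSketch 4 2  -- #[0.000000, 0.000000, 1.000000, 0.000000]
```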
Absolute difference on natural-number coordinates, returned as a Float.
Manhattan distance to the goal.
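These two helpers have roughly the following shape (illustrative names and signatures matching the descriptions above, not the source definitions):

```lean
-- Natural-number absolute difference as a Float, and the Manhattan
-- distance to a goal cell built from it.
def absDiffSketch (a b : Nat) : Float :=
  Float.ofNat (max a b - min a b)

def manhattanSketch (p goal : Nat × Nat) : Float :=
  absDiffSketch p.1 goal.1 + absDiffSketch p.2 goal.2
```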
Deterministic GridWorld transition function with dense progress rewards.
The original sparse reward (-1 per step until terminal) gave short runs too little to learn from: random rollouts rarely reached the goal, so PPO saw almost no informative signal. The shaped reward keeps the same goal-reaching task but gives the learner immediate credit for moving closer to the goal and a small penalty for dithering.
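A hedged sketch of the shaping idea; the constants and signature here are illustrative stand-ins, not the example's actual definition:

```lean
-- Illustrative reward shaping: reward progress toward the goal and
-- charge a small per-step cost so that dithering is penalized.
def shapedRewardSketch (distBefore distAfter : Float) (atGoal : Bool) : Float :=
  if atGoal then
    1.0                              -- terminal bonus for reaching the goal
  else
    (distBefore - distAfter) - 0.01  -- progress credit minus step cost
```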
Lean-native environment packaged as a Spec.RL.Env for reuse with the generic RL runtime.
Trust boundary contract #
Even though this environment is Lean-native, we keep a contract in play to exercise the “checked preconditions” workflow and to keep the interface identical to the external Gymnasium collector.
Model (Actor + Critic) #
Construct the actor network as an MLP mapping one-hot observations to action logits.
Construct the critic network as an MLP mapping one-hot observations to a scalar value estimate.
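The input/output contract both networks must respect can be illustrated with a plain dense layer over arrays. This is not the TorchLean MLP API, only the shapes: the actor maps a length-`nStates` one-hot vector to `nActions` logits, the critic maps the same input to a single value estimate.

```lean
-- Shape illustration only, not TorchLean code: one dense layer.
def dot (a b : Array Float) : Float :=
  (a.zip b).foldl (fun acc p => acc + p.1 * p.2) 0.0

def denseSketch (w : Array (Array Float)) (b : Array Float)
    (x : Array Float) : Array Float :=
  (w.zip b).map (fun rb => dot rb.1 x + rb.2)
```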
Rollout collection (Lean-native environment) #
Collect a fixed-horizon PPO rollout from the Lean-native environment using a checked session.
This is an example-local wrapper around rl.ppo.collectRolloutCheckedSessionWith that packages:
- a `Spec.RL.Env` as a `rl.session.CheckedSession`, and
- the (actor, critic) prediction functions at the observation shape.
A generic sketch of the fixed-horizon loop follows.
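Purely illustrative: the real collector also runs the boundary checker on every transition and records full `Spec.RL.ObservedTransition` data, while this version only shows the loop structure (step, record, reset on episode end).

```lean
-- Generic fixed-horizon rollout sketch over an abstract environment.
def rolloutSketch {S A : Type} (stepFn : S → A → S × Float × Bool)
    (policy : S → A) (start : S) :
    Nat → S → Array (S × A × Float × Bool) → Array (S × A × Float × Bool)
  | 0, _, acc => acc
  | n + 1, s, acc =>
    let a := policy s
    let (s', r, done) := stepFn s a
    rolloutSketch stepFn policy start n (if done then start else s')
      (acc.push (s, a, r, done))
```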
Evaluation #
Evaluation helpers live in NN.API.rl.eval (runtime module NN.Runtime.RL.Eval).
Main Training Loop #
Entry point for lake exe torchlean ppo_gridworld.
This executable (a minimal loop sketch follows this list):
- runs PPO updates against a Lean-native GridWorld environment,
- periodically evaluates the greedy policy and logs the average return,
- writes widget-friendly JSON artifacts (training curve, greedy policy snapshot, greedy path snapshot).
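The loop described above has roughly this shape. All names are illustrative stand-ins; the real entry point also handles CLI parsing and artifact paths:

```lean
-- High-level sketch of the training loop, with IO stand-ins for the
-- real collect / update / evaluate / log stages.
def trainLoopSketch (updates evalEvery : Nat)
    (collectAndUpdate : IO Unit) (evalGreedy : IO Float)
    (logSnapshot : Nat → Float → IO Unit) : IO Unit := do
  for i in [0:updates] do
    collectAndUpdate                 -- rollout collection + PPO epochs
    if (i + 1) % evalEvery == 0 then
      let avgReturn ← evalGreedy     -- greedy-policy evaluation
      logSnapshot (i + 1) avgReturn  -- widget-friendly JSON artifacts
```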