PPO on Lean-native GridWorld (Executable Demo + Formal Model) #
This example complements `torchlean ppo_cartpole`:
- `torchlean ppo_cartpole` uses an external Python Gymnasium environment and checks every step against a Lean-side trust-boundary contract (`Runtime.RL.Boundary.Contract`).
- This example uses a Lean-native GridWorld and still runs the same PPO update in Lean.
Even though the environment is defined in Lean, we still validate every transition with the boundary checker. That keeps the data model unified: downstream training code consumes `Spec.RL.ObservedTransition` in a single format regardless of whether the source is a Lean-native environment or an external sampler.
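To make the unified format concrete, here is a minimal sketch of what such a transition record contains. The field names are illustrative only and do not reproduce the actual `Spec.RL.ObservedTransition` definition:

```lean
-- Hypothetical sketch only: illustrates the idea of one transition
-- record shared by Lean-native and external samplers. These field
-- names are NOT the real `Spec.RL.ObservedTransition` fields.
structure ObservedTransitionSketch (S A : Type) where
  state      : S       -- observation before the step
  action     : A       -- action taken by the policy
  reward     : Float   -- scalar reward returned by the environment
  nextState  : S       -- observation after the step
  terminated : Bool    -- whether the episode ended on this step
```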
Formal hooks #
- The environment has an induced finite stochastic MDP (`Spec.RL.FiniteStochastic.MDP`), and we import a proof that it is well-formed (row-stochastic transition rows, `0 ≤ γ < 1`).
- The boundary checker can be turned into a Prop-level hypothesis via `Proofs.RL.Boundary.contractHolds_of_checkTransitionFin_eq_ok` (see `NN/Proofs/RL/Boundary.lean`), or you can use the proof-layer Gymnasium wrapper `Runtime.RL.Gymnasium.Session.stepCheckedWithProof` (`NN/Proofs/RL/Gymnasium.lean`) for external environments. A toy sketch of this check-to-Prop pattern follows this list.
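The check-to-Prop pattern itself is simple. The self-contained toy below (its names are stand-ins, not the actual boundary-checker API) shows the shape: a runtime check returning `Except` is reflected into a `Prop` hypothesis that downstream theorems can consume.

```lean
-- Toy illustration of the check-to-Prop pattern: if the runtime check
-- succeeded, we may conclude the property it was checking.
def checkPos (n : Nat) : Except String Unit :=
  if n > 0 then .ok () else .error "expected a positive number"

theorem pos_of_checkPos {n : Nat} (h : checkPos n = .ok ()) : n > 0 := by
  by_cases hn : n > 0
  · exact hn
  · simp [checkPos, hn] at h  -- the check cannot return .ok when ¬ n > 0
```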
CLI flags #
- `--cuda`: run the Torch backend on CUDA (requires building with `-K cuda=true`).
- `--updates <n>`: number of PPO updates to run.
- `--eval-every <n>`: evaluate the greedy policy every `n` updates.
- `--eval-episodes <n>`: number of evaluation episodes per checkpoint.
- `--eval-max-steps <n>`: maximum steps per evaluation episode.
- `--log <path>`, `--policy <path>`, `--path <path>`: override artifact output paths.
Run (from the repo root):
- CPU (defaults): `lake exe torchlean ppo_gridworld`
- CUDA: `lake build -R -K cuda=true && lake exe torchlean ppo_gridworld --cuda`
- Longer run: `lake exe torchlean ppo_gridworld --updates 200`
Artifacts:
- The executable writes widget-friendly JSON snapshots to `data/rl/` by default: `ppo_gridworld_trainlog.json`, `ppo_gridworld_policy.json`, `ppo_gridworld_path.json` (override with `--log`, `--policy`, `--path`).
- You can also tune runtime cost with `--updates`, `--eval-every`, `--eval-episodes`, `--eval-max-steps`.
- Visualize them in the editor via `NN/Examples/RL/PPOGridWorldView.lean`.
What this example does (and does not) guarantee #
- Because the environment dynamics are Lean code, you can reason about its properties directly (e.g. determinism, Markov property w.r.t. the explicit state, bounded rewards); see the sketch after this list.
- The PPO/GAE update is implemented as Lean definitions and a TorchLean autograd program, so it is a natural target for formal proofs about the update equation.
- As in most practical PPO code, convergence and optimality are not guaranteed by this example; it is tuned for inspectability and type safety, not leaderboard performance.
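As an illustration of the first point, a determinism statement over a pure Lean transition function is essentially definitional. The snippet below is a self-contained toy; its `step` is a stand-in, not this example's real transition function:

```lean
-- Toy determinism statement: a pure Lean step function trivially
-- produces exactly one successor per (state, action) pair.
def step (s : Nat × Nat) (a : Fin 4) : Nat × Nat :=
  match a with
  | ⟨0, _⟩ => (s.1, s.2 + 1)   -- up
  | ⟨1, _⟩ => (s.1, s.2 - 1)   -- down (Nat subtraction clamps at 0)
  | ⟨2, _⟩ => (s.1 - 1, s.2)   -- left
  | _      => (s.1 + 1, s.2)   -- right

theorem step_deterministic (s : Nat × Nat) (a : Fin 4) {s₁ s₂ : Nat × Nat}
    (h₁ : s₁ = step s a) (h₂ : s₂ = step s a) : s₁ = s₂ :=
  h₁.trans h₂.symm
```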
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., GridWorld examples): http://incompleteideas.net/book/the-book-2nd.html
- Puterman, Markov Decision Processes (finite discounted MDPs): https://doi.org/10.1002/9780470316887
Name of this executable target (used in CLI error messages and banners).
Configuration #
Number of discrete actions (up/down/left/right).
PPO rollout horizon (also the training batch size for this example).
Discount factor used in returns / GAE.
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
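For reference, this is the GAE(λ) estimator from Schulman et al. (2015), combining exponentially weighted temporal-difference residuals (γ is the discount factor defined above):

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}
  = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}
```

Setting λ = 0 recovers the one-step TD advantage (lower variance, more bias); λ = 1 recovers the Monte Carlo advantage (less bias, higher variance).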
Number of PPO optimization epochs per collected rollout batch.
Default maximum number of PPO updates (can be overridden by --updates).
Default evaluation checkpoint interval (can be overridden by --eval-every).
Default evaluation episodes per checkpoint (can be overridden by --eval-episodes).
The observation tensor shape used by this example: [..., nStates] one-hot vectors.
Formal GridWorld model (spec/proof layer) #
We define a real-valued GridWorld model and record the proof that its stochastic-MDP view is valid.
This proof is not used by the executable training loop directly; it exists so that downstream theorems about the induced MDP can refer to a concrete environment used in an example.
Start position (top-left cell).
Goal position (bottom-right cell).
A discount factor in [0,1) at the proof layer (ℝ), used to build an MDP instance.
Proof-layer GridWorld instance over ℝ rewards/discounts.
The induced finite stochastic MDP for gwR is well-formed (0 ≤ γ < 1, row-stochastic transitions).
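For intuition, the well-formedness predicate amounts to something like the sketch below. The packaging and names are illustrative only; the real statement is the `Spec.RL.FiniteStochastic.MDP` well-formedness condition in the proof layer, not this structure:

```lean
import Mathlib

-- Illustrative only: NOT the actual well-formedness predicate, just the
-- mathematical content it encodes for an n-state transition kernel P.
structure WellFormedSketch {n : ℕ} (P : Fin n → Fin n → ℝ) (γ : ℝ) : Prop where
  nonneg        : ∀ s s', 0 ≤ P s s'       -- probabilities are nonnegative
  rowStochastic : ∀ s, (∑ s', P s s') = 1  -- each transition row sums to one
  discountRange : 0 ≤ γ ∧ γ < 1            -- discount factor in [0, 1)
```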
Lean-native runtime environment #
We implement a Gym-style environment (Spec.RL.Env) whose observations are one-hot vectors
over the flattened finite state space Fin nStates.
This keeps the PPO code identical to the Gymnasium example: the policy consumes tensors and produces logits over a finite action set.
Observation function: encode the discrete state as a one-hot vector of length nStates.
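A minimal sketch of this encoding, simplified to a plain `Array Float` rather than the example's tensor type:

```lean
-- Sketch of one-hot encoding over a flattened state space; the real
-- observation function produces a tensor, not an `Array Float`.
def oneHotSketch (nStates : Nat) (s : Fin nStates) : Array Float :=
  (Array.range nStates).map (fun i => if i = s.val then 1.0 else 0.0)

#eval oneHotSketch 4 2  -- #[0.000000, 0.000000, 1.000000, 0.000000]
```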
Absolute difference on natural-number coordinates, returned as a Float.
Manhattan distance to the goal.
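These two helpers have roughly the following shape (illustrative names and signatures matching the descriptions above, not the source definitions):

```lean
-- Natural-number absolute difference as a Float, and the Manhattan
-- distance to a goal cell built from it.
def absDiffSketch (a b : Nat) : Float :=
  Float.ofNat (max a b - min a b)

def manhattanSketch (p goal : Nat × Nat) : Float :=
  absDiffSketch p.1 goal.1 + absDiffSketch p.2 goal.2
```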
Deterministic GridWorld transition function with dense progress rewards.
The original sparse reward (-1 per step until terminal) gave short runs too little to learn from: random rollouts rarely reached the goal, so PPO saw almost no informative signal. The shaped reward keeps the same goal-reaching task but gives the learner immediate credit for moving closer to the goal and a small penalty for dithering.
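A hedged sketch of the shaping idea; the constants and signature here are illustrative stand-ins, not the example's actual definition:

```lean
-- Illustrative reward shaping: reward progress toward the goal and
-- charge a small per-step cost so that dithering is penalized.
def shapedRewardSketch (distBefore distAfter : Float) (atGoal : Bool) : Float :=
  if atGoal then
    1.0                              -- terminal bonus for reaching the goal
  else
    (distBefore - distAfter) - 0.01  -- progress credit minus step cost
```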
Lean-native environment packaged as a Spec.RL.Env for reuse with the generic RL runtime.
Trust boundary contract #
Even though this environment is Lean-native, we keep a contract in play to exercise the “checked preconditions” workflow and to keep the interface identical to the external Gymnasium collector.
Model (Actor + Critic) #
Construct the actor network as an MLP mapping one-hot observations to action logits.
Construct the critic network as an MLP mapping one-hot observations to a scalar value estimate.
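The input/output contract both networks must respect can be illustrated with a plain dense layer over arrays. This is not the TorchLean MLP API, only the shapes: the actor maps a length-`nStates` one-hot vector to `nActions` logits, the critic maps the same input to a single value estimate.

```lean
-- Shape illustration only, not TorchLean code: one dense layer.
def dot (a b : Array Float) : Float :=
  (a.zip b).foldl (fun acc p => acc + p.1 * p.2) 0.0

def denseSketch (w : Array (Array Float)) (b : Array Float)
    (x : Array Float) : Array Float :=
  (w.zip b).map (fun rb => dot rb.1 x + rb.2)
```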
Rollout collection (Lean-native environment) #
Collect a fixed-horizon PPO rollout from the Lean-native environment using a checked session.
This is an example-local wrapper around rl.ppo.collectRolloutCheckedSessionWith that packages:
- a `Spec.RL.Env` as a `rl.session.CheckedSession`, and
- the (actor, critic) prediction functions at the observation shape.
A generic sketch of the fixed-horizon loop follows.
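Purely illustrative: the real collector also runs the boundary checker on every transition and records full `Spec.RL.ObservedTransition` data, while this version only shows the loop structure (step, record, reset on episode end).

```lean
-- Generic fixed-horizon rollout sketch over an abstract environment.
def rolloutSketch {S A : Type} (stepFn : S → A → S × Float × Bool)
    (policy : S → A) (start : S) :
    Nat → S → Array (S × A × Float × Bool) → Array (S × A × Float × Bool)
  | 0, _, acc => acc
  | n + 1, s, acc =>
    let a := policy s
    let (s', r, done) := stepFn s a
    rolloutSketch stepFn policy start n (if done then start else s')
      (acc.push (s, a, r, done))
```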
Evaluation #
Evaluation helpers live in NN.API.rl.eval (runtime module NN.Runtime.RL.Eval).
Main Training Loop #
Entry point for lake exe torchlean ppo_gridworld.
This executable (a minimal loop sketch follows this list):
- runs PPO updates against a Lean-native GridWorld environment,
- periodically evaluates the greedy policy and logs the average return,
- writes widget-friendly JSON artifacts (training curve, greedy policy snapshot, greedy path snapshot).
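The loop described above has roughly this shape. All names are illustrative stand-ins; the real entry point also handles CLI parsing and artifact paths:

```lean
-- High-level sketch of the training loop, with IO stand-ins for the
-- real collect / update / evaluate / log stages.
def trainLoopSketch (updates evalEvery : Nat)
    (collectAndUpdate : IO Unit) (evalGreedy : IO Float)
    (logSnapshot : Nat → Float → IO Unit) : IO Unit := do
  for i in [0:updates] do
    collectAndUpdate                 -- rollout collection + PPO epochs
    if (i + 1) % evalEvery == 0 then
      let avgReturn ← evalGreedy     -- greedy-policy evaluation
      logSnapshot (i + 1) avgReturn  -- widget-friendly JSON artifacts
```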