NN.Examples.Models.RL.PPOGridWorld

PPO on Lean-native GridWorld (Executable Demo + Formal Model) #

This example complements torchlean ppo_cartpole: the environment here is defined natively in Lean rather than accessed through the external Gymnasium bridge.

Even though the environment is defined in Lean, we still validate every transition with the boundary checker. That keeps the data model unified: downstream training code consumes Spec.RL.ObservedTransition in a single format regardless of whether the source is a Lean-native environment or an external sampler.
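
As a rough illustration of that single format, here is a hedged sketch of what such a transition record carries; the field names and types are assumptions for this page, not the actual definition of Spec.RL.ObservedTransition:

```lean
-- Illustrative stand-in for Spec.RL.ObservedTransition; the real field
-- names and types may differ.
structure ObservedTransitionSketch where
  obs     : Array Float  -- observation before the step (a one-hot vector here)
  action  : Nat          -- discrete action index
  reward  : Float        -- scalar reward returned by the environment
  nextObs : Array Float  -- observation after the step
  done    : Bool         -- whether the episode terminated
```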

Formal hooks #

  1. The environment has an induced finite stochastic MDP (Spec.RL.FiniteStochastic.MDP) and we import a proof that it is well-formed (row-stochastic transition rows, 0 ≤ γ < 1).
  2. The boundary checker can be turned into a Prop-level hypothesis via Proofs.RL.Boundary.contractHolds_of_checkTransitionFin_eq_ok (see NN/Proofs/RL/Boundary.lean), or you can use the proof-layer Gymnasium wrapper Runtime.RL.Gymnasium.Session.stepCheckedWithProof (NN/Proofs/RL/Gymnasium.lean) for external environments.
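
A minimal, self-contained sketch of the pattern in hook 2, where a successful runtime check is reflected into a Prop-level hypothesis; checkNonNeg and nonneg_of_check are illustrative stand-ins, not the project's API:

```lean
-- A Boolean runtime checker over some precondition …
def checkNonNeg (x : Int) : Bool := decide (0 ≤ x)

-- … whose success can be consumed as a Prop-level hypothesis downstream.
theorem nonneg_of_check (x : Int) (h : checkNonNeg x = true) : 0 ≤ x :=
  of_decide_eq_true h
```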

CLI flags #

Run (from the repo root):

lake exe torchlean ppo_gridworld                                        # default run on CPU
lake build -R -K cuda=true && lake exe torchlean ppo_gridworld --cuda   # rebuild with CUDA enabled, then run on GPU
lake exe torchlean ppo_gridworld --updates 200                          # override the number of PPO updates

Artifacts:

What this example does (and does not) guarantee #

Name of this executable target (used in CLI error messages and banners).

Configuration #

• Grid width (number of columns).
• Grid height (number of rows).
• Total number of discrete states (width * height).
• Number of discrete actions (up/down/left/right).
• Width of the hidden layer in the actor and critic MLPs.
• PPO rollout horizon (also the training batch size for this example).
• Discount factor used in returns / GAE.
• GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates (see the sketch after this list).
• Adam learning rate.
• Number of PPO optimization epochs per collected rollout batch.
• Default maximum number of PPO updates (can be overridden by --updates).
• Default evaluation checkpoint interval (can be overridden by --eval-every).
• Default evaluation episodes per checkpoint (can be overridden by --eval-episodes).
• The observation tensor shape used by this example: [..., nStates] one-hot vectors.
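
To keep the later sketches on this page self-contained, here is a hypothetical set of configuration constants matching the descriptions above (every concrete value is an assumption; only nStates = width * height and the four actions are fixed by the text), followed by a minimal GAE computation showing how γ and λ enter the advantage estimates:

```lean
-- Hypothetical configuration constants (values assumed for illustration;
-- the real definitions live in the PPOGridWorld source):
def width     : Nat := 5                -- grid columns
def height    : Nat := 5                -- grid rows
def nStates   : Nat := width * height   -- flattened discrete state count
def nActions  : Nat := 4                -- up / down / left / right
def hiddenDim : Nat := 64               -- MLP hidden layer width
def horizon   : Nat := 128              -- rollout length = batch size
def gamma     : Float := 0.99           -- discount factor
def lam       : Float := 0.95           -- GAE(λ) parameter

-- GAE over a fixed-horizon rollout, assuming a zero bootstrap value at the
-- horizon: δ_t = r_t + γ·V_{t+1}·(1 − done_t) − V_t and
-- A_t = δ_t + γ·λ·(1 − done_t)·A_{t+1}, computed by a backward scan.
def gaeAdvantages (rewards values : Array Float) (dones : Array Bool) :
    Array Float := Id.run do
  let T := rewards.size
  let mut adv := Array.mkArray T 0.0
  let mut running := 0.0
  for i in [0:T] do
    let t := T - 1 - i
    let mask := if dones[t]! then 0.0 else 1.0
    let nextV := if t + 1 < T then values[t + 1]! else 0.0
    let delta := rewards[t]! + gamma * nextV * mask - values[t]!
    running := delta + gamma * lam * mask * running
    adv := adv.set! t running
  return adv
```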

Formal GridWorld model (spec/proof layer) #

We define a real-valued GridWorld model and record the proof that its stochastic-MDP view is valid.

This proof is not used by the executable training loop directly; it exists so that downstream theorems about the induced MDP can refer to a concrete environment used in an example.

A discount factor in [0,1) at the proof layer, used to build an MDP instance.

Proof-layer GridWorld instance over rewards/discounts.

The induced finite stochastic MDP for gwR is well-formed (0 ≤ γ < 1, row-stochastic transitions).
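
For orientation, here is a hedged Lean sketch of what such a well-formedness statement asserts; FiniteMDP and WellFormed are illustrative names standing in for Spec.RL.FiniteStochastic.MDP and the project's actual predicate:

```lean
import Mathlib

-- Illustrative shape of a finite stochastic MDP and its validity predicate:
-- every transition row is nonnegative and sums to one, and 0 ≤ γ < 1.
structure FiniteMDP (n k : Nat) where
  P : Fin n → Fin k → Fin n → ℝ  -- one probability row per (state, action)
  γ : ℝ

def FiniteMDP.WellFormed {n k : Nat} (m : FiniteMDP n k) : Prop :=
  0 ≤ m.γ ∧ m.γ < 1 ∧
    ∀ s a, (∀ s', 0 ≤ m.P s a s') ∧ (∑ s', m.P s a s') = 1
```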

Lean-native runtime environment #

We implement a Gym-style environment (Spec.RL.Env) whose observations are one-hot vectors over the flattened finite state space Fin nStates.

This keeps the PPO code identical to the Gymnasium example: the policy consumes tensors and produces logits over a finite action set.

Observation function: encode the discrete state as a one-hot vector of length nStates.
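
A sketch of the encoding, reusing the hypothetical nStates constant from the configuration sketch and an Array Float in place of the actual tensor type:

```lean
-- One-hot vector over the flattened state space Fin nStates.
def oneHot (s : Fin nStates) : Array Float :=
  (Array.range nStates).map fun i => if i = s.val then 1.0 else 0.0
```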

Absolute difference on natural-number coordinates, returned as a Float.
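
A sketch of this helper: Nat subtraction truncates at zero, so taking the max of the two one-sided differences yields the absolute difference.

```lean
-- |a - b| on Nat coordinates, returned as a Float.
def natAbsDiff (a b : Nat) : Float := Float.ofNat (max (a - b) (b - a))
```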

Deterministic GridWorld transition function with dense progress rewards.

The original sparse reward (-1 per step until terminal) gave short runs too little to learn from: random rollouts rarely reached the goal, so PPO received almost no useful signal. The shaped reward keeps the same goal-reaching task but gives the learner immediate credit for moving closer to the goal and a small penalty for dithering.
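
A hedged sketch of such a progress reward, assuming Manhattan distance to the goal and an assumed step-penalty constant (reusing natAbsDiff above):

```lean
-- Manhattan distance from (x, y) to the goal (gx, gy).
def manhattan (x y gx gy : Nat) : Float :=
  natAbsDiff x gx + natAbsDiff y gy

-- Positive when the step moved the agent closer to the goal; the constant
-- step cost (0.01, an assumed value) penalizes dithering in place.
def progressReward (distBefore distAfter : Float) : Float :=
  (distBefore - distAfter) - 0.01
```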

Lean-native environment packaged as a Spec.RL.Env for reuse with the generic RL runtime.

Trust boundary contract #

Even though this environment is Lean-native, we keep a contract in play to exercise the “checked preconditions” workflow and to keep the interface identical to the external Gymnasium collector.

Model (Actor + Critic) #

Construct the actor network as an MLP mapping one-hot observations to action logits.

Construct the critic network as an MLP mapping one-hot observations to a scalar value estimate.
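
For orientation, the layer widths implied by the descriptions above, using the hypothetical constants from the configuration sketch (the actual network constructors are torchlean-specific):

```lean
-- Actor: one-hot observation → hidden layer → logits over the four actions.
def actorWidths : List Nat := [nStates, hiddenDim, nActions]

-- Critic: one-hot observation → hidden layer → scalar value estimate.
def criticWidths : List Nat := [nStates, hiddenDim, 1]
```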

Rollout collection (Lean-native environment) #

Collect a fixed-horizon PPO rollout from the Lean-native environment using a checked session.

This is an example-local wrapper around rl.ppo.collectRolloutCheckedSessionWith that packages:

• a Spec.RL.Env as a rl.session.CheckedSession, and
• the (actor, critic) prediction functions at the observation shape.

Evaluation #

Evaluation helpers live in NN.API.rl.eval (runtime module NN.Runtime.RL.Eval).

Main Training Loop #

Entry point for lake exe torchlean ppo_gridworld.

This executable:

• runs PPO updates against a Lean-native GridWorld environment,
• periodically evaluates the greedy policy and logs the average return,
• writes widget-friendly JSON artifacts (training curve, greedy policy snapshot, greedy path snapshot).