TorchLean API

NN.Examples.Models.RL.PPOCartPole

PPO on Gymnasium CartPole (Executable Demo) #

This example is intentionally “small but complete”:

More concretely:

CLI flags #

Run (from the repo root):

python3 -m pip install --user 'gymnasium>=1.0'
lake exe torchlean ppo_cartpole
lake build -R -K cuda=true && lake exe torchlean ppo_cartpole --cuda

Artifacts:

What this run does (and does not) guarantee #

References (primary):

Name of this executable target (used in CLI error messages and banners).

Instances For

    Configuration #

    We keep this example discrete-action and small (CartPole) so it runs quickly in a native Lean executable.

    Gymnasium environment id passed to the Python subprocess (see Gymnasium docs for supported ids).

    Instances For

      Relative path to the Python Gymnasium bridge script (spawned as a subprocess).

      Instances For

        Observation vector dimension for CartPole (Gymnasium reports 4 floats).

        Instances For

          Number of discrete actions for CartPole (left/right).

          Instances For

            Width of the hidden layer in the actor and critic MLPs.

            Instances For

              PPO rollout horizon (also the training batch size for this run).

              Instances For

                Discount factor used in returns / GAE.

                Instances For

                  GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.

                  Instances For

                    Adam learning rate.

                    Instances For

                      Number of PPO optimization epochs per collected rollout batch.

                      Instances For

                        Maximum number of PPO updates (training stops early if the "solved" criterion triggers).

                        Instances For

                          Evaluate (greedy policy) every evalEvery PPO updates.

                          Instances For

                            Number of evaluation episodes per checkpoint.

                            Instances For

                              Stop early if average return meets/exceeds this threshold.

                              Instances For

                                The observation tensor shape used by this run: [..., stateDim].

                                Instances For

                                  Model (Actor + Critic) #

                                  We use the public API.nn facade, which provides "prefix-shape preserving" layers: if x has shape [..., inDim], nn.linear inDim outDim (pfx := ...) maps it to [..., outDim].

                                  Construct the actor network as an MLP mapping observations to action logits.

                                  Instances For

                                    Construct the critic network as an MLP mapping observations to a scalar value estimate.

                                    Instances For

                                      Gymnasium Bridge #

                                      We talk to a small Python helper (scripts/rl/gymnasium_server.py) using the reusable runtime bridge in Runtime.RL.Gymnasium (exposed as rl.gym.*).

                                      The Lean-side trust-boundary contract (rl.boundary.Contract) is enforced on every step.

                                      Evaluation #

                                      Evaluation helpers live in NN.API.rl.eval (runtime module NN.Runtime.RL.Eval).

                                      Main Training Loop #

                                      Entry point for lake exe torchlean ppo_cartpole.

                                      This executable:

                                      • launches a Python Gymnasium subprocess for CartPole-v1,
                                      • collects checked rollouts under rl.boundary.Contract,
                                      • performs PPO updates on the Torch backend (CPU or CUDA via --cuda),
                                      • writes a widget-friendly training curve JSON (default: data/rl/ppo_cartpole_trainlog.json).
                                      Instances For