PPO on Atari Pong (RAM Observations) (Executable Demo) #
This example mirrors `NN/Examples/Models/RL/PPOCartPole.lean`, but targets an Atari game via the
Arcade Learning Environment (ALE), registered into Gymnasium as `ALE/Pong-v5`.
Why "RAM" observations?
- Pixel-based Atari PPO is absolutely doable, but a JSON-lines subprocess bridge is not the right
  transport if you want millions of steps/hour. RAM observations (`obs_type="ram"`, shape `(128,)`)
  keep the bridge lightweight and make this run viable as a native Lean executable.
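For concreteness, this is what the RAM variant looks like from plain Python; a minimal sketch assuming `gymnasium >= 1.0` and `ale-py` are installed (see Dependencies below):

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)                      # expose the ALE/ namespace
env = gym.make("ALE/Pong-v5", obs_type="ram")  # RAM bytes instead of pixels
print(env.observation_space)                   # Box(0, 255, (128,), uint8)
print(env.action_space)                        # Discrete(6): reduced action set
env.close()
```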
The key TorchLean interface remains the same:
- Algorithm math (GAE, the PPO clipped objective) is written as Lean definitions; see the formulas after this list.
- Autograd program (PPO loss) is a TorchLean backend-generic program (CPU or CUDA).
- Trust boundary is explicit: every externally sampled transition is checked by
  `Runtime.RL.Boundary.Contract` before it can influence training.
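Concretely, the Lean definitions encode the standard estimators from the Schulman et al. papers cited below, in the usual notation:

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}

L^{\mathrm{CLIP}}(\theta) =
\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```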
Dependencies #
Atari/ALE environments require `ale-py` and a recent `gymnasium`:
python3 -m pip install --user "gymnasium>=1.0" ale-py
CLI flags #
- `--cuda`: run the Torch backend on CUDA (requires building with `-K cuda=true`).
- `--updates <n>`: number of PPO updates to run.
- `--eval-every <n>`: evaluate the greedy policy every `n` updates.
- `--eval-episodes <n>`: number of evaluation episodes per checkpoint.
- `--eval-max-steps <n>`: maximum steps per evaluation episode.
- `--log <path>`: write the widget log JSON to a custom path.
Run (from the repo root):
python3 -m pip install --user "gymnasium>=1.0" ale-py
lake exe torchlean ppo_pong_ram
lake build -R -K cuda=true && lake exe torchlean ppo_pong_ram --cuda
Artifacts:
- Writes `data/rl/ppo_pong_ram_trainlog.json` by default (override with `--log`); a quick way to inspect it is sketched below.
- Visualize it in the editor via `NN/Examples/RL/PPOPongRamView.lean`.
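The log's schema is not documented here, so this sketch stays schema-agnostic: it only confirms the file exists and shows its top-level structure (the path is the default from the list above):

```python
import json

with open("data/rl/ppo_pong_ram_trainlog.json") as f:
    log = json.load(f)

# Print the top-level shape without assuming any particular schema.
if isinstance(log, dict):
    print("dict with keys:", list(log)[:10])
else:
    print(f"list with {len(log)} entries")
```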
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Machado et al., "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems" (2018): https://arxiv.org/abs/1709.06009
- ALE docs (environment catalogue and versioned `ALE/...-v5` ids): https://ale.farama.org/
- Gymnasium API docs (reset/step, `terminated` vs `truncated`): https://gymnasium.farama.org/
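The `terminated` vs `truncated` distinction in the last reference matters for PPO: only a truncated episode should still bootstrap from the value estimate of the next state. A minimal Gymnasium loop that keeps the two apart (illustrative, not the bridge's actual code):

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)
env = gym.make("ALE/Pong-v5", obs_type="ram")
obs, info = env.reset(seed=0)
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated   # `truncated` = time limit, not a true episode end
env.close()
```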
Name of this executable target (used in CLI error messages and banners).
Configuration #
Atari environment id passed to the Python subprocess.
Relative path to the Python Gymnasium bridge script (spawned as a subprocess).
Pong RAM observation dimension.
Gymnasium exposes RAM as `Box(0, 255, (128,), uint8)` when `obs_type="ram"`.
Number of discrete actions in Pong under ALE's reduced action set.
PPO rollout horizon (also the training batch size for this run).
Discount factor used in returns / GAE.
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
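For readers who want the estimator in executable form, here is a minimal NumPy sketch of GAE(λ); it mirrors the formulas given earlier and is not the Lean definition (the `gamma`/`lam` defaults are typical PPO values, not necessarily this run's configuration):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards/dones have length T; values has length T+1 (bootstrap at the end)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])          # zero out the bootstrap at true ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv
```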
Number of PPO optimization epochs per collected rollout batch.
Default maximum number of PPO updates (override with `--updates`).
Default evaluation checkpoint interval (override with `--eval-every`).
Default evaluation episodes per checkpoint (override with `--eval-episodes`).
The observation tensor shape used by this run: `[..., stateDim]`.
Model (Actor + Critic) #
We use MLPs over RAM. For pixel observations you would typically use a CNN (see
`NN.GraphSpec.Models.TorchLean.Cnn`) and wrap the environment with Atari preprocessing.
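For intuition only, a PyTorch analogue of the two heads might look like the sketch below. The hidden width, activation choice, and scaling of RAM bytes to `[0, 1]` are illustrative assumptions, not the actual TorchLean graphs:

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 128, 6   # RAM bytes / Pong's reduced action set

class Actor(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.Tanh(),
            nn.Linear(hidden, NUM_ACTIONS),   # action logits
        )

    def forward(self, ram: torch.Tensor) -> torch.Tensor:
        return self.net(ram / 255.0)          # scale uint8 RAM to [0, 1]

class Critic(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),             # scalar value estimate
        )

    def forward(self, ram: torch.Tensor) -> torch.Tensor:
        return self.net(ram / 255.0).squeeze(-1)
```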
Construct the actor network as an MLP mapping RAM observations to action logits.
Construct the critic network as an MLP mapping RAM observations to a scalar value estimate.
Gymnasium / ALE bridge #
We request RAM observations by passing `{"obs_type": "ram"}` to `gym.make` through the bridge's
`--make-kwargs` option. The server also auto-registers `ale_py` when `envId` starts with `ALE/`.
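A minimal sketch of the environment construction that prose describes, on the Python side of the bridge; the `--env-id` flag name is an illustrative assumption (only `--make-kwargs` is named above):

```python
import argparse
import json

import gymnasium as gym

parser = argparse.ArgumentParser()
parser.add_argument("--env-id", required=True)          # e.g. ALE/Pong-v5
parser.add_argument("--make-kwargs", default="{}")      # JSON, e.g. '{"obs_type": "ram"}'
args = parser.parse_args()

# Auto-register ale_py for ALE/ ids, as the bridge is described to do.
if args.env_id.startswith("ALE/"):
    import ale_py
    gym.register_envs(ale_py)

env = gym.make(args.env_id, **json.loads(args.make_kwargs))
```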