PPO on Atari Pong (RAM Observations) (Executable Example) #
This example mirrors NN/Examples/Models/RL/PPOCartPole.lean, but targets an Atari game via the
Arcade Learning Environment (ALE) registered into Gymnasium as ALE/Pong-v5.
Why "RAM" observations?
- Pixel-based Atari PPO is absolutely doable, but a JSON-lines subprocess bridge is not the right
transport if you want millions of steps/hour. RAM observations (obs_type="ram", shape (128,))
keep the bridge compact and make this run viable as a native Lean executable.
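To make the size argument concrete, the sketch below compares the JSON-line size of one RAM observation with one raw RGB frame. It assumes a hypothetical message shape `{"obs": [...]}` (the bridge's actual wire format is not shown on this page) and the default ALE frame of 210 x 160 x 3 bytes.

```python
import json

RAM_SIZE = 128               # Atari 2600 RAM bytes exposed by obs_type="ram"
FRAME_SHAPE = (210, 160, 3)  # default ALE RGB frame: height, width, channels


def json_line_bytes(num_values: int, value: int = 255) -> int:
    """Bytes in one JSON line carrying `num_values` integers.

    Uses the worst-case three-digit value 255 and a hypothetical
    {"obs": [...]} message; the real bridge protocol may differ.
    """
    line = json.dumps({"obs": [value] * num_values}, separators=(",", ":")) + "\n"
    return len(line.encode("utf-8"))


ram_bytes = json_line_bytes(RAM_SIZE)
pixel_values = FRAME_SHAPE[0] * FRAME_SHAPE[1] * FRAME_SHAPE[2]
pixel_bytes = json_line_bytes(pixel_values)

print(f"RAM observation:  {ram_bytes} bytes per step")
print(f"Raw RGB frame:    {pixel_bytes} bytes per step")
print(f"Ratio (rounded down): {pixel_bytes // ram_bytes}x")
```

Under these assumptions a RAM observation costs about half a kilobyte per step, while an unprocessed frame costs about 400 KB, before any JSON parsing overhead on the Lean side.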
The key TorchLean interface remains the same:
- The algorithm math (GAE, PPO clipped objective) is written as Lean definitions.
- The autograd program (PPO loss) is a backend-generic TorchLean program (CPU or CUDA).
- The trust boundary is explicit: every externally sampled transition is checked by
Runtime.RL.Boundary.Contract before it can influence training.
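The kind of check a trust boundary performs can be illustrated with a small Python sketch. This is not the implementation of Runtime.RL.Boundary.Contract, whose actual rules are not shown on this page; the `Transition` type, the `check_transition` helper, and the default of 6 actions are all hypothetical and only illustrate the idea of rejecting malformed transitions before training sees them.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    """Hypothetical record for one externally sampled environment step."""
    obs: List[int]
    action: int
    reward: float
    next_obs: List[int]
    terminated: bool
    truncated: bool


def _valid_ram(obs: List[int], state_dim: int) -> bool:
    """True if `obs` has the expected length and every entry is a byte."""
    return len(obs) == state_dim and all(
        isinstance(b, int) and not isinstance(b, bool) and 0 <= b <= 255
        for b in obs
    )


def check_transition(t: Transition, state_dim: int = 128, num_actions: int = 6) -> List[str]:
    """Return a list of violations; an empty list means the transition passes.

    `num_actions=6` is an assumption for Pong's reduced action set.
    """
    errors = []
    if not _valid_ram(t.obs, state_dim):
        errors.append("obs is not a RAM vector of the expected shape")
    if not _valid_ram(t.next_obs, state_dim):
        errors.append("next_obs is not a RAM vector of the expected shape")
    if not (isinstance(t.action, int) and 0 <= t.action < num_actions):
        errors.append("action is outside the discrete action range")
    if not (isinstance(t.reward, (int, float)) and math.isfinite(t.reward)):
        errors.append("reward is not a finite number")
    if not (isinstance(t.terminated, bool) and isinstance(t.truncated, bool)):
        errors.append("terminated/truncated must be booleans")
    return errors


ok = Transition([0] * 128, 2, 1.0, [255] * 128, False, False)
print(check_transition(ok))  # an empty list means nothing was rejected
```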
Dependencies #
Atari/ALE environments require ale-py and a recent gymnasium:
python3 -m pip install --user 'gymnasium>=1.0' ale-py
CLI flags #
- --cuda: run the Torch backend on CUDA (requires building with -K cuda=true).
- --updates <n>: number of PPO updates to run.
- --eval-every <n>: evaluate the greedy policy every n updates.
- --eval-episodes <n>: number of evaluation episodes per checkpoint.
- --eval-max-steps <n>: maximum steps per evaluation episode.
- --log <path>: write the widget log JSON to a custom path.
This module is optional. It depends on a compatible external ALE/Gymnasium installation and is not
part of the default torchlean runner quick-check list.
Artifacts:
- Writes data/rl/ppo_pong_ram_trainlog.json by default (override with --log).
- Visualize it in the editor via NN/Examples/RL/PPOPongRamView.lean.
References (primary):
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
- Machado et al., "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems" (2018): https://arxiv.org/abs/1709.06009
- ALE docs (environment catalogue and versioned ALE/...-v5 ids): https://ale.farama.org/
- Gymnasium API docs (reset/step, terminated vs truncated): https://gymnasium.farama.org/
Name used in CLI error messages and banners when the optional runner is wired in.
Help text for the optional ALE/Pong RAM PPO runner.
Configuration #
Atari environment id passed to the Python subprocess.
Relative path to the Python Gymnasium bridge script (spawned as a subprocess).
Pong RAM observation dimension.
Gymnasium exposes RAM as Box(0, 255, (128,), uint8) when obs_type="ram".
Number of discrete actions in Pong under ALE's reduced action set.
PPO rollout horizon (also the training batch size for this run).
Discount factor used in returns / GAE.
GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
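As an illustration of how the discount factor and λ interact, here is a minimal Python sketch of the backward GAE recursion from Schulman et al. (2015). It is not the Lean definition used by this module; the function name, the default values `gamma=0.99` and `lam=0.95`, and the simplification of treating every episode end as "no bootstrap" are assumptions.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,  # hypothetical default, not taken from this module
    lam: float = 0.95,    # hypothetical default, not taken from this module
) -> List[float]:
    """Generalized Advantage Estimation over one rollout.

    delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}

    `last_value` is the critic's estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With lam = 0 the estimate collapses to the one-step TD error (low variance,
# more bias); with lam = 1 it becomes the discounted return minus the value
# baseline (less bias, more variance).
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [False, False, False], 0.0,
                     gamma=1.0, lam=1.0)
print(adv)
```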
Adam learning rate used for the Pong RAM actor-critic update.
Number of PPO optimization epochs per collected rollout batch.
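Running several epochs over the same rollout means the updated policy drifts away from the policy that collected the data, which is what the PPO clipped objective guards against. The sketch below shows that objective in plain Python as described by Schulman et al. (2017); it is not the TorchLean autograd program, and the function name and `clip_eps=0.2` are assumptions.

```python
import math
from typing import Sequence


def ppo_clip_loss(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # hypothetical default, not taken from this module
) -> float:
    """Negative PPO clipped surrogate, averaged over the batch.

    ratio_t = exp(new_logp_t - old_logp_t)
    L       = -mean(min(ratio_t * A_t, clip(ratio_t, 1 - eps, 1 + eps) * A_t))
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# On the first epoch the ratio is 1 everywhere; on later epochs it moves, and
# the clip stops a positive-advantage sample from being rewarded further once
# the ratio exceeds 1 + clip_eps.
print(ppo_clip_loss([0.5], [0.0], [1.0]))
```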
Default maximum number of PPO updates (override with --updates).
Default evaluation checkpoint interval (override with --eval-every).
Default evaluation episodes per checkpoint (override with --eval-episodes).
The observation tensor shape used by this run: [..., stateDim].
Model (Actor + Critic) #
We use MLPs over RAM. For pixel observations you would typically use a CNN (see
NN.GraphSpec.Models.TorchLean.Cnn) and wrap the environment with Atari preprocessing.
Construct the actor network as an MLP mapping RAM observations to action logits.
Construct the critic network as an MLP mapping RAM observations to a scalar value estimate.
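For readers who want to see the shape of the computation, the following stdlib-only Python sketch runs a forward pass through a small actor MLP (RAM to action logits) and critic MLP (RAM to a scalar). The hidden width of 64, the tanh activation, the division by 255, and the 6-action output are assumptions made for illustration; the layer sizes actually used by this module are not shown on this page.

```python
import math
import random

STATE_DIM, HIDDEN, NUM_ACTIONS = 128, 64, 6  # HIDDEN and NUM_ACTIONS are assumed


def init_linear(rng, n_in, n_out):
    """Random weights (n_out x n_in) and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def mlp(layers, x):
    """Apply tanh after every layer except the last."""
    for layer in layers[:-1]:
        x = [math.tanh(v) for v in linear(layer, x)]
    return linear(layers[-1], x)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
actor = [init_linear(rng, STATE_DIM, HIDDEN), init_linear(rng, HIDDEN, NUM_ACTIONS)]
critic = [init_linear(rng, STATE_DIM, HIDDEN), init_linear(rng, HIDDEN, 1)]

ram = [rng.randrange(256) for _ in range(STATE_DIM)]  # stand-in for a real RAM observation
x = [b / 255.0 for b in ram]                          # scale bytes into [0, 1]

probs = softmax(mlp(actor, x))  # action distribution from the actor's logits
value = mlp(critic, x)[0]       # scalar value estimate from the critic
print(len(probs), value)
```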
Gymnasium / ALE bridge #
We request RAM observations by passing {"obs_type": "ram"} to gym.make through the bridge's
--make-kwargs option. The server also auto-registers ale_py when envId starts with ALE/.
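On the Python side, the behaviour described above could look roughly like the sketch below. The helper names `parse_make_kwargs`, `needs_ale`, and `make_env` are hypothetical and the real bridge script may be organised differently; `gym.register_envs(ale_py)` and `gym.make(..., obs_type="ram")` are the documented Gymnasium 1.x and ALE calls.

```python
import json
from typing import Any, Dict


def parse_make_kwargs(raw: str) -> Dict[str, Any]:
    """Decode the JSON object passed via --make-kwargs (empty string means no kwargs)."""
    kwargs = json.loads(raw) if raw else {}
    if not isinstance(kwargs, dict):
        raise ValueError("--make-kwargs must be a JSON object")
    return kwargs


def needs_ale(env_id: str) -> bool:
    """ALE environments live under the ALE/ namespace, e.g. ALE/Pong-v5."""
    return env_id.startswith("ALE/")


def make_env(env_id: str, raw_kwargs: str):
    """Create the environment; imports are local so the helpers above work without Gymnasium."""
    import gymnasium as gym

    if needs_ale(env_id):
        import ale_py

        gym.register_envs(ale_py)  # registers the ALE/... ids with Gymnasium
    return gym.make(env_id, **parse_make_kwargs(raw_kwargs))


print(parse_make_kwargs('{"obs_type": "ram"}'), needs_ale("ALE/Pong-v5"))
```

With gymnasium and ale-py installed, `make_env("ALE/Pong-v5", '{"obs_type": "ram"}')` should return an environment whose observation space is the 128-byte RAM box.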
Run the smallest useful ALE smoke check.
This exercises the same Gymnasium subprocess, ALE registration, RAM observation shape handshake, and Lean-side boundary contract as the full PPO runner, but avoids collecting a 128-step rollout.
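The idea of such a smoke check can be sketched in Python against any object that follows Gymnasium's reset/step signatures. The `StubRamEnv` class below is a stand-in so the example runs without ALE installed, and `smoke_check` with its three-step default is hypothetical; the actual check runs through the Lean executable and the subprocess bridge.

```python
import random


class StubRamEnv:
    """Stand-in with Gymnasium-style reset/step; it is NOT the ALE environment."""

    def __init__(self):
        self._rng = random.Random(0)
        self._t = 0

    def _obs(self):
        return [self._rng.randrange(256) for _ in range(128)]

    def reset(self, seed=None):
        if seed is not None:
            self._rng = random.Random(seed)
        self._t = 0
        return self._obs(), {}

    def step(self, action):
        self._t += 1
        truncated = self._t >= 4  # arbitrary short time limit for the stub
        return self._obs(), 0.0, False, truncated, {}


def smoke_check(env, state_dim=128, steps=3):
    """Reset, take a few steps, and verify the observation shape each time.

    Returns the number of steps taken. Action 0 is NOOP in ALE.
    """
    obs, _info = env.reset(seed=0)
    if len(obs) != state_dim:
        raise ValueError(f"expected {state_dim} RAM bytes, got {len(obs)}")
    taken = 0
    for _ in range(steps):
        obs, _reward, terminated, truncated, _info = env.step(0)
        taken += 1
        if len(obs) != state_dim:
            raise ValueError(f"expected {state_dim} RAM bytes, got {len(obs)}")
        if terminated or truncated:
            break
    return taken


print(smoke_check(StubRamEnv()))
```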