PPO on Atari Pong (RAM Observations) (Executable Demo) #

This example mirrors NN/Examples/Models/RL/PPOCartPole.lean but targets an Atari game: Pong, provided by the Arcade Learning Environment (ALE) and registered with Gymnasium as ALE/Pong-v5.

Why "RAM" observations?

Pixel-based Atari PPO is feasible, but a JSON-lines subprocess bridge is the wrong transport for image frames if you want millions of steps per hour. RAM observations (obs_type="ram", a 128-byte vector) keep each bridge message small, which makes this run practical as a native Lean executable.

The key TorchLean interface is the same as in the CartPole example:

The algorithm math (GAE, the PPO clipped objective) is written as Lean definitions.
The autograd program (the PPO loss) is a backend-generic TorchLean program that runs on CPU or CUDA.
The trust boundary is explicit: every externally sampled transition is checked by Runtime.RL.Boundary.Contract before it can influence training.
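As an illustration of the clipped objective named in the list above, here is a minimal sketch in plain Python of the surrogate from Schulman et al. (2017). It is not the Lean definition used by this example: the function name, the list-based inputs, and the clip range of 0.2 are all assumptions made for the sketch.

```python
import math

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate, averaged over the batch.

    new_logp / old_logp are log-probabilities of the taken actions under the
    current policy and the rollout-time policy. clip_eps=0.2 is an assumed
    default, not a value taken from this example's configuration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                      # pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

The minimum makes the objective ignore ratio changes that would push an already-favourable update further, while still penalising moves in the wrong direction.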

Dependencies #

Atari/ALE environments require ale-py and a recent gymnasium:

python3 -m pip install --user "gymnasium>=1.0" ale-py

CLI flags #

--cuda: run the Torch backend on CUDA (requires building with -K cuda=true).
--updates <n>: number of PPO updates to run.
--eval-every <n>: evaluate the greedy policy every n updates.
--eval-episodes <n>: number of evaluation episodes per checkpoint.
--eval-max-steps <n>: maximum steps per evaluation episode.
--log <path>: write the widget log JSON to a custom path.

Run (from the repo root):

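# Install the Python dependencies (quoted so the shell does not treat ">" as a redirect).
python3 -m pip install --user "gymnasium>=1.0" ale-py
# Run on the default (non-CUDA) backend.
lake exe torchlean ppo_pong_ram
# Rebuild with CUDA enabled, then run on the CUDA backend.
lake build -R -K cuda=true && lake exe torchlean ppo_pong_ram --cuda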

Artifacts:

Writes data/rl/ppo_pong_ram_trainlog.json by default (override with --log).
Visualize it in the editor via NN/Examples/RL/PPOPongRamView.lean.

References (primary):

Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning" (REINFORCE, 1992): https://doi.org/10.1007/BF00992696
Machado et al., "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems" (2018): https://arxiv.org/abs/1709.06009
ALE docs (environment catalogue and versioned ALE/...-v5 ids): https://ale.farama.org/
Gymnasium API docs (reset/step, terminated vs truncated): https://gymnasium.farama.org/

def NN.Examples.Models.RL.PPOPongRam.exeName :

String

Name of this executable target (used in CLI error messages and banners).

Configuration #

def NN.Examples.Models.RL.PPOPongRam.envId :

String

Atari environment id passed to the Python subprocess.

def NN.Examples.Models.RL.PPOPongRam.gymServerScript :

String

Relative path to the Python Gymnasium bridge script (spawned as a subprocess).

def NN.Examples.Models.RL.PPOPongRam.stateDim :

ℕ

Pong RAM observation dimension.

Gymnasium exposes RAM as Box(0, 255, (128,), uint8) when obs_type="ram".

def NN.Examples.Models.RL.PPOPongRam.nActions :

ℕ

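Number of discrete actions in Pong under ALE's reduced (minimal) action set, i.e. with full_action_space=False. Gymnasium reports this action space as Discrete(6) for ALE/Pong-v5.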

def NN.Examples.Models.RL.PPOPongRam.hiddenDim :

ℕ

Width of the hidden layer in the actor and critic MLPs.

def NN.Examples.Models.RL.PPOPongRam.horizon :

ℕ

PPO rollout horizon: the number of environment steps collected per update, which is also the training batch size for this run.

def NN.Examples.Models.RL.PPOPongRam.gamma :

Float

Discount factor used in returns / GAE.

def NN.Examples.Models.RL.PPOPongRam.lam :

Float

GAE(λ) parameter controlling the bias/variance tradeoff of advantage estimates.
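To show how gamma and lam enter the advantage estimate, here is a minimal plain-Python sketch of GAE(λ) following Schulman et al. (2015). It is not the Lean definition used by this example; the function name, argument layout, and the default values 0.99 and 0.95 are assumptions for illustration.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length len(rewards).

    values[t] is V(s_t); last_value is V(s_T) and bootstraps the final step.
    dones[t] is True when the episode terminated at step t, in which case
    nothing is bootstrapped or accumulated across that boundary.
    The gamma/lam defaults are assumed, not read from this example.
    """
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error, then the exponentially weighted backward sum.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With lam = 0 the advantages reduce to one-step TD errors (low variance, more bias); with lam = 1 they become discounted Monte Carlo returns minus the value baseline (less bias, more variance).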

def NN.Examples.Models.RL.PPOPongRam.lr :

Float

Adam learning rate.

def NN.Examples.Models.RL.PPOPongRam.updateEpochs :

ℕ

Number of PPO optimization epochs per collected rollout batch.

def NN.Examples.Models.RL.PPOPongRam.updatesMax :

ℕ

Default maximum number of PPO updates (override with --updates).

def NN.Examples.Models.RL.PPOPongRam.evalEvery :

ℕ

Default evaluation checkpoint interval (override with --eval-every).

def NN.Examples.Models.RL.PPOPongRam.evalEpisodes :

ℕ

Default evaluation episodes per checkpoint (override with --eval-episodes).

instance NN.Examples.Models.RL.PPOPongRam.instFactLtNatOfNatHorizon :

Fact (0 < horizon)

instance NN.Examples.Models.RL.PPOPongRam.instFactLtNatOfNatNActions :

Fact (0 < nActions)

def NN.Examples.Models.RL.PPOPongRam.obsShape :

Shape

The observation tensor shape used by this run: [..., stateDim].

def NN.Examples.Models.RL.PPOPongRam.pfxBatch :

Shape

def NN.Examples.Models.RL.PPOPongRam.sStateBatch :

Shape

def NN.Examples.Models.RL.PPOPongRam.sLogitsBatch :

Shape

def NN.Examples.Models.RL.PPOPongRam.sScalarBatch :

Shape

def NN.Examples.Models.RL.PPOPongRam.sValueBatch :

Shape

def NN.Examples.Models.RL.PPOPongRam.sState1 :

Shape

def NN.Examples.Models.RL.PPOPongRam.sLogits1 :

Shape

def NN.Examples.Models.RL.PPOPongRam.sValue1 :

Shape

Model (Actor + Critic) #

We use MLPs over RAM. For pixel observations you would typically use a CNN (see NN.GraphSpec.Models.TorchLean.Cnn) and wrap the environment with Atari preprocessing.
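For orientation, the shape of an MLP actor over RAM can be sketched in plain Python as below. This is not the TorchLean model built by actorMk: the input scaling by 255, the tanh activation, the hidden width of 64, the weight initialisation, and all helper names are assumptions chosen only to make the sketch runnable.

```python
import math
import random

STATE_DIM = 128   # RAM bytes per observation
N_ACTIONS = 6     # Pong's minimal action set
HIDDEN_DIM = 64   # assumption: illustrative width only

def init_linear(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, x):
    """Apply one dense layer: weights @ x + bias."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def actor_logits(params, ram):
    """Map one RAM observation (128 ints in 0..255) to action logits."""
    x = [byte / 255.0 for byte in ram]                      # assumed scaling
    hidden = [math.tanh(h) for h in linear(params[0], x)]   # assumed activation
    return linear(params[1], hidden)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
actor = (init_linear(STATE_DIM, HIDDEN_DIM, rng),
         init_linear(HIDDEN_DIM, N_ACTIONS, rng))
probs = softmax(actor_logits(actor, [0] * STATE_DIM))
```

A critic has the same structure with a single output unit in place of N_ACTIONS.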

def NN.Examples.Models.RL.PPOPongRam.modelCfg :

API.nn.models.PPOActorCriticConfig

def NN.Examples.Models.RL.PPOPongRam.actorMk (pfx : Shape) :

API.nn.M (API.nn.Sequential (Spec.Shape.appendDim pfx stateDim) (Spec.Shape.appendDim pfx nActions))

Construct the actor network as an MLP mapping RAM observations to action logits.

def NN.Examples.Models.RL.PPOPongRam.criticMk (pfx : Shape) :

API.nn.M (API.nn.Sequential (Spec.Shape.appendDim pfx stateDim) (Spec.Shape.appendDim pfx 1))

Construct the critic network as an MLP mapping RAM observations to a scalar value estimate.

Gymnasium / ALE bridge #

We request RAM observations by passing {"obs_type": "ram"} to gym.make through the bridge's --make-kwargs option. The server also auto-registers ale_py when envId starts with ALE/.
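On the Python side, the behaviour described above might look roughly like the sketch below. It is not the actual bridge script: the helper names make_env and encode_make_kwargs are hypothetical, and only gymnasium.make, gymnasium.register_envs, and the obs_type keyword are taken from the Gymnasium and ALE APIs.

```python
import json

def encode_make_kwargs(pairs):
    """Serialise (key, value) pairs to the JSON object passed via --make-kwargs."""
    return json.dumps(dict(pairs))

def make_env(env_id, make_kwargs_json):
    """Create a Gymnasium environment from an id and a JSON kwargs string."""
    # Imported lazily so the encoding helper above works without gymnasium.
    import gymnasium as gym

    kwargs = json.loads(make_kwargs_json)
    if env_id.startswith("ALE/"):
        import ale_py
        # Registers the ALE/... ids so gym.make can resolve them.
        gym.register_envs(ale_py)
    return gym.make(env_id, **kwargs)

kwargs_json = encode_make_kwargs([("obs_type", "ram")])
```

With gymnasium and ale-py installed, make_env("ALE/Pong-v5", kwargs_json) should yield an environment whose observation space is the 128-byte RAM vector.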

def NN.Examples.Models.RL.PPOPongRam.makeKwargs :

List (String × Lean.Json)

def NN.Examples.Models.RL.PPOPongRam.contract :

Runtime.RL.Boundary.Contract obsShape nActions
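The contract is indexed by the observation shape and the action count, which suggests the kind of checks it can enforce on data arriving from the subprocess. The sketch below illustrates that idea in plain Python. It is an assumption about what such a boundary check could cover, not a description of Runtime.RL.Boundary.Contract, and the function name and message strings are hypothetical.

```python
import math

def check_transition(obs, action, reward, terminated, truncated,
                     obs_dim=128, n_actions=6):
    """Return a list of violations; an empty list means the transition passes."""
    problems = []
    if len(obs) != obs_dim:
        problems.append(f"obs has length {len(obs)}, expected {obs_dim}")
    elif any(isinstance(b, bool) or not isinstance(b, int) or not 0 <= b <= 255
             for b in obs):
        problems.append("obs contains a value outside the uint8 range 0..255")
    if (isinstance(action, bool) or not isinstance(action, int)
            or not 0 <= action < n_actions):
        problems.append(f"action {action!r} is not in range(0, {n_actions})")
    if (isinstance(reward, bool) or not isinstance(reward, (int, float))
            or not math.isfinite(reward)):
        problems.append(f"reward {reward!r} is not a finite number")
    if not isinstance(terminated, bool) or not isinstance(truncated, bool):
        problems.append("terminated/truncated must be booleans")
    return problems
```

Rejecting a malformed transition before it reaches the rollout buffer keeps NaN rewards or mis-shaped observations from silently corrupting advantage estimates.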

Main Training Loop #

def NN.Examples.Models.RL.PPOPongRam.main (args : List String) :

IO UInt32
