RL Evaluation Helpers (Executable Runtime) #
This module factors out small non-differentiable evaluation helpers used in executable RL workflows:
- choosing a greedy discrete action (argmax) from policy logits, and
- running episodes to compute the (undiscounted) total reward.

These helpers are intentionally written against the unified checked-session interface Runtime.RL.Session.CheckedSession, so the same evaluation code can be reused with:
- external Python Gymnasium environments (via Runtime.RL.Gymnasium), and
- Lean-native environments (Spec.RL.Env) wrapped by Runtime.RL.Session.CheckedSession.ofEnv.
Semantics and references #
- We treat an episode as finished when terminated || truncated, mirroring Gymnasium's API. See the Gymnasium docs: https://gymnasium.farama.org/
- The evaluation metric returned by this module is the undiscounted sum of rewards along the episode, which is the common “episode return” metric used in Gym-style benchmarks. See Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), on episodic return and discounting: http://incompleteideas.net/book/the-book-2nd.html
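The two semantic choices above can be stated in isolation as a small Lean sketch. The StepResult record and episodeReturn below are illustrative names, not part of this module; the real code reads these flags from a checked-session transition.

```lean
/-- Illustrative step record (hypothetical; the real module reads these
    flags from a `CheckedSession` transition). -/
structure StepResult where
  reward     : Float
  terminated : Bool
  truncated  : Bool

/-- An episode is finished exactly when `terminated || truncated`. -/
def StepResult.done (s : StepResult) : Bool :=
  s.terminated || s.truncated

/-- The episode return: the undiscounted sum of rewards, i.e. the
    discounted return with discount factor 1. -/
def episodeReturn (steps : List StepResult) : Float :=
  steps.foldl (fun acc s => acc + s.reward) 0.0
```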
Greedy action selection #
Pick a greedy discrete action from a logits vector by argmax.
If nActions = 0 then Fin nActions is uninhabited, so we require [Fact (0 < nActions)].
When nActions > 0, Metrics.argmax? never returns none; the none branch returns 0 only as
an unreachable totality fallback.
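As a standalone illustration of the greedy rule, the sketch below picks the index of the largest logit over a raw array, returning 0 for an empty input as the totality fallback described above. This is not the module's Metrics.argmax? (which returns an Option and is paired with the Fact (0 < nActions) instance); names here are hypothetical.

```lean
/-- Illustrative greedy selection over raw logits: index of the largest
    entry, or `0` for an empty array as a totality fallback. -/
def greedyAction (logits : Array Float) : Nat := Id.run do
  let mut best := 0
  for i in [1:logits.size] do
    if logits[i]! > logits[best]! then
      best := i
  return best
```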
Episode evaluation #
Run one episode (up to maxSteps) using a greedy policy derived from policyLogits, and return
the total (undiscounted) reward.
The episode terminates early when the checked transition reports terminated || truncated.
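The rollout structure can be sketched against a hypothetical minimal environment. The real implementation runs through CheckedSession (typically in a monad, with each transition checked); SimpleEnv and its step shape below are assumptions for illustration only.

```lean
/-- Hypothetical minimal environment: `step` returns the next state,
    a reward, and a single `done` flag (`terminated || truncated`
    collapsed). -/
structure SimpleEnv (σ : Type) where
  start : σ
  step  : σ → Nat → σ × Float × Bool

/-- Greedy rollout for up to `maxSteps` steps, summing rewards and
    stopping early when `done` is reported. -/
def totalRewardSketch {σ : Type} (env : SimpleEnv σ)
    (policy : σ → Nat) (maxSteps : Nat) : Float := Id.run do
  let mut s := env.start
  let mut total := 0.0
  for _ in [0:maxSteps] do
    let (s', r, done) := env.step s (policy s)
    total := total + r
    s := s'
    if done then
      break
  return total
```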
Run one greedy-policy episode (up to maxSteps) and record the visited session states.
The returned path includes the initial session state from sess.start, and then each state after
every checked step.
Note: for Session.CheckedSession.ofEnv, Sess is the Lean-native environment's latent state.
For Gymnasium-backed sessions, Sess is the session record (it stores the last observation, etc.),
not the underlying Python environment's internal state.
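The path-recording behavior (initial state first, then the state after every step) can be sketched as follows. The step signature here is a hypothetical stand-in; in the module, the recorded states are the Sess values described above.

```lean
/-- Record the visited states of a rollout: the path starts with the
    initial state and appends the state after every step (hypothetical
    `step` shape, for illustration). -/
def episodePathSketch {σ : Type} (start : σ)
    (step : σ → σ × Bool) (maxSteps : Nat) : Array σ := Id.run do
  let mut s := start
  let mut path := #[start]  -- the path includes the initial state
  for _ in [0:maxSteps] do
    let (s', done) := step s
    path := path.push s'    -- state after each checked step
    s := s'
    if done then
      break
  return path
```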
Average episodeTotalReward across multiple episodes, seeding episode k with baseSeed + k. The mkSession callback is responsible for interpreting the seed:
- for Gymnasium-backed sessions it typically passes the seed to reset, and
- for Lean-native environments it can ignore the seed.
When episodes = 0, this function returns 0 by convention.
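The seeding and averaging scheme can be sketched as below, with runEpisode standing in for "build a session via mkSession with the given seed and evaluate one episode" (a hypothetical abstraction, not the module's signature).

```lean
/-- Average of per-episode returns over `episodes` runs, where episode
    `k` is seeded with `baseSeed + k`. -/
def averageReturnSketch (runEpisode : Nat → Float)
    (baseSeed episodes : Nat) : Float :=
  if episodes = 0 then 0.0  -- convention: zero episodes yields 0
  else
    let total := (List.range episodes).foldl
      (fun acc k => acc + runEpisode (baseSeed + k)) 0.0
    total / Float.ofNat episodes
```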