RL Evaluation Helpers (Executable Runtime) #
This module factors out small non-differentiable evaluation helpers used in executable RL workflows:
- choosing a greedy discrete action (argmax) from policy logits, and
- running episodes to compute the (undiscounted) total reward.

These helpers are intentionally written against the unified checked-session interface Runtime.RL.Session.CheckedSession, so the same evaluation code can be reused with:
- external Python Gymnasium environments (via Runtime.RL.Gymnasium), and
- Lean-native environments (Spec.RL.Env) wrapped by Runtime.RL.Session.CheckedSession.ofEnv.
Semantics and references #
- We treat an episode as finished when terminated || truncated, mirroring Gymnasium's API. See the Gymnasium docs: https://gymnasium.farama.org/
- The evaluation metric returned by this module is the undiscounted sum of rewards along the episode, which is the common “episode return” metric used in Gym-style benchmarks. See Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), on episodic return and discounting: http://incompleteideas.net/book/the-book-2nd.html
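The two semantic choices above can be stated in isolation as a small Lean sketch. The StepResult record and episodeReturn below are illustrative names, not part of this module; the real code reads these flags from a checked-session transition.

```lean
/-- Illustrative step record (hypothetical; the real module reads these
    flags from a `CheckedSession` transition). -/
structure StepResult where
  reward     : Float
  terminated : Bool
  truncated  : Bool

/-- An episode is finished exactly when `terminated || truncated`. -/
def StepResult.done (s : StepResult) : Bool :=
  s.terminated || s.truncated

/-- The episode return: the undiscounted sum of rewards, i.e. the
    discounted return with discount factor 1. -/
def episodeReturn (steps : List StepResult) : Float :=
  steps.foldl (fun acc s => acc + s.reward) 0.0
```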
Greedy action selection #
Pick a greedy discrete action from a logits vector by argmax.
If nActions = 0 then Fin nActions is uninhabited, so we require [Fact (0 < nActions)].
When nActions > 0, Metrics.argmax? never returns none; the none branch returns 0 only as
an unreachable totality fallback.
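As a standalone illustration of the greedy rule, the sketch below picks the index of the largest logit over a raw array, returning 0 for an empty input as the totality fallback described above. This is not the module's Metrics.argmax? (which returns an Option and is paired with the Fact (0 < nActions) instance); names here are hypothetical.

```lean
/-- Illustrative greedy selection over raw logits: index of the largest
    entry, or `0` for an empty array as a totality fallback. -/
def greedyAction (logits : Array Float) : Nat := Id.run do
  let mut best := 0
  for i in [1:logits.size] do
    if logits[i]! > logits[best]! then
      best := i
  return best
```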
Episode evaluation #
Run one episode (up to maxSteps) using a greedy policy derived from policyLogits, and return
the total (undiscounted) reward.
The episode terminates early when the checked transition reports terminated || truncated.
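The rollout structure can be sketched against a hypothetical minimal environment. The real implementation runs through CheckedSession (typically in a monad, with each transition checked); SimpleEnv and its step shape below are assumptions for illustration only.

```lean
/-- Hypothetical minimal environment: `step` returns the next state,
    a reward, and a single `done` flag (`terminated || truncated`
    collapsed). -/
structure SimpleEnv (σ : Type) where
  start : σ
  step  : σ → Nat → σ × Float × Bool

/-- Greedy rollout for up to `maxSteps` steps, summing rewards and
    stopping early when `done` is reported. -/
def totalRewardSketch {σ : Type} (env : SimpleEnv σ)
    (policy : σ → Nat) (maxSteps : Nat) : Float := Id.run do
  let mut s := env.start
  let mut total := 0.0
  for _ in [0:maxSteps] do
    let (s', r, done) := env.step s (policy s)
    total := total + r
    s := s'
    if done then
      break
  return total
```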
Run one greedy-policy episode (up to maxSteps) and record the visited session states.
The returned path includes the initial session state from sess.start, and then each state after
every checked step.
Note: for Session.CheckedSession.ofEnv, Sess is the Lean-native environment's latent state.
For Gymnasium-backed sessions, Sess is the session record (it stores the last observation, etc.),
not the underlying Python environment's internal state.
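The path-recording behavior (initial state first, then the state after every step) can be sketched as follows. The step signature here is a hypothetical stand-in; in the module, the recorded states are the Sess values described above.

```lean
/-- Record the visited states of a rollout: the path starts with the
    initial state and appends the state after every step (hypothetical
    `step` shape, for illustration). -/
def episodePathSketch {σ : Type} (start : σ)
    (step : σ → σ × Bool) (maxSteps : Nat) : Array σ := Id.run do
  let mut s := start
  let mut path := #[start]  -- the path includes the initial state
  for _ in [0:maxSteps] do
    let (s', done) := step s
    path := path.push s'    -- state after each checked step
    s := s'
    if done then
      break
  return path
```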
Average episodeTotalReward across multiple episodes, seeding episode k with baseSeed + k. The mkSession callback is responsible for interpreting the seed:
- for Gymnasium-backed sessions it typically passes the seed to reset, and
- for Lean-native environments it can ignore the seed.
When episodes = 0, this function returns 0 by convention.
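The seeding and averaging scheme can be sketched as below, with runEpisode standing in for "build a session via mkSession with the given seed and evaluate one episode" (a hypothetical abstraction, not the module's signature).

```lean
/-- Average of per-episode returns over `episodes` runs, where episode
    `k` is seeded with `baseSeed + k`. -/
def averageReturnSketch (runEpisode : Nat → Float)
    (baseSeed episodes : Nat) : Float :=
  if episodes = 0 then 0.0  -- convention: zero episodes yields 0
  else
    let total := (List.range episodes).foldl
      (fun acc k => acc + runEpisode (baseSeed + k)) 0.0
    total / Float.ofNat episodes
```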