TorchLean API

NN.Runtime.RL.Eval

RL Evaluation Helpers (Executable Runtime) #

This module factors out small non-differentiable evaluation helpers used in executable RL workflows: greedy action selection from logits, single-episode total reward, episode state-path recording, and multi-episode reward averaging.

These helpers are intentionally written against the unified checked-session interface Runtime.RL.Session.CheckedSession, so the same evaluation code can be reused with both Lean-native environments (via Session.CheckedSession.ofEnv) and Gymnasium-backed sessions.

Semantics and references #

Greedy action selection #

def Runtime.RL.Eval.greedyActionFromLogits {α : Type} [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {nActions : ℕ} [Fact (0 < nActions)] (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) :
Fin nActions

Pick a greedy discrete action from a logits vector by argmax.

If nActions = 0 then Fin nActions is uninhabited, so we require [Fact (0 < nActions)]. When nActions > 0, Metrics.argmax? never returns none; the none branch returns 0 only as an unreachable totality fallback.
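As an illustration, a greedy action for a 4-action logits vector might be selected as follows. This is a hedged sketch: only the names in the signature above are assumed, and the `Fact` instance is supplied locally for the literal size.

```lean
open Runtime.RL.Eval in
-- Hypothetical usage sketch: pick the argmax action out of 4 logits.
-- `Fact (0 < 4)` discharges the positivity requirement for a literal size.
def pickAction (logits : Spec.Tensor Float (Spec.Shape.dim 4 Spec.Shape.scalar)) : Fin 4 :=
  haveI : Fact (0 < 4) := ⟨by decide⟩
  greedyActionFromLogits logits
```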


Episode evaluation #

def Runtime.RL.Eval.episodeTotalReward {obsShape : Spec.Shape} {nActions : ℕ} [Fact (0 < nActions)] (sess : Session.CheckedSession obsShape nActions) (policyLogits : Spec.Tensor Float obsShape → Spec.Tensor Float (Spec.Shape.dim nActions Spec.Shape.scalar)) (maxSteps : ℕ := 1000) :
IO Float
Run one episode (up to maxSteps) using a greedy policy derived from policyLogits, and return the total (undiscounted) reward.

The episode terminates early when the checked transition reports terminated || truncated.
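The loop described above can be sketched as follows. This is illustrative only: member names such as `sess.start`, `sess.step`, `st.obs`, `out.reward`, `out.state`, `out.terminated`, and `out.truncated` are assumptions about the checked-session API, not verified names.

```lean
-- Hedged sketch of the greedy evaluation loop; member names are assumptions.
def episodeTotalRewardSketch {obsShape : Spec.Shape} {nActions : ℕ} [Fact (0 < nActions)]
    (sess : Session.CheckedSession obsShape nActions)
    (policyLogits : Spec.Tensor Float obsShape →
      Spec.Tensor Float (Spec.Shape.dim nActions Spec.Shape.scalar))
    (maxSteps : ℕ := 1000) : IO Float := do
  let mut st ← sess.start                    -- initial checked session state
  let mut total : Float := 0.0
  for _ in [0:maxSteps] do
    let a := Runtime.RL.Eval.greedyActionFromLogits (policyLogits st.obs)
    let out ← sess.step st a                 -- checked transition
    total := total + out.reward
    st := out.state
    if out.terminated || out.truncated then  -- early termination
      break
  return total
```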

def Runtime.RL.Eval.episodeSessPath {obsShape : Spec.Shape} {nActions : ℕ} [Fact (0 < nActions)] (sess : Session.CheckedSession obsShape nActions) (policyLogits : Spec.Tensor Float obsShape → Spec.Tensor Float (Spec.Shape.dim nActions Spec.Shape.scalar)) (maxSteps : ℕ := 1000) :
IO (Array sess.Sess)

Run one greedy-policy episode (up to maxSteps) and record the visited session states.

The returned path includes the initial session state from sess.start, and then each state after every checked step.

Note: for Session.CheckedSession.ofEnv, Sess is the Lean-native environment's latent state. For Gymnasium-backed sessions, Sess is the session record (it stores the last observation, etc.), not the underlying Python environment's internal state.
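For example, the recorded path might be inspected like this. A sketch only: the helper below and its parameter names are hypothetical, with types taken from the signature above.

```lean
-- Hypothetical usage: run one greedy episode and report the path length.
-- The path always contains at least the initial state from `sess.start`.
def reportPath {obsShape : Spec.Shape} {nActions : ℕ} [Fact (0 < nActions)]
    (sess : Session.CheckedSession obsShape nActions)
    (policy : Spec.Tensor Float obsShape →
      Spec.Tensor Float (Spec.Shape.dim nActions Spec.Shape.scalar)) : IO Unit := do
  let path ← Runtime.RL.Eval.episodeSessPath sess policy (maxSteps := 500)
  IO.println s!"visited {path.size} session states"
```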

def Runtime.RL.Eval.averageEpisodeTotalReward {obsShape : Spec.Shape} {nActions : ℕ} [Fact (0 < nActions)] (mkSession : ℕ → Session.CheckedSession obsShape nActions) (policyLogits : Spec.Tensor Float obsShape → Spec.Tensor Float (Spec.Shape.dim nActions Spec.Shape.scalar)) (baseSeed episodes : ℕ) (maxSteps : ℕ := 1000) :
IO Float
Average episodeTotalReward across multiple episodes, seeding each episode k by baseSeed + k.

The mkSession callback is responsible for interpreting the seed:

• for Gymnasium-backed sessions it typically passes the seed to reset, and
• for Lean-native environments it can ignore the seed.

When episodes = 0, this function returns 0 by convention.
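A possible call is sketched below. `mkGymSession` is hypothetical (a seed-interpreting session constructor), and the exact shape of the mkSession callback is inferred from the docstring above rather than verified against the library.

```lean
-- Hedged sketch: average greedy returns over 10 seeded episodes.
-- `mkGymSession` is hypothetical; it interprets the per-episode seed
-- (baseSeed + k for episode k), e.g. by forwarding it to reset.
def evalPolicy {obsShape : Spec.Shape} {nActions : ℕ} [Fact (0 < nActions)]
    (mkGymSession : ℕ → Session.CheckedSession obsShape nActions)
    (policy : Spec.Tensor Float obsShape →
      Spec.Tensor Float (Spec.Shape.dim nActions Spec.Shape.scalar)) : IO Float :=
  Runtime.RL.Eval.averageEpisodeTotalReward
    (fun seed => mkGymSession seed) policy
    (baseSeed := 0) (episodes := 10)
```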
