TorchLean API

NN.Spec.RL.MarkovMDP

Measure-Theoretic Discounted MDPs (Markov Kernels) #

This module defines a small, proof-friendly discounted MDP interface on general measurable state and action spaces, using mathlib's Markov kernel formalization.

Why this layer exists (and why it is separate from the finite-state tensor MDPs):

This file stays intentionally small:

The proof layer (see NN.Proofs.RL.MarkovMDP) adds the standard discounted Bellman facts, including monotonicity and contraction in the sup metric for bounded value functions.
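Concretely, the contraction fact referenced there is the standard discounted bound, stated here for orientation (my paraphrase, not copied from the proof file):

```latex
% For bounded value functions v, w and discount factor 0 <= gamma < 1,
% the Bellman operator T (either variant) is a gamma-contraction in the
% sup metric:
\lVert T v - T w \rVert_\infty \;\le\; \gamma \, \lVert v - w \rVert_\infty ,
% so by the Banach fixed-point theorem T has a unique bounded fixed point.
```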

References:

Naming note:

@[reducible, inline]

Value function over an arbitrary measurable state space.

Instances For
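The definition body is not shown above; given that `expectedNextValue` below integrates `values` against a measure on states, a plausible reconstruction (the `S → ℝ` body is my assumption, not the file's actual code) is:

```lean
/-- Hypothetical reconstruction: a value function is a bare real-valued
map on states; no measurability is baked in at this layer. -/
@[reducible, inline]
def ValueFunction (S : Type) : Type :=
  S → ℝ
```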
    @[reducible, inline]

    Deterministic policy (measurability assumptions live in proofs, not in the raw definition).

    Instances For
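As with `ValueFunction`, the body is not shown; a minimal sketch consistent with the description (bare function, measurability deferred to the proof layer) would be:

```lean
/-- Hypothetical reconstruction: a deterministic policy is a bare map
from states to actions; measurability is imposed in the proof layer. -/
@[reducible, inline]
def Policy (S A : Type) : Type :=
  S → A
```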

      Discounted MDP specified by a Markov kernel P(· | s, a) plus reward/termination metadata.

      • initialState : S

        Canonical reset state.

      • transition : ProbabilityTheory.Kernel (S × A) S

        Transition kernel: P(· | s, a) as a measure on next states.

      • reward : S → A → ℝ

        Immediate reward r(s, a).

      • terminated : S → A → Bool

        Task-defined terminal flag for (s, a).

      • discount : ℝ

        Discount factor.

      Instances For
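Assembling the five documented fields, the structure plausibly reads as follows (a sketch; field names and types follow the list above, with the real-valued reward and discount as stated):

```lean
structure MDP (S A : Type) [MeasurableSpace S] [MeasurableSpace A] where
  /-- Canonical reset state. -/
  initialState : S
  /-- Transition kernel: P(· | s, a) as a measure on next states. -/
  transition : ProbabilityTheory.Kernel (S × A) S
  /-- Immediate reward r(s, a). -/
  reward : S → A → ℝ
  /-- Task-defined terminal flag for (s, a). -/
  terminated : S → A → Bool
  /-- Discount factor. -/
  discount : ℝ
```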
        structure Spec.RL.Markov.Valid {S A : Type} [MeasurableSpace S] [MeasurableSpace A] (mdp : MDP S A) : Prop

        Well-formedness assumptions for a Markov-kernel MDP.

        • Transition kernel is Markov: each P(· | s, a) is a probability measure.

        • measurable_reward : Measurable fun (sa : S × A) => mdp.reward sa.1 sa.2

          Reward is measurable as a function of (s,a).

        • measurable_terminated : Measurable fun (sa : S × A) => mdp.terminated sa.1 sa.2

          Terminal flag is measurable as a function of (s,a).

        • discount_nonneg : 0 ≤ mdp.discount

          Discount factor is nonnegative.

        • discount_lt_one : mdp.discount < 1

          Discount factor is strictly less than 1.

        Instances For
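The field list above pins down most of the structure; a sketch follows (the name of the Markov-kernel field is my guess, the remaining fields are transcribed from the docs):

```lean
structure Valid {S A : Type} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) : Prop where
  /-- Each P(· | s, a) is a probability measure (field name assumed). -/
  markov : ProbabilityTheory.IsMarkovKernel mdp.transition
  measurable_reward : Measurable fun sa : S × A => mdp.reward sa.1 sa.2
  measurable_terminated : Measurable fun sa : S × A => mdp.terminated sa.1 sa.2
  discount_nonneg : 0 ≤ mdp.discount
  discount_lt_one : mdp.discount < 1
```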
          noncomputable def Spec.RL.Markov.transitionMeasure {S A : Type} [MeasurableSpace S] [MeasurableSpace A] (mdp : MDP S A) (state : S) (action : A) : MeasureTheory.Measure S

          The transition measure P(· | s, a) (convenience wrapper around the kernel application).

          Instances For
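Since mathlib kernels coerce to measure-valued functions, the wrapper is presumably just kernel application (a sketch, with the `variable` context for `S` and `A` elided):

```lean
noncomputable def transitionMeasure (mdp : MDP S A) (state : S) (action : A) :
    MeasureTheory.Measure S :=
  mdp.transition (state, action)
```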
            noncomputable def Spec.RL.Markov.expectedNextValue {S A : Type} [MeasurableSpace S] [MeasurableSpace A] (mdp : MDP S A) (values : ValueFunction S) (state : S) (action : A) : ℝ

            Expected next-state value E[v(s_{t+1}) | s_t = s, a_t = a].

            Instances For
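A plausible body is a Bochner integral of `values` against the transition measure (a sketch, assuming `ValueFunction S` is `S → ℝ`):

```lean
noncomputable def expectedNextValue (mdp : MDP S A) (values : ValueFunction S)
    (state : S) (action : A) : ℝ :=
  ∫ s', values s' ∂(mdp.transition (state, action))
```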
              noncomputable def Spec.RL.Markov.actionValue {S A : Type} [MeasurableSpace S] [MeasurableSpace A] (mdp : MDP S A) (values : ValueFunction S) (state : S) (action : A) : ℝ

              Bellman-style state-action value induced by a candidate value function.

              Instances For
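This is presumably the standard discounted form, with the terminal flag cutting off the bootstrap term; the exact handling of `terminated` here is my assumption:

```lean
noncomputable def actionValue (mdp : MDP S A) (values : ValueFunction S)
    (state : S) (action : A) : ℝ :=
  mdp.reward state action +
    -- Assumed: terminal (s, a) pairs contribute no discounted future value.
    if mdp.terminated state action then 0
    else mdp.discount * expectedNextValue mdp values state action
```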
                noncomputable def Spec.RL.Markov.bellmanPolicy {S A : Type} [MeasurableSpace S] [MeasurableSpace A] (mdp : MDP S A) (policy : Policy S A) (values : ValueFunction S) : ValueFunction S

                Bellman expectation operator for a deterministic policy.

                Instances For
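For a deterministic policy, the expectation operator plausibly just evaluates `actionValue` at the policy's chosen action:

```lean
noncomputable def bellmanPolicy (mdp : MDP S A) (policy : Policy S A)
    (values : ValueFunction S) : ValueFunction S :=
  fun s => actionValue mdp values s (policy s)
```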
                  noncomputable def Spec.RL.Markov.bellmanOptimality {S A : Type} [MeasurableSpace S] [MeasurableSpace A] (mdp : MDP S A) [Fintype A] [Nonempty A] (values : ValueFunction S) : ValueFunction S

                  Bellman optimality operator for a finite action space.

                  Instances For
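With `[Fintype A] [Nonempty A]` the maximum over actions is well defined; a sketch using `Finset.sup'` (the choice of `sup'` over, say, `iSup` is my guess):

```lean
noncomputable def bellmanOptimality (mdp : MDP S A) [Fintype A] [Nonempty A]
    (values : ValueFunction S) : ValueFunction S :=
  -- Maximize the induced state-action value over the finite action space.
  fun s => Finset.univ.sup' Finset.univ_nonempty fun a =>
    actionValue mdp values s a
```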