Measure-Theoretic Discounted MDPs (Markov Kernels) #
This module defines a small, proof-friendly discounted MDP interface on general measurable state and action spaces, using mathlib's Markov kernel formalization.
Why this layer exists (and why it is separate from the finite-state tensor MDPs):
- `NN.Spec.RL.MDP` and `NN.Spec.RL.FiniteStochasticMDP` are finite and use typed tensors.
- Many RL models are naturally measure-theoretic (continuous state spaces, stochastic dynamics). For these, the right abstraction is a Markov kernel `κ : (S × A) → Measure S`.
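As a minimal illustration (the variable names here are ours, not the file's), applying a mathlib kernel to a state-action pair yields a measure on next states:

```lean
import Mathlib

open MeasureTheory ProbabilityTheory

-- Applying a kernel to a state-action pair yields a measure on next
-- states; `IsMarkovKernel` upgrades each such measure to a probability
-- measure.
example {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (κ : Kernel (S × A) S) (s : S) (a : A) : Measure S :=
  κ (s, a)
```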
This file stays intentionally small:
- deterministic policies `π : S → A`,
- Markov-kernel transitions for next states,
- real-valued rewards, discounting, and an optional per-`(state, action)` terminal flag.
The proof layer (see NN.Proofs.RL.MarkovMDP) adds the standard discounted Bellman facts,
including monotonicity and contraction in the sup metric for bounded value functions.
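Stated informally, with `T` either Bellman operator below and `γ` the discount factor, these are the standard discounted-MDP facts (Puterman, Ch. 6):

$$
v \le w \;\implies\; T\,v \le T\,w,
\qquad
\lVert T\,v - T\,w \rVert_\infty \;\le\; \gamma\,\lVert v - w \rVert_\infty,
\quad 0 \le \gamma < 1.
$$

Banach's fixed-point theorem then gives a unique bounded fixed point, which is where value-iteration-style arguments come from.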
References:
- Puterman, Markov Decision Processes (1994), Chapters 6–7.
- Bertsekas, Dynamic Programming and Optimal Control.
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3.
- mathlib: `ProbabilityTheory.Kernel` and `ProbabilityTheory.IsMarkovKernel` in `Mathlib/Probability/Kernel/Defs.lean`.
- TorchRL documentation is a helpful engineering analogue for stochastic environment interfaces, but this file uses mathlib kernels because the goal here is formal probability semantics: https://pytorch.org/rl/
Naming note:
- The names in this file live under `Spec.RL.Markov`. A reference to `Markov.MDP` is the measurable-space Markov-kernel object, while `Spec.RL.FiniteMDP` and `Spec.RL.FiniteStochastic.MDP` are the finite tensor layers.
- We keep `Policy`, `ValueFunction`, and `Valid` short inside the namespace because they are the standard mathematical words for this layer, and the namespace carries the disambiguating context.
Value function over an arbitrary measurable state space.
Deterministic policy (measurability assumptions live in proofs, not in the raw definition).
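A minimal sketch of how these two definitions plausibly read, continuing the Lean snippet above (only the names are from this file; the bodies are our assumption from the descriptions):

```lean
/-- Value function over an arbitrary measurable state space. -/
def ValueFunction (S : Type*) [MeasurableSpace S] := S → ℝ

/-- Deterministic policy; measurability is asserted in proofs, not here. -/
def Policy (S A : Type*) := S → A
```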
Discounted MDP specified by a Markov kernel `P(· | s, a)` plus reward/termination metadata.
- `initialState : S`
  Canonical reset state.
- `transition : ProbabilityTheory.Kernel (S × A) S`
  Transition kernel: `P(· | s, a)` as a measure on next states.
- `reward : S → A → ℝ`
  Immediate reward `r(s, a)`.
- `terminated : S → A → Bool`
  Task-defined terminal flag for `(s, a)`.
- `discount : ℝ`
  Discount factor.
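Continuing the single-file sketch, the structure suggested by the field list above (field names and types are from the docs; everything else is assumed):

```lean
structure MDP (S A : Type*) [MeasurableSpace S] [MeasurableSpace A] where
  /-- Canonical reset state. -/
  initialState : S
  /-- Transition kernel `P(· | s, a)` on next states. -/
  transition : Kernel (S × A) S
  /-- Immediate reward `r(s, a)`. -/
  reward : S → A → ℝ
  /-- Task-defined terminal flag for `(s, a)`. -/
  terminated : S → A → Bool
  /-- Discount factor. -/
  discount : ℝ
```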
Well-formedness assumptions for a Markov-kernel MDP.
- `isMarkov : ProbabilityTheory.IsMarkovKernel mdp.transition`
  Transition kernel is Markov: each `P(· | s, a)` is a probability measure.
- `measurable_reward : Measurable fun (sa : S × A) => mdp.reward sa.1 sa.2`
  Reward is measurable as a function of `(s, a)`.
- `measurable_terminated : Measurable fun (sa : S × A) => mdp.terminated sa.1 sa.2`
  Terminal flag is measurable as a function of `(s, a)`.
- Discount factor is nonnegative.
- Discount factor is strictly less than `1`.
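The well-formedness bundle, in the same sketch. The two discount field names are our guesses; only their statements appear in the docs above:

```lean
structure Valid {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) : Prop where
  isMarkov : IsMarkovKernel mdp.transition
  measurable_reward : Measurable fun sa : S × A => mdp.reward sa.1 sa.2
  measurable_terminated : Measurable fun sa : S × A => mdp.terminated sa.1 sa.2
  /-- Hypothetical field name; the docs state only "nonnegative". -/
  discount_nonneg : 0 ≤ mdp.discount
  /-- Hypothetical field name; the docs state only "strictly less than 1". -/
  discount_lt_one : mdp.discount < 1
```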
The transition measure `P(· | s, a)` (convenience wrapper around the kernel application).
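A plausible shape for the wrapper (name assumed), continuing the sketch:

```lean
/-- The transition measure `P(· | s, a)`, by applying the kernel. -/
def transitionMeasure {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (s : S) (a : A) : Measure S :=
  mdp.transition (s, a)
```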
Expected next-state value `E[v(s_{t+1}) | s_t = s, a_t = a]`.
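Plausibly a Bochner integral of `v` against the transition measure at `(s, a)` (name and body are our assumption):

```lean
/-- Expected next-state value `E[v(s') | s, a]`. -/
noncomputable def expectedNext {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (v : S → ℝ) (s : S) (a : A) : ℝ :=
  ∫ s', v s' ∂(mdp.transition (s, a))
```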
Bellman-style state-action value induced by a candidate value function.
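A sketch of the induced Q-value (name assumed); we also assume the terminal flag cuts off the discounted continuation:

```lean
/-- Bellman-style state-action value from a candidate value function. -/
noncomputable def qOf {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (v : S → ℝ) (s : S) (a : A) : ℝ :=
  mdp.reward s a +
    (if mdp.terminated s a then 0 else mdp.discount * expectedNext mdp v s a)
```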
Bellman expectation operator for a deterministic policy.
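For a deterministic policy, the operator plausibly evaluates the Q-value at the action the policy picks (names assumed):

```lean
/-- Bellman expectation operator `(T_π v)(s) = Q_v(s, π s)`. -/
noncomputable def bellmanPolicy {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (π : S → A) (v : S → ℝ) : S → ℝ :=
  fun s => qOf mdp v s (π s)
```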
Bellman optimality operator for a finite action space.
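With finitely many actions the maximum over actions exists, so the optimality operator can take a `Finset.sup'` over a nonempty action type; a sketch under those assumptions:

```lean
/-- Bellman optimality operator `(T* v)(s) = max_a Q_v(s, a)`. -/
noncomputable def bellmanOpt {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    [Fintype A] [Nonempty A] (mdp : MDP S A) (v : S → ℝ) : S → ℝ :=
  fun s => Finset.univ.sup' Finset.univ_nonempty fun a => qOf mdp v s a
```

`Finset.sup'` (rather than `Finset.sup`) avoids needing a bottom element in `ℝ`, at the price of the `Nonempty A` hypothesis.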