Measure-Theoretic Discounted MDPs (Markov Kernels) #
This module defines a small, proof-friendly discounted MDP interface on general measurable state and action spaces, using mathlib's Markov kernel formalization.
Why this layer exists (and why it is separate from the finite-state tensor MDPs):
- `NN.Spec.RL.MDP` and `NN.Spec.RL.FiniteStochasticMDP` are finite and use typed tensors.
- Many RL models are naturally measure-theoretic (continuous state spaces, stochastic dynamics). For these, the right abstraction is a Markov kernel `κ : (S × A) → Measure S`.
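As a minimal illustration (the variable names here are ours, not the file's), applying a mathlib kernel to a state-action pair yields a measure on next states:

```lean
import Mathlib

open MeasureTheory ProbabilityTheory

-- Applying a kernel to a state-action pair yields a measure on next
-- states; `IsMarkovKernel` upgrades each such measure to a probability
-- measure.
example {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (κ : Kernel (S × A) S) (s : S) (a : A) : Measure S :=
  κ (s, a)
```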
This file stays intentionally small:
- deterministic policies `π : S → A`,
- Markov-kernel transitions for next states,
- real-valued rewards, discounting, and an optional per-`(state, action)` terminal flag.
The proof layer (see NN.Proofs.RL.MarkovMDP) adds the standard discounted Bellman facts,
including monotonicity and contraction in the sup metric for bounded value functions.
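Stated informally, with `T` either Bellman operator below and `γ` the discount factor, these are the standard discounted-MDP facts (Puterman, Ch. 6):

$$
v \le w \;\implies\; T\,v \le T\,w,
\qquad
\lVert T\,v - T\,w \rVert_\infty \;\le\; \gamma\,\lVert v - w \rVert_\infty,
\quad 0 \le \gamma < 1.
$$

Banach's fixed-point theorem then gives a unique bounded fixed point, which is where value-iteration-style arguments come from.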
References:
- Puterman, Markov Decision Processes (1994), Chapters 6–7.
- Bertsekas, Dynamic Programming and Optimal Control.
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3.
- mathlib: `ProbabilityTheory.Kernel` and `ProbabilityTheory.IsMarkovKernel` in `Mathlib/Probability/Kernel/Defs.lean`.
- TorchRL documentation is a helpful engineering analogue for stochastic environment interfaces, but this file uses mathlib kernels because the goal here is formal probability semantics: https://pytorch.org/rl/
Naming note:
- The names in this file live under `Spec.RL.Markov`. A reference to `Markov.MDP` is the measurable-space Markov-kernel object, while `Spec.RL.FiniteMDP` and `Spec.RL.FiniteStochastic.MDP` are the finite tensor layers.
- We keep `Policy`, `ValueFunction`, and `Valid` short inside the namespace because they are the standard mathematical words for this layer, and the namespace carries the disambiguating context.
Value function over an arbitrary measurable state space.
Deterministic policy (measurability assumptions live in proofs, not in the raw definition).
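A minimal sketch of how these two definitions plausibly read, continuing the Lean snippet above (only the names are from this file; the bodies are our assumption from the descriptions):

```lean
/-- Value function over an arbitrary measurable state space. -/
def ValueFunction (S : Type*) [MeasurableSpace S] := S → ℝ

/-- Deterministic policy; measurability is asserted in proofs, not here. -/
def Policy (S A : Type*) := S → A
```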
Discounted MDP specified by a Markov kernel `P(· | s, a)` plus reward/termination metadata.
- `initialState : S`
  Canonical reset state.
- `transition : ProbabilityTheory.Kernel (S × A) S`
  Transition kernel: `P(· | s, a)` as a measure on next states.
- `reward : S → A → ℝ`
  Immediate reward `r(s, a)`.
- `terminated : S → A → Bool`
  Task-defined terminal flag for `(s, a)`.
- `discount : ℝ`
  Discount factor.
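Continuing the single-file sketch, the structure suggested by the field list above (field names and types are from the docs; everything else is assumed):

```lean
structure MDP (S A : Type*) [MeasurableSpace S] [MeasurableSpace A] where
  /-- Canonical reset state. -/
  initialState : S
  /-- Transition kernel `P(· | s, a)` on next states. -/
  transition : Kernel (S × A) S
  /-- Immediate reward `r(s, a)`. -/
  reward : S → A → ℝ
  /-- Task-defined terminal flag for `(s, a)`. -/
  terminated : S → A → Bool
  /-- Discount factor. -/
  discount : ℝ
```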
Well-formedness assumptions for a Markov-kernel MDP.
- `isMarkov : ProbabilityTheory.IsMarkovKernel mdp.transition`
  Transition kernel is Markov: each `P(· | s, a)` is a probability measure.
- `measurable_reward : Measurable fun (sa : S × A) => mdp.reward sa.1 sa.2`
  Reward is measurable as a function of `(s, a)`.
- `measurable_terminated : Measurable fun (sa : S × A) => mdp.terminated sa.1 sa.2`
  Terminal flag is measurable as a function of `(s, a)`.
- Discount factor is nonnegative.
- Discount factor is strictly less than `1`.
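The well-formedness bundle, in the same sketch. The two discount field names are our guesses; only their statements appear in the docs above:

```lean
structure Valid {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) : Prop where
  isMarkov : IsMarkovKernel mdp.transition
  measurable_reward : Measurable fun sa : S × A => mdp.reward sa.1 sa.2
  measurable_terminated : Measurable fun sa : S × A => mdp.terminated sa.1 sa.2
  /-- Hypothetical field name; the docs state only "nonnegative". -/
  discount_nonneg : 0 ≤ mdp.discount
  /-- Hypothetical field name; the docs state only "strictly less than 1". -/
  discount_lt_one : mdp.discount < 1
```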
The transition measure `P(· | s, a)` (convenience wrapper around the kernel application).
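A plausible shape for the wrapper (name assumed), continuing the sketch:

```lean
/-- The transition measure `P(· | s, a)`, by applying the kernel. -/
def transitionMeasure {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (s : S) (a : A) : Measure S :=
  mdp.transition (s, a)
```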
Expected next-state value `E[v(s_{t+1}) | s_t = s, a_t = a]`.
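Plausibly a Bochner integral of `v` against the transition measure at `(s, a)` (name and body are our assumption):

```lean
/-- Expected next-state value `E[v(s') | s, a]`. -/
noncomputable def expectedNext {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (v : S → ℝ) (s : S) (a : A) : ℝ :=
  ∫ s', v s' ∂(mdp.transition (s, a))
```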
Bellman-style state-action value induced by a candidate value function.
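A sketch of the induced Q-value (name assumed); we also assume the terminal flag cuts off the discounted continuation:

```lean
/-- Bellman-style state-action value from a candidate value function. -/
noncomputable def qOf {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (v : S → ℝ) (s : S) (a : A) : ℝ :=
  mdp.reward s a +
    (if mdp.terminated s a then 0 else mdp.discount * expectedNext mdp v s a)
```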
Bellman expectation operator for a deterministic policy.
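For a deterministic policy, the operator plausibly evaluates the Q-value at the action the policy picks (names assumed):

```lean
/-- Bellman expectation operator `(T_π v)(s) = Q_v(s, π s)`. -/
noncomputable def bellmanPolicy {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    (mdp : MDP S A) (π : S → A) (v : S → ℝ) : S → ℝ :=
  fun s => qOf mdp v s (π s)
```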
Bellman optimality operator for a finite action space.
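With finitely many actions the maximum over actions exists, so the optimality operator can take a `Finset.sup'` over a nonempty action type; a sketch under those assumptions:

```lean
/-- Bellman optimality operator `(T* v)(s) = max_a Q_v(s, a)`. -/
noncomputable def bellmanOpt {S A : Type*} [MeasurableSpace S] [MeasurableSpace A]
    [Fintype A] [Nonempty A] (mdp : MDP S A) (v : S → ℝ) : S → ℝ :=
  fun s => Finset.univ.sup' Finset.univ_nonempty fun a => qOf mdp v s a
```

`Finset.sup'` (rather than `Finset.sup`) avoids needing a bottom element in `ℝ`, at the price of the `Nonempty A` hypothesis.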