Markov-Kernel MDP Proofs (Measure Theory) #
This module proves the key discounted Bellman facts for TorchLean's measure-theoretic MDP layer
(`NN.Spec.RL.MarkovMDP`), built on mathlib's Markov kernels.
We formalize the standard argument used in dynamic programming:
- if two value functions are uniformly close (bounded sup distance), then their Bellman backups are uniformly close;
- in particular, the Bellman expectation operator for a fixed deterministic policy is a γ-contraction in the sup metric (on bounded value functions);
- for finite action spaces, the Bellman optimality operator is also a γ-contraction in the same metric.
References:
- Puterman, Markov Decision Processes (1994), Section 6.2 (discounted case): https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316887
- Bertsekas, Dynamic Programming and Optimal Control, Vol. 1 (contraction mapping argument): http://web.mit.edu/dimitrib/www/dpoc.html
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Bellman expectation/optimality operators in the discounted setting: http://incompleteideas.net/book/the-book-2nd.html
- mathlib: `ProbabilityTheory.Kernel` and `MeasureTheory` integration lemmas such as `abs_integral_le_integral_abs` and `integral_mono`. Docs entry point: https://leanprover-community.github.io/mathlib4_docs/Mathlib/Probability/Kernel/Basic.html
Sup distance on value functions, using `sSup` over pointwise absolute differences.
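A minimal sketch of such a definition (the name `valueSupDist'` and the argument layout are hypothetical; the module's actual namespace may differ):

```lean
import Mathlib

/-- Hypothetical sketch of a range-based sup distance. In `ℝ`, `sSup` of an
empty or not-bounded-above set defaults to `0`, which is why the lemmas below
carry boundedness hypotheses. -/
noncomputable def valueSupDist' {S : Type*} (v₁ v₂ : S → ℝ) : ℝ :=
  sSup (Set.range fun s => |v₁ s - v₂ s|)
```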
Every pointwise absolute difference is bounded by the sup distance (boundedness assumed).
A simple boundedness helper: if both functions are bounded, then their difference is bounded.
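A self-contained sketch of the triangle-inequality estimate behind this helper (the statement is illustrative, not the module's exact lemma):

```lean
import Mathlib

/-- Sketch: pointwise bounds on `v₁` and `v₂` bound their difference. -/
example {S : Type*} (v₁ v₂ : S → ℝ) {C₁ C₂ : ℝ}
    (h₁ : ∀ s, |v₁ s| ≤ C₁) (h₂ : ∀ s, |v₂ s| ≤ C₂) (s : S) :
    |v₁ s - v₂ s| ≤ C₁ + C₂ :=
  calc |v₁ s - v₂ s| = |v₁ s + -v₂ s| := by rw [sub_eq_add_neg]
    _ ≤ |v₁ s| + |-v₂ s| := abs_add _ _
    _ = |v₁ s| + |v₂ s| := by rw [abs_neg]
    _ ≤ C₁ + C₂ := add_le_add (h₁ s) (h₂ s)
```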
`valueSupDist = 0` iff two (bounded) value functions are equal.
Because `valueSupDist` is defined via `sSup` over pointwise absolute differences, we need a
boundedness hypothesis to use `le_csSup` (see `abs_sub_le_valueSupDist`).
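The `le_csSup` step in isolation, assuming the range-based definition sketched above:

```lean
import Mathlib

/-- Sketch: once the range of pointwise differences is bounded above,
each pointwise difference is at most its `sSup`. -/
example {S : Type*} (v₁ v₂ : S → ℝ)
    (hbdd : BddAbove (Set.range fun s => |v₁ s - v₂ s|)) (s : S) :
    |v₁ s - v₂ s| ≤ sSup (Set.range fun s => |v₁ s - v₂ s|) :=
  le_csSup hbdd ⟨s, rfl⟩
```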
Coordinatewise expectation difference is bounded by the sup distance.
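The measure-level content is the following standard estimate, sketched here for a probability measure (for a kernel `κ`, apply it to each measure `κ (s, a)`); it uses the `abs_integral_le_integral_abs` and `integral_mono` lemmas cited above:

```lean
import Mathlib

open MeasureTheory

/-- Sketch: a uniform pointwise bound `D` on `|v₁ - v₂|` bounds the
difference of expectations under any probability measure by `D`. -/
example {S : Type*} [MeasurableSpace S] (μ : Measure S) [IsProbabilityMeasure μ]
    {v₁ v₂ : S → ℝ} (h₁ : Integrable v₁ μ) (h₂ : Integrable v₂ μ)
    {D : ℝ} (hD : ∀ s, |v₁ s - v₂ s| ≤ D) :
    |∫ s, v₁ s ∂μ - ∫ s, v₂ s ∂μ| ≤ D := by
  rw [← integral_sub h₁ h₂]
  calc |∫ s, (v₁ s - v₂ s) ∂μ|
      ≤ ∫ s, |v₁ s - v₂ s| ∂μ := abs_integral_le_integral_abs
    _ ≤ ∫ _, D ∂μ := integral_mono (h₁.sub h₂).abs (integrable_const D) hD
    _ = D := by simp
```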
Bellman state-action values are Lipschitz with constant γ in the sup metric.
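The factor γ comes from a purely scalar step: the one-step reward cancels in the difference of backups. A sketch with a generic backup of the (assumed) form `r + γ * I`:

```lean
import Mathlib

/-- Sketch: if the expectations differ by at most `D`, then backups of the
form `r + γ * I` differ by at most `γ * D`; the reward cancels. -/
example {r γ I₁ I₂ D : ℝ} (hγ : 0 ≤ γ) (h : |I₁ - I₂| ≤ D) :
    |(r + γ * I₁) - (r + γ * I₂)| ≤ γ * D := by
  have hcancel : (r + γ * I₁) - (r + γ * I₂) = γ * (I₁ - I₂) := by ring
  rw [hcancel, abs_mul, abs_of_nonneg hγ]
  exact mul_le_mul_of_nonneg_left h hγ
```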
The Bellman expectation operator for a fixed deterministic policy is a γ-contraction in the sup metric:
`valueSupDist (T^π values₁) (T^π values₂) ≤ γ * valueSupDist values₁ values₂`.
At a fixed state, Bellman optimality is a contraction with modulus γ (finite action space).
The Bellman optimality operator is a γ-contraction in the sup metric (finite action space):
`valueSupDist (T* values₁) (T* values₂) ≤ γ * valueSupDist values₁ values₂`.
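Finiteness enters through the following max-swap inequality over a nonempty finite action set, sketched here with `Finset.sup'` (the module's actual phrasing of the maximum may differ):

```lean
import Mathlib

/-- Sketch: a maximum over a nonempty finite set moves by at most the largest
coordinatewise move; applied to state-action values, this transfers the bound
`γ * D` from `Q` to the optimality backup. -/
example {A : Type*} (s : Finset A) (hs : s.Nonempty) (f g : A → ℝ) {D : ℝ}
    (h : ∀ a ∈ s, |f a - g a| ≤ D) :
    |s.sup' hs f - s.sup' hs g| ≤ D := by
  rw [abs_sub_le_iff]
  constructor
  · rw [sub_le_iff_le_add]
    refine Finset.sup'_le _ _ fun a ha => ?_
    have hfg := (abs_le.mp (h a ha)).2           -- f a - g a ≤ D
    have hg : g a ≤ s.sup' hs g := Finset.le_sup' g ha
    linarith
  · rw [sub_le_iff_le_add]
    refine Finset.sup'_le _ _ fun a ha => ?_
    have hgf := (abs_le.mp (h a ha)).1           -- -D ≤ f a - g a
    have hf : f a ≤ s.sup' hs f := Finset.le_sup' f ha
    linarith
```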
Fixed Point Uniqueness #
The contraction theorems imply that (when 0 ≤ γ < 1) both Bellman operators have at most one
fixed point on the class of bounded measurable value functions. This is the standard “contraction
has at most one fixed point” argument from discounted dynamic programming.
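Concretely: if `v₁` and `v₂` are both fixed points, then `valueSupDist v₁ v₂ = valueSupDist (T v₁) (T v₂) ≤ γ * valueSupDist v₁ v₂`, which forces the distance to be `0`. A sketch of this abstract step with hypothetical names (`d` for the sup distance, `T` for either Bellman operator):

```lean
import Mathlib

/-- Sketch: a `γ`-contraction with `γ < 1` has at most one fixed point,
for any distance `d` that is nonnegative and separates points. -/
example {V : Type*} (d : V → V → ℝ) (T : V → V) {γ : ℝ} (hγ : γ < 1)
    (hpos : ∀ v w, 0 ≤ d v w) (hsep : ∀ v w, d v w = 0 → v = w)
    (hT : ∀ v w, d (T v) (T w) ≤ γ * d v w)
    {v₁ v₂ : V} (h₁ : T v₁ = v₁) (h₂ : T v₂ = v₂) : v₁ = v₂ := by
  refine hsep _ _ (le_antisymm ?_ (hpos v₁ v₂))
  have h := hT v₁ v₂   -- d (T v₁) (T v₂) ≤ γ * d v₁ v₂
  rw [h₁, h₂] at h     -- d v₁ v₂ ≤ γ * d v₁ v₂
  nlinarith [hpos v₁ v₂]
```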
If the Bellman expectation operator for a fixed deterministic policy has a fixed point, it is unique (among bounded measurable value functions).
If the Bellman optimality operator has a fixed point, it is unique (finite action space).