Finite Stochastic MDP Proofs #
This module proves the key discounted Bellman facts for TorchLean's finite stochastic MDP layer:
- the Bellman expectation and Bellman optimality operators are pointwise monotone,
- the Bellman expectation operator is a γ-contraction in the sup metric,
- the Bellman optimality operator is likewise a γ-contraction in the sup metric.
The setting is intentionally finite and concrete. The goal is not maximal generality; it is a clean, trustworthy formal base that mirrors the standard textbook RL theory for discounted MDPs.
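To fix intuition, here is a minimal sketch of the kind of data involved. The structure and field names below are purely illustrative, not TorchLean's actual definitions:

```lean
import Mathlib

-- Illustrative sketch only: names are hypothetical, not TorchLean's API.
structure FiniteMDP (S A : Type) [Fintype S] [Fintype A] where
  /-- Transition kernel `P(s' | s, a)`: a probability mass function in `s'`. -/
  transition : S → A → S → ℝ
  /-- Expected immediate reward `r(s, a)`. -/
  reward : S → A → ℝ
  /-- Discount factor; the theorems below assume `0 ≤ γ` and `γ < 1`. -/
  discount : ℝ
```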
References:
- Puterman, Markov Decision Processes (1994), discounted case: https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316887
- Bertsekas, Dynamic Programming and Optimal Control, Vol. 1 (contraction mapping argument): http://web.mit.edu/dimitrib/www/dpoc.html
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Bellman expectation/optimality operators: http://incompleteideas.net/book/the-book-2nd.html
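Throughout, the formal statements mirror the standard textbook operators. As a notational sketch (TorchLean's concrete names may differ), with transition kernel P, reward r, discount γ, and policy π:

$$Q_V(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s'),$$

$$(T^\pi V)(s) = \sum_{a} \pi(a \mid s)\, Q_V(s,a), \qquad (T^{*} V)(s) = \max_{a} Q_V(s,a).$$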
Sup distance on finite value functions, using the maximum absolute pointwise difference.
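Concretely, for finite value functions V₁ and V₂ over the state space:

$$\operatorname{valueSupDist}(V_1, V_2) = \max_{s} \,\lvert V_1(s) - V_2(s) \rvert.$$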
The sup distance is nonnegative.
Every pointwise absolute difference is bounded by the sup distance.
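Both facts are immediate from the definition as a maximum of nonnegative terms over a nonempty finite state space:

$$0 \le \operatorname{valueSupDist}(V_1, V_2), \qquad \lvert V_1(s) - V_2(s) \rvert \le \operatorname{valueSupDist}(V_1, V_2) \ \text{ for every state } s.$$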
Expected next-state value is monotone in the candidate value function.
Bellman state-action values are monotone in the candidate value function.
Bellman expectation operators are pointwise monotone.
Optimal Bellman operators are pointwise monotone.
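Schematically, these monotonicity facts chain together. In the textbook notation above, if V₁ ≤ V₂ pointwise, then

$$\sum_{s'} P(s' \mid s,a)\, V_1(s') \le \sum_{s'} P(s' \mid s,a)\, V_2(s') \;\implies\; Q_{V_1}(s,a) \le Q_{V_2}(s,a) \;\implies\; (T^\pi V_1)(s) \le (T^\pi V_2)(s) \ \text{ and } \ (T^{*} V_1)(s) \le (T^{*} V_2)(s),$$

since probabilities and γ are nonnegative, and both a convex combination over actions and a max over actions preserve pointwise inequalities.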
Coordinatewise expectation difference is bounded by the sup distance.
State-action Bellman values are Lipschitz with constant γ in the sup metric.
Bellman expectation is a contraction with modulus γ in the sup metric:
valueSupDist (T^π values₁) (T^π values₂) ≤ γ * valueSupDist values₁ values₂.
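The contraction follows from the coordinatewise bound: for any state-action pair,

$$\Big| \sum_{s'} P(s' \mid s,a)\,\big(V_1(s') - V_2(s')\big) \Big| \le \sum_{s'} P(s' \mid s,a)\,\lvert V_1(s') - V_2(s') \rvert \le \operatorname{valueSupDist}(V_1, V_2),$$

because the transition probabilities sum to 1. Multiplying by γ bounds the gap between state-action values, and taking the max over states yields the displayed inequality.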
Every individual action-value is bounded above by the Bellman optimality value (the maximum over actions).
Bellman optimality dominates Bellman evaluation under any deterministic policy.
At a fixed state, Bellman optimality is a contraction with modulus γ.
Bellman optimality is a contraction with modulus γ in the sup metric:
valueSupDist (T* values₁) (T* values₂) ≤ γ * valueSupDist values₁ values₂.
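The extra step for optimality is that a difference of maxima is bounded by the maximum difference:

$$\big| \max_a Q_{V_1}(s,a) - \max_a Q_{V_2}(s,a) \big| \le \max_a \big| Q_{V_1}(s,a) - Q_{V_2}(s,a) \big| \le \gamma \cdot \operatorname{valueSupDist}(V_1, V_2);$$

the two domination lemmas above supply the one-sided bounds needed for the first inequality.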
Contraction Iterates and Fixed Points #
The earlier theorems show that (under 0 ≤ γ < 1) the Bellman operators are γ-contractions in the sup metric (valueSupDist).
This section packages the standard consequences used throughout discounted-RL theory:
- iterating a contraction shrinks distances geometrically (by a factor of γ^k after k steps),
- fixed points are unique,
- the error to a fixed point decays geometrically under iteration.
These statements are the formal backbone behind “value iteration converges” style arguments, and they are useful even before we prove existence of a fixed point (existence is typically obtained via a completeness argument, or via an explicit linear-system solution in the finite case).
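In symbols, with d abbreviating valueSupDist and T standing for either Bellman operator, the three packaged consequences read:

$$d(T^k V_1, T^k V_2) \le \gamma^k\, d(V_1, V_2), \qquad T V^{*} = V^{*} \ \wedge\ T W^{*} = W^{*} \implies V^{*} = W^{*}, \qquad d(T^k V, V^{*}) \le \gamma^k\, d(V, V^{*}).$$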
valueSupDist = 0 iff two finite value functions are equal.
bellmanPolicy iterates are geometric contractions in valueSupDist.
If a discounted Bellman policy operator has a fixed point, it is unique.
This is the standard “contraction has at most one fixed point” argument.
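The argument is short enough to state abstractly. A minimal Lean sketch, with purely illustrative names (not TorchLean's actual API):

```lean
import Mathlib

/-- Sketch: a γ-contraction with γ < 1, in an abstract nonnegative distance
that separates points, has at most one fixed point. Names are illustrative. -/
example {V : Type*} (d : V → V → ℝ) (T : V → V) (γ : ℝ) (hγ : γ < 1)
    (hd₀ : ∀ x y, 0 ≤ d x y)                  -- the distance is nonnegative
    (hsep : ∀ x y, d x y = 0 → x = y)         -- zero distance separates points
    (hT : ∀ x y, d (T x) (T y) ≤ γ * d x y)   -- T is a γ-contraction
    {x y : V} (hx : T x = x) (hy : T y = y) : x = y := by
  apply hsep
  -- A pair of fixed points contracts into itself: d x y ≤ γ * d x y.
  have h : d x y ≤ γ * d x y := by
    calc d x y = d (T x) (T y) := by rw [hx, hy]
      _ ≤ γ * d x y := hT x y
  -- With 0 ≤ d x y and γ < 1, this forces d x y = 0.
  nlinarith [hd₀ x y]
```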
Error bound to a fixed point: iterating the Bellman policy operator reduces sup-distance geometrically (γ^k).
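Spelled out (writing T_π for the policy operator and V_π for a fixed point, notation assumed here), a fixed point is unchanged by iteration, so

$$\operatorname{valueSupDist}(T_\pi^k V,\, V_\pi) = \operatorname{valueSupDist}(T_\pi^k V,\, T_\pi^k V_\pi) \le \gamma^k \cdot \operatorname{valueSupDist}(V, V_\pi).$$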
bellmanOptimality iterates are geometric contractions in valueSupDist.
If a discounted Bellman optimality operator has a fixed point, it is unique.
This is the “contraction has at most one fixed point” argument for T*.
Error bound to a fixed point: iterating Bellman optimality reduces sup-distance geometrically.
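Specialized to the optimality operator, this is the usual value-iteration convergence rate: if V* is a fixed point of T*, then for any starting V₀,

$$\operatorname{valueSupDist}\big((T^{*})^k V_0,\, V^{*}\big) \le \gamma^k \cdot \operatorname{valueSupDist}(V_0, V^{*}),$$

so each sweep shrinks the sup-error by at least a factor of γ.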