Finite-MDP Proofs #
This module proves the first foundational theorems for TorchLean's finite discounted MDP layer:
- Bellman policy operators are monotone for nonnegative discounts,
- Bellman optimality operators dominate every policy operator,
- Bellman optimality is itself monotone,
- Bellman policy and Bellman optimality are contractions in the finite sup metric.
The proofs intentionally target deterministic finite MDPs first: this gives a clean, trustworthy base to build on before introducing stochastic transitions or richer measure-theoretic machinery.
References:
- Puterman, Markov Decision Processes (1994), discounted dynamic programming chapter: https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316887
- Bertsekas, Dynamic Programming and Optimal Control, Vol. 1 (monotonicity/contraction proofs): http://web.mit.edu/dimitrib/www/dpoc.html
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Bellman operators in the discounted case: http://incompleteideas.net/book/the-book-2nd.html
Sup Metric for Finite Value Tables #
The finite deterministic and finite stochastic developments intentionally use the same metric: the maximum absolute pointwise difference between two value tables.
Sup distance on finite value functions, using the maximum absolute pointwise difference.
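In conventional notation this is just the sup (here: max) metric on tables indexed by a finite state space. A sketch, writing $d_\infty$ for valueSupDist and $S$ for the state space, and assuming $S$ is nonempty (or that an empty maximum defaults to 0, as a Finset fold would):

```latex
d_\infty(V_1, V_2) \;=\; \max_{s \in S} \bigl| V_1(s) - V_2(s) \bigr|
```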
The sup distance is nonnegative.
Every pointwise absolute difference is bounded by the sup distance.
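Both facts are immediate from the definition above, because a maximum of nonnegative terms is nonnegative and dominates each of its terms (same notation as the sketch above):

```latex
0 \;\le\; d_\infty(V_1, V_2),
\qquad
\bigl| V_1(s) - V_2(s) \bigr| \;\le\; d_\infty(V_1, V_2) \quad \text{for every } s \in S.
```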
stateActionValue is exactly the Bellman backup on the chosen successor state.
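Spelled out under this sketch's naming assumptions — $r(s,a)$ for the immediate reward, $\mathrm{next}(s,a)$ for the deterministic successor, $m(s,a)$ for continueMask, and $Q_V(s,a)$ for stateActionValue (only continueMask and stateActionValue are the module's identifiers; the rest is notation introduced here):

```latex
Q_V(s, a) \;=\; r(s, a) \;+\; \gamma\, m(s, a)\, V\!\bigl(\mathrm{next}(s, a)\bigr).
```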
Policy Bellman operators read back exactly the selected state-action value.
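Equivalently, for a deterministic policy $\pi : S \to A$, unfolding the policy operator at a state selects exactly that action's backup (same notation as above):

```latex
(T_\pi V)(s) \;=\; Q_V\bigl(s, \pi(s)\bigr)
\;=\; r\bigl(s, \pi(s)\bigr) + \gamma\, m\bigl(s, \pi(s)\bigr)\, V\!\bigl(\mathrm{next}(s, \pi(s))\bigr).
```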
continueMask is always nonnegative.
A Bellman state-action value is monotone in the candidate value function when γ ≥ 0.
Bellman policy operators are pointwise monotone for nonnegative discounts.
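The monotonicity argument is one line of algebra: the rewards cancel and only the discounted successor difference remains, so if $V_1 \le V_2$ pointwise, then with $\gamma \ge 0$ and $m(s,a) \ge 0$

```latex
Q_{V_2}(s, a) - Q_{V_1}(s, a)
\;=\; \gamma\, m(s, a)\,\bigl( V_2(\mathrm{next}(s, a)) - V_1(\mathrm{next}(s, a)) \bigr)
\;\ge\; 0,
```

and the policy operator inherits this pointwise because $(T_\pi V)(s) = Q_V(s, \pi(s))$.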
Bellman optimality dominates every particular action.
Bellman optimality dominates Bellman evaluation under any deterministic policy.
Bellman optimality is pointwise monotone for nonnegative discounts.
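All three optimality facts reduce to properties of a finite maximum, writing $(T_* V)(s) = \max_a Q_V(s, a)$ for the optimality backup (sketch notation, assuming a finite nonempty action set):

```latex
Q_V(s, a) \;\le\; (T_* V)(s) \quad \text{for every } a,
\qquad\text{hence}\qquad
(T_\pi V)(s) \;=\; Q_V\bigl(s, \pi(s)\bigr) \;\le\; (T_* V)(s),
```

and monotonicity of the optimality operator follows by applying the state-action monotonicity above to each term inside the maximum.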
Contraction Guarantees #
For 0 ≤ γ < 1, deterministic finite Bellman operators are contractions in valueSupDist.
This is the same mathematical guarantee used by value iteration in the stochastic layer; the only difference is that the next state is a single successor instead of an expectation over a transition row.
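Concretely, the guarantee has the same shape for both operators (sketch notation as above):

```latex
d_\infty\bigl(T_\pi V_1,\, T_\pi V_2\bigr) \;\le\; \gamma\, d_\infty(V_1, V_2),
\qquad
d_\infty\bigl(T_* V_1,\, T_* V_2\bigr) \;\le\; \gamma\, d_\infty(V_1, V_2),
```

so for $0 \le \gamma < 1$ repeated application shrinks the distance between any two value tables geometrically, which is the Banach-fixed-point fact behind value iteration converging to a unique fixed point.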
Deterministic state-action Bellman values are Lipschitz with constant γ.
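The per-state-action step is the same cancellation as in the monotonicity proof: the rewards drop out and only the discounted successor difference remains. Assuming, as the name suggests, that the continue mask also satisfies $m(s, a) \le 1$ (for example, an indicator of non-terminal transitions):

```latex
\bigl| Q_{V_1}(s, a) - Q_{V_2}(s, a) \bigr|
\;=\; \gamma\, m(s, a)\,\bigl| V_1(\mathrm{next}(s, a)) - V_2(\mathrm{next}(s, a)) \bigr|
\;\le\; \gamma\, d_\infty(V_1, V_2).
```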
Bellman evaluation for a deterministic policy is a γ-contraction in the finite sup metric.
At a fixed state, Bellman optimality is Lipschitz with constant γ.
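The optimality case additionally uses the standard fact that two maxima over the same finite nonempty set differ by at most the largest pointwise difference:

```latex
\Bigl| \max_a Q_{V_1}(s, a) \;-\; \max_a Q_{V_2}(s, a) \Bigr|
\;\le\; \max_a \bigl| Q_{V_1}(s, a) - Q_{V_2}(s, a) \bigr|
\;\le\; \gamma\, d_\infty(V_1, V_2),
```

after which taking the maximum over states gives the contraction bound.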
Bellman optimality is a γ-contraction in the finite sup metric.