Markov-Kernel MDP Proofs (Measure Theory) #
This module proves the key discounted Bellman facts for TorchLean's measure-theoretic MDP layer
(`NN.Spec.RL.MarkovMDP`), built on mathlib's Markov kernels.
We formalize the standard argument used in dynamic programming:
- if two value functions are uniformly close (bounded sup distance), then their Bellman backups are uniformly close;
- in particular, the Bellman expectation operator for a fixed deterministic policy is a γ-contraction in the sup metric (on bounded value functions);
- for finite action spaces, the Bellman optimality operator is also a γ-contraction in the same metric.
References:
- Puterman, Markov Decision Processes (1994), Section 6.2 (discounted case): https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316887
- Bertsekas, Dynamic Programming and Optimal Control, Vol. 1 (contraction mapping argument): http://web.mit.edu/dimitrib/www/dpoc.html
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Bellman expectation/optimality operators in the discounted setting: http://incompleteideas.net/book/the-book-2nd.html
- mathlib: `ProbabilityTheory.Kernel` and `MeasureTheory` integration lemmas such as `abs_integral_le_integral_abs` and `integral_mono`. Docs entry point: https://leanprover-community.github.io/mathlib4_docs/Mathlib/Probability/Kernel/Basic.html
Sup distance on value functions, using `sSup` over pointwise absolute differences.
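A minimal sketch of such a definition (the name `valueSupDist'` and the argument layout are hypothetical; the module's actual namespace may differ):

```lean
import Mathlib

/-- Hypothetical sketch of a range-based sup distance. In `ℝ`, `sSup` of an
empty or not-bounded-above set defaults to `0`, which is why the lemmas below
carry boundedness hypotheses. -/
noncomputable def valueSupDist' {S : Type*} (v₁ v₂ : S → ℝ) : ℝ :=
  sSup (Set.range fun s => |v₁ s - v₂ s|)
```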
Every pointwise absolute difference is bounded by the sup distance (boundedness assumed).
A simple boundedness helper: if both functions are bounded, then their difference is bounded.
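A self-contained sketch of the triangle-inequality estimate behind this helper (the statement is illustrative, not the module's exact lemma):

```lean
import Mathlib

/-- Sketch: pointwise bounds on `v₁` and `v₂` bound their difference. -/
example {S : Type*} (v₁ v₂ : S → ℝ) {C₁ C₂ : ℝ}
    (h₁ : ∀ s, |v₁ s| ≤ C₁) (h₂ : ∀ s, |v₂ s| ≤ C₂) (s : S) :
    |v₁ s - v₂ s| ≤ C₁ + C₂ :=
  calc |v₁ s - v₂ s| = |v₁ s + -v₂ s| := by rw [sub_eq_add_neg]
    _ ≤ |v₁ s| + |-v₂ s| := abs_add _ _
    _ = |v₁ s| + |v₂ s| := by rw [abs_neg]
    _ ≤ C₁ + C₂ := add_le_add (h₁ s) (h₂ s)
```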
`valueSupDist = 0` iff two (bounded) value functions are equal.
Because `valueSupDist` is defined via `sSup` over pointwise absolute differences, we need a
boundedness hypothesis to use `le_csSup` (see `abs_sub_le_valueSupDist`).
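The `le_csSup` step in isolation, assuming the range-based definition sketched above:

```lean
import Mathlib

/-- Sketch: once the range of pointwise differences is bounded above,
each pointwise difference is at most its `sSup`. -/
example {S : Type*} (v₁ v₂ : S → ℝ)
    (hbdd : BddAbove (Set.range fun s => |v₁ s - v₂ s|)) (s : S) :
    |v₁ s - v₂ s| ≤ sSup (Set.range fun s => |v₁ s - v₂ s|) :=
  le_csSup hbdd ⟨s, rfl⟩
```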
Coordinatewise expectation difference is bounded by the sup distance.
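The measure-level content is the following standard estimate, sketched here for a probability measure (for a kernel `κ`, apply it to each measure `κ (s, a)`); it uses the `abs_integral_le_integral_abs` and `integral_mono` lemmas cited above:

```lean
import Mathlib

open MeasureTheory

/-- Sketch: a uniform pointwise bound `D` on `|v₁ - v₂|` bounds the
difference of expectations under any probability measure by `D`. -/
example {S : Type*} [MeasurableSpace S] (μ : Measure S) [IsProbabilityMeasure μ]
    {v₁ v₂ : S → ℝ} (h₁ : Integrable v₁ μ) (h₂ : Integrable v₂ μ)
    {D : ℝ} (hD : ∀ s, |v₁ s - v₂ s| ≤ D) :
    |∫ s, v₁ s ∂μ - ∫ s, v₂ s ∂μ| ≤ D := by
  rw [← integral_sub h₁ h₂]
  calc |∫ s, (v₁ s - v₂ s) ∂μ|
      ≤ ∫ s, |v₁ s - v₂ s| ∂μ := abs_integral_le_integral_abs
    _ ≤ ∫ _, D ∂μ := integral_mono (h₁.sub h₂).abs (integrable_const D) hD
    _ = D := by simp
```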
Bellman state-action values are Lipschitz with constant γ in the sup metric.
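The factor γ comes from a purely scalar step: the one-step reward cancels in the difference of backups. A sketch with a generic backup of the (assumed) form `r + γ * I`:

```lean
import Mathlib

/-- Sketch: if the expectations differ by at most `D`, then backups of the
form `r + γ * I` differ by at most `γ * D`; the reward cancels. -/
example {r γ I₁ I₂ D : ℝ} (hγ : 0 ≤ γ) (h : |I₁ - I₂| ≤ D) :
    |(r + γ * I₁) - (r + γ * I₂)| ≤ γ * D := by
  have hcancel : (r + γ * I₁) - (r + γ * I₂) = γ * (I₁ - I₂) := by ring
  rw [hcancel, abs_mul, abs_of_nonneg hγ]
  exact mul_le_mul_of_nonneg_left h hγ
```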
The Bellman expectation operator for a fixed deterministic policy is a γ-contraction in the sup metric:
`valueSupDist (T^π values₁) (T^π values₂) ≤ γ * valueSupDist values₁ values₂`.
At a fixed state, Bellman optimality is a contraction with modulus γ (finite action space).
The Bellman optimality operator is a γ-contraction in the sup metric (finite action space):
`valueSupDist (T* values₁) (T* values₂) ≤ γ * valueSupDist values₁ values₂`.
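Finiteness enters through the following max-swap inequality over a nonempty finite action set, sketched here with `Finset.sup'` (the module's actual phrasing of the maximum may differ):

```lean
import Mathlib

/-- Sketch: a maximum over a nonempty finite set moves by at most the largest
coordinatewise move; applied to state-action values, this transfers the bound
`γ * D` from `Q` to the optimality backup. -/
example {A : Type*} (s : Finset A) (hs : s.Nonempty) (f g : A → ℝ) {D : ℝ}
    (h : ∀ a ∈ s, |f a - g a| ≤ D) :
    |s.sup' hs f - s.sup' hs g| ≤ D := by
  rw [abs_sub_le_iff]
  constructor
  · rw [sub_le_iff_le_add]
    refine Finset.sup'_le _ _ fun a ha => ?_
    have hfg := (abs_le.mp (h a ha)).2           -- f a - g a ≤ D
    have hg : g a ≤ s.sup' hs g := Finset.le_sup' g ha
    linarith
  · rw [sub_le_iff_le_add]
    refine Finset.sup'_le _ _ fun a ha => ?_
    have hgf := (abs_le.mp (h a ha)).1           -- -D ≤ f a - g a
    have hf : f a ≤ s.sup' hs f := Finset.le_sup' f ha
    linarith
```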
Fixed Point Uniqueness #
The contraction theorems imply that (when 0 ≤ γ < 1) both Bellman operators have at most one
fixed point on the class of bounded measurable value functions. This is the standard “contraction
has at most one fixed point” argument from discounted dynamic programming.
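Concretely: if `v₁` and `v₂` are both fixed points, then `valueSupDist v₁ v₂ = valueSupDist (T v₁) (T v₂) ≤ γ * valueSupDist v₁ v₂`, which forces the distance to be `0`. A sketch of this abstract step with hypothetical names (`d` for the sup distance, `T` for either Bellman operator):

```lean
import Mathlib

/-- Sketch: a `γ`-contraction with `γ < 1` has at most one fixed point,
for any distance `d` that is nonnegative and separates points. -/
example {V : Type*} (d : V → V → ℝ) (T : V → V) {γ : ℝ} (hγ : γ < 1)
    (hpos : ∀ v w, 0 ≤ d v w) (hsep : ∀ v w, d v w = 0 → v = w)
    (hT : ∀ v w, d (T v) (T w) ≤ γ * d v w)
    {v₁ v₂ : V} (h₁ : T v₁ = v₁) (h₂ : T v₂ = v₂) : v₁ = v₂ := by
  refine hsep _ _ (le_antisymm ?_ (hpos v₁ v₂))
  have h := hT v₁ v₂   -- d (T v₁) (T v₂) ≤ γ * d v₁ v₂
  rw [h₁, h₂] at h     -- d v₁ v₂ ≤ γ * d v₁ v₂
  nlinarith [hpos v₁ v₂]
```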
If the Bellman expectation operator for a fixed deterministic policy has a fixed point, it is unique (among bounded measurable value functions).
If the Bellman optimality operator has a fixed point, it is unique (finite action space).