TorchLean API

NN.Proofs.RL.MarkovMDP

Markov-Kernel MDP Proofs (Measure Theory) #

This module proves the key discounted Bellman facts for TorchLean's measure-theoretic MDP layer (NN.Spec.RL.MarkovMDP), built on mathlib's Markov kernels.

We formalize the standard argument used in discounted dynamic programming: the Bellman expectation and optimality operators are γ-contractions with respect to the sup distance on bounded measurable value functions, and therefore each has at most one fixed point.

noncomputable def Proofs.RL.Markov.valueSupDist {S : Type} [Nonempty S] (values₁ values₂ : Spec.RL.Markov.ValueFunction S) :
    ℝ

Sup distance on value functions, using sSup over pointwise absolute differences.

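For orientation, the definition plausibly has the following shape (a minimal sketch, assuming Spec.RL.Markov.ValueFunction S unfolds to S → ℝ; supDistSketch is an illustrative name, not the actual declaration):

    import Mathlib

    -- Sup distance as the supremum of pointwise absolute differences.
    -- `sSup` on ℝ is the conditionally complete supremum, so this is only
    -- meaningful when the range is bounded above — hence the `BddAbove`
    -- hypotheses threaded through the lemmas below.
    noncomputable def supDistSketch {S : Type} (v₁ v₂ : S → ℝ) : ℝ :=
      sSup (Set.range fun s => |v₁ s - v₂ s|)
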
    theorem Proofs.RL.Markov.abs_sub_le_valueSupDist {S : Type} [Nonempty S] (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hBdd : BddAbove (Set.range fun (s : S) => |values₁ s - values₂ s|)) (state : S) :
    |values₁ state - values₂ state| ≤ valueSupDist values₁ values₂

    Every pointwise absolute difference is bounded by the sup distance (boundedness assumed).
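
    The proof is a direct application of mathlib's le_csSup, which is exactly where the BddAbove hypothesis is consumed; a sketch against the unfolded definition:

        import Mathlib

        -- Any element of a set bounded above is at most its conditional sSup.
        example {S : Type} (v₁ v₂ : S → ℝ)
            (hBdd : BddAbove (Set.range fun s => |v₁ s - v₂ s|)) (s : S) :
            |v₁ s - v₂ s| ≤ sSup (Set.range fun t => |v₁ t - v₂ t|) :=
          le_csSup hBdd ⟨s, rfl⟩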

    theorem Proofs.RL.Markov.bddAbove_abs_sub_of_bddAbove_abs {S : Type} (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (h₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (h₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) :
    BddAbove (Set.range fun (s : S) => |values₁ s - values₂ s|)

    A simple boundedness helper: if both functions are bounded, then their difference is bounded.
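
    Pointwise this is just the triangle inequality (a sketch; c₁ and c₂ stand for upper bounds witnessing the two BddAbove hypotheses):

        import Mathlib

        -- |v₁ s - v₂ s| ≤ |v₁ s| + |v₂ s| ≤ c₁ + c₂, uniformly in s.
        example {S : Type} (v₁ v₂ : S → ℝ) (c₁ c₂ : ℝ)
            (h₁ : ∀ s, |v₁ s| ≤ c₁) (h₂ : ∀ s, |v₂ s| ≤ c₂) (s : S) :
            |v₁ s - v₂ s| ≤ c₁ + c₂ := by
          have h := abs_sub (v₁ s) (v₂ s)
          linarith [h₁ s, h₂ s]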

    theorem Proofs.RL.Markov.valueSupDist_eq_zero_iff {S : Type} [Nonempty S] (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hBdd₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (hBdd₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) :
    valueSupDist values₁ values₂ = 0 ↔ values₁ = values₂

    valueSupDist = 0 iff two (bounded) value functions are equal.

    Because valueSupDist is defined via sSup over pointwise absolute differences, we need a boundedness hypothesis to use le_csSup (see abs_sub_le_valueSupDist).

    theorem Proofs.RL.Markov.expectedNextValue_abs_sub_le {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hMeas₁ : Measurable values₁) (hMeas₂ : Measurable values₂) (hBdd₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (hBdd₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) (state : S) (action : A) :
    |Spec.RL.Markov.expectedNextValue mdp values₁ state action - Spec.RL.Markov.expectedNextValue mdp values₂ state action| ≤ valueSupDist values₁ values₂

    Coordinatewise expectation difference is bounded by the sup distance.
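
    This is the standard estimate of an integral difference against a probability measure. A minimal standalone sketch on a single probability measure μ (the lemma applies this with μ the transition kernel's measure at (state, action); the measurability and boundedness hypotheses together with valid are assumed to supply the integrability facts h₁, h₂):

        import Mathlib
        open MeasureTheory

        example {S : Type} [MeasurableSpace S] (μ : Measure S)
            [IsProbabilityMeasure μ] (v₁ v₂ : S → ℝ) (c : ℝ)
            (h₁ : Integrable v₁ μ) (h₂ : Integrable v₂ μ)
            (hc : ∀ s, |v₁ s - v₂ s| ≤ c) :
            |(∫ s, v₁ s ∂μ) - ∫ s, v₂ s ∂μ| ≤ c := by
          rw [← integral_sub h₁ h₂]
          calc |∫ s, (v₁ s - v₂ s) ∂μ|
              ≤ ∫ s, |v₁ s - v₂ s| ∂μ := abs_integral_le_integral_abs
            _ ≤ ∫ _s, c ∂μ :=
                integral_mono (h₁.sub h₂).abs (integrable_const c) hc
            _ = c := by simp  -- the kernel has total mass 1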

    theorem Proofs.RL.Markov.actionValue_abs_sub_le {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hMeas₁ : Measurable values₁) (hMeas₂ : Measurable values₂) (hBdd₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (hBdd₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) (state : S) (action : A) :
    |Spec.RL.Markov.actionValue mdp values₁ state action - Spec.RL.Markov.actionValue mdp values₂ state action| ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman state-action values are Lipschitz with constant γ in the sup metric.
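
    The γ shows up because the shared reward term cancels in the difference. Schematically (a sketch, assuming actionValue unfolds to a reward plus mdp.discount times expectedNextValue; r, γ, e₁, e₂ abbreviate those pieces, and 0 ≤ γ comes from valid):

        import Mathlib

        -- The common reward r cancels, leaving γ times the expectation gap.
        example (r γ e₁ e₂ c : ℝ) (hγ : 0 ≤ γ) (h : |e₁ - e₂| ≤ c) :
            |(r + γ * e₁) - (r + γ * e₂)| ≤ γ * c := by
          have hcancel : (r + γ * e₁) - (r + γ * e₂) = γ * (e₁ - e₂) := by ring
          rw [hcancel, abs_mul, abs_of_nonneg hγ]
          exact mul_le_mul_of_nonneg_left h hγ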

    theorem Proofs.RL.Markov.bellmanPolicy_contraction {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) (policy : Spec.RL.Markov.Policy S A) (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hMeas₁ : Measurable values₁) (hMeas₂ : Measurable values₂) (hBdd₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (hBdd₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) :
    valueSupDist (Spec.RL.Markov.bellmanPolicy mdp policy values₁) (Spec.RL.Markov.bellmanPolicy mdp policy values₂) ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman expectation for a deterministic policy is a γ-contraction in the sup metric:

    valueSupDist (T^π values₁) (T^π values₂) ≤ γ * valueSupDist values₁ values₂.
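
    The pointwise bound from actionValue_abs_sub_le lifts to the supremum via csSup_le, the canonical route given that valueSupDist is an sSup; a minimal sketch of that step, which is also where [Nonempty S] is needed:

        import Mathlib

        -- A uniform upper bound on a nonempty range bounds its sSup.
        example {S : Type} [Nonempty S] (f : S → ℝ) (c : ℝ)
            (h : ∀ s, f s ≤ c) : sSup (Set.range f) ≤ c := by
          apply csSup_le (Set.range_nonempty f)
          rintro x ⟨s, rfl⟩
          exact h s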

    theorem Proofs.RL.Markov.bellmanOptimality_abs_sub_le {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) [Fintype A] [Nonempty A] (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hMeas₁ : Measurable values₁) (hMeas₂ : Measurable values₂) (hBdd₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (hBdd₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) (state : S) :
    |Spec.RL.Markov.bellmanOptimality mdp values₁ state - Spec.RL.Markov.bellmanOptimality mdp values₂ state| ≤ mdp.discount * valueSupDist values₁ values₂

    At a fixed state, Bellman optimality is a contraction with modulus γ (finite action space).
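
    The new ingredient over the policy case is a finite-max inequality: a uniform bound on |f a - g a| bounds the difference of maxima. A sketch, stated with Finset.sup' for illustration (the actual form of the maximum inside bellmanOptimality may differ):

        import Mathlib

        example {A : Type} [Fintype A] [Nonempty A] (f g : A → ℝ) (c : ℝ)
            (h : ∀ a, |f a - g a| ≤ c) :
            |Finset.univ.sup' Finset.univ_nonempty f
              - Finset.univ.sup' Finset.univ_nonempty g| ≤ c := by
          rw [abs_sub_le_iff]
          constructor
          · rw [sub_le_iff_le_add]
            apply Finset.sup'_le
            intro a _
            have hab := (abs_sub_le_iff.mp (h a)).1
            have hg := Finset.le_sup' g (Finset.mem_univ a)
            linarith
          · rw [sub_le_iff_le_add]
            apply Finset.sup'_le
            intro a _
            have hab := (abs_sub_le_iff.mp (h a)).2
            have hf := Finset.le_sup' f (Finset.mem_univ a)
            linarith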

    theorem Proofs.RL.Markov.bellmanOptimality_contraction {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) [Fintype A] [Nonempty A] (values₁ values₂ : Spec.RL.Markov.ValueFunction S) (hMeas₁ : Measurable values₁) (hMeas₂ : Measurable values₂) (hBdd₁ : BddAbove (Set.range fun (s : S) => |values₁ s|)) (hBdd₂ : BddAbove (Set.range fun (s : S) => |values₂ s|)) :
    valueSupDist (Spec.RL.Markov.bellmanOptimality mdp values₁) (Spec.RL.Markov.bellmanOptimality mdp values₂) ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman optimality is a γ-contraction in the sup metric (finite action space):

    valueSupDist (T* values₁) (T* values₂) ≤ γ * valueSupDist values₁ values₂.

    Fixed Point Uniqueness #

    The contraction theorems imply that (when 0 ≤ γ < 1) both Bellman operators have at most one fixed point on the class of bounded measurable value functions. This is the standard “contraction has at most one fixed point” argument from discounted dynamic programming.
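
    The arithmetic core of both uniqueness proofs, in isolation (a sketch: d plays the role of valueSupDist v w, which is nonnegative and, at a pair of fixed points, satisfies d ≤ γ * d by the contraction theorems):

        import Mathlib

        -- If d ≤ γ * d with 0 ≤ d and γ < 1, then d = 0; combined with
        -- valueSupDist_eq_zero_iff this forces v = w.
        example (γ d : ℝ) (hγ : γ < 1) (hd : 0 ≤ d) (hcontr : d ≤ γ * d) :
            d = 0 := by
          by_contra hne
          have hpos : 0 < d := lt_of_le_of_ne hd (Ne.symm hne)
          have : γ * d < 1 * d := mul_lt_mul_of_pos_right hγ hpos
          linarith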

    theorem Proofs.RL.Markov.bellmanPolicy_fixedPoint_unique {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) (policy : Spec.RL.Markov.Policy S A) (v w : Spec.RL.Markov.ValueFunction S) (hv : Spec.RL.Markov.bellmanPolicy mdp policy v = v) (hw : Spec.RL.Markov.bellmanPolicy mdp policy w = w) (hMeasV : Measurable v) (hMeasW : Measurable w) (hBddV : BddAbove (Set.range fun (s : S) => |v s|)) (hBddW : BddAbove (Set.range fun (s : S) => |w s|)) :
    v = w

    If the Bellman expectation operator for a fixed deterministic policy has a fixed point, it is unique (among bounded measurable value functions).

    theorem Proofs.RL.Markov.bellmanOptimality_fixedPoint_unique {S A : Type} [MeasurableSpace S] [MeasurableSpace A] [Nonempty S] (mdp : Spec.RL.Markov.MDP S A) (valid : Spec.RL.Markov.Valid mdp) [Fintype A] [Nonempty A] (v w : Spec.RL.Markov.ValueFunction S) (hv : Spec.RL.Markov.bellmanOptimality mdp v = v) (hw : Spec.RL.Markov.bellmanOptimality mdp w = w) (hMeasV : Measurable v) (hMeasW : Measurable w) (hBddV : BddAbove (Set.range fun (s : S) => |v s|)) (hBddW : BddAbove (Set.range fun (s : S) => |w s|)) :
    v = w

    If the Bellman optimality operator has a fixed point, it is unique (finite action space).