TorchLean API

NN.Proofs.RL.FiniteStochasticMDP

Finite Stochastic MDP Proofs #

This module proves the key discounted Bellman facts for TorchLean's finite stochastic MDP layer:

- monotonicity of the Bellman expectation and optimality operators,
- γ-Lipschitz bounds and γ-contraction of both operators in the sup metric (valueSupDist),
- uniqueness of fixed points and geometric (γ^k) error bounds for the iterates.

The setting is intentionally finite and concrete. The goal is not maximal generality; it is a clean, trustworthy formal base that mirrors the standard textbook RL theory for discounted MDPs.

noncomputable def Proofs.RL.FiniteStochastic.valueSupDist {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) : ℝ

Sup distance on finite value functions, using the maximum absolute pointwise difference.

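The page records only the signature above, so the following is a minimal standalone sketch of the same construction, assuming nothing beyond Mathlib: value functions are modeled as bare `Fin n → ℝ`, and `supDist` is an illustrative stand-in for `valueSupDist`, not TorchLean's actual definition.

```lean
import Mathlib

/-- Illustrative stand-in for `valueSupDist`: the maximum absolute
pointwise difference between two functions on a nonempty `Fin n`. -/
noncomputable def supDist {n : ℕ} [Fact (0 < n)] (v w : Fin n → ℝ) : ℝ :=
  Finset.univ.sup' ⟨⟨0, Fact.out⟩, Finset.mem_univ _⟩ fun s => |v s - w s|

-- Every pointwise gap is below the sup ...
example {n : ℕ} [Fact (0 < n)] (v w : Fin n → ℝ) (s : Fin n) :
    |v s - w s| ≤ supDist v w := by
  unfold supDist
  exact Finset.le_sup' _ (Finset.mem_univ s)

-- ... and hence the sup is nonnegative.
example {n : ℕ} [Fact (0 < n)] (v w : Fin n → ℝ) : 0 ≤ supDist v w := by
  have h0 : (⟨0, Fact.out⟩ : Fin n) ∈ Finset.univ := Finset.mem_univ _
  unfold supDist
  exact le_trans (abs_nonneg _) (Finset.le_sup' _ h0)
```

Both `example`s mirror the next two theorems on this page; each reduces to `Finset.le_sup'`.
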
    theorem Proofs.RL.FiniteStochastic.valueSupDist_nonneg {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    0 ≤ valueSupDist values₁ values₂

    The sup distance is nonnegative.

    theorem Proofs.RL.FiniteStochastic.abs_sub_valueAt_le_valueSupDist {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) (state : Fin nStates) :
    |Spec.RL.valueAt values₁ state - Spec.RL.valueAt values₂ state| ≤ valueSupDist values₁ values₂

    Every pointwise absolute difference is bounded by the sup distance.

    theorem Proofs.RL.FiniteStochastic.expectedNextValue_monotone {nStates nActions : ℕ} (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) (action : Fin nActions) :
    Spec.RL.FiniteStochastic.expectedNextValue mdp values₁ state action ≤ Spec.RL.FiniteStochastic.expectedNextValue mdp values₂ state action

    Expected next-state value is monotone in the candidate value function.
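
    The mathematical content here is just that a nonnegatively weighted sum is monotone in the summands. A minimal standalone sketch, with `p` a hypothetical stand-in for a transition-probability row:

```lean
import Mathlib

-- A nonnegatively weighted sum is monotone in the values being averaged;
-- probability weights are a special case.
example {n : ℕ} (p v w : Fin n → ℝ)
    (hp : ∀ s, 0 ≤ p s) (hvw : ∀ s, v s ≤ w s) :
    ∑ s, p s * v s ≤ ∑ s, p s * w s :=
  Finset.sum_le_sum fun s _ => mul_le_mul_of_nonneg_left (hvw s) (hp s)
```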

    theorem Proofs.RL.FiniteStochastic.actionValue_monotone {nStates nActions : ℕ} (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) (action : Fin nActions) :
    Spec.RL.FiniteStochastic.actionValue mdp values₁ state action ≤ Spec.RL.FiniteStochastic.actionValue mdp values₂ state action

    Bellman state-action values are monotone in the candidate value function.

    theorem Proofs.RL.FiniteStochastic.bellmanPolicy_monotone {nStates nActions : ℕ} (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (policy : Spec.RL.Policy nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanPolicy mdp policy values₁) state ≤ Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanPolicy mdp policy values₂) state

    Bellman expectation operators are pointwise monotone.

    theorem Proofs.RL.FiniteStochastic.bellmanOptimality_monotone {nStates nActions : ℕ} [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanOptimality mdp values₁) state ≤ Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanOptimality mdp values₂) state

    Optimal Bellman operators are pointwise monotone.

    theorem Proofs.RL.FiniteStochastic.expectedNextValue_abs_sub_le {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) (state : Fin nStates) (action : Fin nActions) :
    |Spec.RL.FiniteStochastic.expectedNextValue mdp values₁ state action - Spec.RL.FiniteStochastic.expectedNextValue mdp values₂ state action| ≤ valueSupDist values₁ values₂

    Coordinatewise expectation difference is bounded by the sup distance.
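
    The arithmetic behind this bound: when the weights are nonnegative and sum to one, the expectation gap is at most any uniform pointwise bound `D`. A self-contained sketch with hypothetical local names (`p`, `D`):

```lean
import Mathlib

-- An expectation under a probability vector moves by at most the uniform
-- pointwise bound D on the underlying values.
example {n : ℕ} (p v w : Fin n → ℝ) (D : ℝ)
    (hp : ∀ s, 0 ≤ p s) (hsum : ∑ s, p s = 1) (hD : ∀ s, |v s - w s| ≤ D) :
    |∑ s, p s * v s - ∑ s, p s * w s| ≤ D := by
  -- Rewrite the difference of sums as a single weighted sum of differences.
  have hrw : ∑ s, p s * v s - ∑ s, p s * w s = ∑ s, p s * (v s - w s) := by
    rw [← Finset.sum_sub_distrib]
    exact Finset.sum_congr rfl fun s _ => (mul_sub _ _ _).symm
  rw [hrw]
  calc |∑ s, p s * (v s - w s)|
      ≤ ∑ s, |p s * (v s - w s)| := Finset.abs_sum_le_sum_abs _ _
    _ ≤ ∑ s, p s * D := by
        refine Finset.sum_le_sum fun s _ => ?_
        rw [abs_mul, abs_of_nonneg (hp s)]
        exact mul_le_mul_of_nonneg_left (hD s) (hp s)
    _ = (∑ s, p s) * D := by rw [← Finset.sum_mul]
    _ = D := by rw [hsum, one_mul]
```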

    theorem Proofs.RL.FiniteStochastic.actionValue_abs_sub_le {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) (state : Fin nStates) (action : Fin nActions) :
    |Spec.RL.FiniteStochastic.actionValue mdp values₁ state action - Spec.RL.FiniteStochastic.actionValue mdp values₂ state action| ≤ mdp.discount * valueSupDist values₁ values₂

    State-action Bellman values are Lipschitz with constant γ in the sup metric.
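
    Relative to the previous lemma, the only new step is that the immediate reward cancels in the difference while the discount scales the expectation gap. A one-lemma sketch with local names (`r`, `e₁`, `e₂` standing in for the reward and the two expectations):

```lean
import Mathlib

-- Rewards cancel in the difference, and the discount γ scales the
-- expectation gap; this is where the Lipschitz constant γ comes from.
example (r γ e₁ e₂ D : ℝ) (hγ : 0 ≤ γ) (he : |e₁ - e₂| ≤ D) :
    |(r + γ * e₁) - (r + γ * e₂)| ≤ γ * D := by
  have h : (r + γ * e₁) - (r + γ * e₂) = γ * (e₁ - e₂) := by ring
  rw [h, abs_mul, abs_of_nonneg hγ]
  exact mul_le_mul_of_nonneg_left he hγ
```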

    theorem Proofs.RL.FiniteStochastic.bellmanPolicy_contraction {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (policy : Spec.RL.Policy nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    valueSupDist (Spec.RL.FiniteStochastic.bellmanPolicy mdp policy values₁) (Spec.RL.FiniteStochastic.bellmanPolicy mdp policy values₂) ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman expectation is a contraction with modulus γ in the sup metric:

    valueSupDist (T^π values₁) (T^π values₂) ≤ γ * valueSupDist values₁ values₂.
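
    The remaining step from the pointwise γ-bound to this statement is that a uniform pointwise bound transfers to the sup of the gaps, via `Finset.sup'_le`. A standalone sketch, writing the sup out explicitly instead of through `valueSupDist`:

```lean
import Mathlib

-- A uniform pointwise bound B dominates the sup of the pointwise gaps.
example {n : ℕ} [Fact (0 < n)] (f g : Fin n → ℝ) (B : ℝ)
    (h : ∀ s, |f s - g s| ≤ B) :
    (Finset.univ.sup' ⟨⟨0, Fact.out⟩, Finset.mem_univ _⟩
      fun s => |f s - g s|) ≤ B :=
  Finset.sup'_le _ _ fun s _ => h s
```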

    theorem Proofs.RL.FiniteStochastic.actionValue_le_bellmanOptimality {nStates nActions : ℕ} [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (values : Spec.RL.ValueFunction nStates) (state : Fin nStates) (action : Fin nActions) :
    Spec.RL.FiniteStochastic.actionValue mdp values state action ≤ Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanOptimality mdp values) state

    Every particular action-value is bounded by Bellman optimality.

    theorem Proofs.RL.FiniteStochastic.bellmanPolicy_le_bellmanOptimality {nStates nActions : ℕ} [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (policy : Spec.RL.Policy nStates nActions) (values : Spec.RL.ValueFunction nStates) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanPolicy mdp policy values) state ≤ Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanOptimality mdp values) state

    Bellman optimality dominates Bellman evaluation under any deterministic policy.

    theorem Proofs.RL.FiniteStochastic.bellmanOptimality_abs_sub_le {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) (state : Fin nStates) :
    |Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanOptimality mdp values₁) state - Spec.RL.valueAt (Spec.RL.FiniteStochastic.bellmanOptimality mdp values₂) state| ≤ mdp.discount * valueSupDist values₁ values₂

    At a fixed state, Bellman optimality is a contraction with modulus γ.
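
    The ingredient specific to the optimality operator is that maxima move by at most the largest pointwise move. A standalone sketch over an explicit `Finset.sup'`, with `B` standing in for `mdp.discount * valueSupDist values₁ values₂`:

```lean
import Mathlib

-- If every coordinate of f is within B of the corresponding coordinate of g,
-- then the two maxima are within B of each other.
example {m : ℕ} (H : (Finset.univ : Finset (Fin m)).Nonempty)
    (f g : Fin m → ℝ) (B : ℝ) (hB : ∀ a, |f a - g a| ≤ B) :
    |Finset.univ.sup' H f - Finset.univ.sup' H g| ≤ B := by
  rw [abs_sub_le_iff]
  refine ⟨?_, ?_⟩
  · rw [sub_le_iff_le_add]
    refine Finset.sup'_le H f fun a _ => ?_
    have h1 : f a - g a ≤ B := (abs_sub_le_iff.mp (hB a)).1
    have h2 : g a ≤ Finset.univ.sup' H g := Finset.le_sup' g (Finset.mem_univ a)
    linarith
  · rw [sub_le_iff_le_add]
    refine Finset.sup'_le H g fun a _ => ?_
    have h1 : g a - f a ≤ B := (abs_sub_le_iff.mp (hB a)).2
    have h2 : f a ≤ Finset.univ.sup' H f := Finset.le_sup' f (Finset.mem_univ a)
    linarith
```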

    theorem Proofs.RL.FiniteStochastic.bellmanOptimality_contraction {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    valueSupDist (Spec.RL.FiniteStochastic.bellmanOptimality mdp values₁) (Spec.RL.FiniteStochastic.bellmanOptimality mdp values₂) ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman optimality is a contraction with modulus γ in the sup metric:

    valueSupDist (T* values₁) (T* values₂) ≤ γ * valueSupDist values₁ values₂.

    Contraction Iterates and Fixed Points #

    The earlier theorems show that (under 0 ≤ γ < 1) the Bellman operators are γ-contractions in the sup metric (valueSupDist).

    This section packages the standard consequences used throughout discounted-RL theory:

    - iterates of a γ-contraction contract geometrically (γ^k),
    - a contraction has at most one fixed point,
    - the sup-distance from the k-th iterate to any fixed point decays like γ^k.

    These statements are the formal backbone behind “value iteration converges” style arguments, and they are useful even before we prove existence of a fixed point (existence is typically obtained via a completeness argument, or via an explicit linear-system solution in the finite case).

    theorem Proofs.RL.FiniteStochastic.valueSupDist_eq_zero_iff {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    valueSupDist values₁ values₂ = 0 ↔ values₁ = values₂

    valueSupDist = 0 iff two finite value functions are equal.
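
    The forward direction is the substantive one: a zero sup forces every pointwise gap to zero. A sketch in the same standalone model, with the sup written out explicitly:

```lean
import Mathlib

-- If the maximum absolute pointwise gap is 0, the functions agree everywhere.
example {n : ℕ} [Fact (0 < n)] (f g : Fin n → ℝ)
    (h : (Finset.univ.sup' ⟨⟨0, Fact.out⟩, Finset.mem_univ _⟩
      fun s => |f s - g s|) = 0) : f = g := by
  funext s
  have hle : |f s - g s| ≤ 0 := by
    rw [← h]; exact Finset.le_sup' _ (Finset.mem_univ s)
  have heq : |f s - g s| = 0 := le_antisymm hle (abs_nonneg _)
  exact sub_eq_zero.mp (abs_eq_zero.mp heq)
```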

    theorem Proofs.RL.FiniteStochastic.bellmanPolicy_iterate_contraction {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (policy : Spec.RL.Policy nStates nActions) (k : ℕ) (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    valueSupDist ((Spec.RL.FiniteStochastic.bellmanPolicy mdp policy)^[k] values₁) ((Spec.RL.FiniteStochastic.bellmanPolicy mdp policy)^[k] values₂) mdp.discount ^ k * valueSupDist values₁ values₂

    bellmanPolicy iterates are geometric contractions in valueSupDist.
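
    The proof is the usual induction on `k`. A self-contained sketch over an abstract distance `d`, operator `T`, and modulus `γ`, all hypothetical local names:

```lean
import Mathlib

-- Iterating a γ-contraction contracts distances by γ^k, by induction on k.
theorem iterate_contraction {α : Type*} (d : α → α → ℝ) (γ : ℝ) (hγ0 : 0 ≤ γ)
    (T : α → α) (hT : ∀ x y, d (T x) (T y) ≤ γ * d x y)
    (k : ℕ) (x y : α) : d (T^[k] x) (T^[k] y) ≤ γ ^ k * d x y := by
  induction k with
  | zero => simp
  | succ n ih =>
    calc d (T^[n + 1] x) (T^[n + 1] y)
        = d (T (T^[n] x)) (T (T^[n] y)) := by
          rw [Function.iterate_succ_apply', Function.iterate_succ_apply']
      _ ≤ γ * d (T^[n] x) (T^[n] y) := hT _ _
      _ ≤ γ * (γ ^ n * d x y) := mul_le_mul_of_nonneg_left ih hγ0
      _ = γ ^ (n + 1) * d x y := by ring
```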

    theorem Proofs.RL.FiniteStochastic.bellmanPolicy_fixedPoint_unique {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (policy : Spec.RL.Policy nStates nActions) (v w : Spec.RL.ValueFunction nStates) (hv : Spec.RL.FiniteStochastic.bellmanPolicy mdp policy v = v) (hw : Spec.RL.FiniteStochastic.bellmanPolicy mdp policy w = w) :
    v = w

    If a discounted Bellman policy operator has a fixed point, it is unique.

    This is the standard “contraction has at most one fixed point” argument.
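
    The argument is short enough to state abstractly. A self-contained sketch, assuming only a nonnegative, point-separating distance `d` and a modulus `0 ≤ γ < 1` (all names local to the sketch):

```lean
import Mathlib

/-- A contraction has at most one fixed point: the shape of the argument
behind `bellmanPolicy_fixedPoint_unique`, over an abstract distance `d`. -/
theorem contraction_fixedPoint_unique {α : Type*} (d : α → α → ℝ) (γ : ℝ)
    (hγ0 : 0 ≤ γ) (hγ1 : γ < 1)
    (hd_nonneg : ∀ x y, 0 ≤ d x y)
    (hd_eq : ∀ x y, d x y = 0 ↔ x = y)
    (T : α → α) (hT : ∀ x y, d (T x) (T y) ≤ γ * d x y)
    {v w : α} (hv : T v = v) (hw : T w = w) : v = w := by
  -- The contraction bound at the two fixed points gives d v w ≤ γ * d v w.
  have hle : d v w ≤ γ * d v w := by simpa [hv, hw] using hT v w
  have h1 : (1 - γ) * d v w ≤ 0 := by nlinarith
  -- But 1 - γ > 0 and d v w ≥ 0, so the product is also nonnegative.
  have h2 : 0 ≤ (1 - γ) * d v w := mul_nonneg (by linarith) (hd_nonneg v w)
  -- Hence d v w = 0, and the distance separates points.
  have h3 : d v w = 0 := by
    rcases mul_eq_zero.mp (le_antisymm h1 h2) with h | h
    · linarith
    · exact h
  exact (hd_eq v w).mp h3
```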

    theorem Proofs.RL.FiniteStochastic.bellmanPolicy_iterate_error_to_fixedPoint {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (policy : Spec.RL.Policy nStates nActions) (v vStar : Spec.RL.ValueFunction nStates) (hvStar : Spec.RL.FiniteStochastic.bellmanPolicy mdp policy vStar = vStar) (k : ℕ) :
    valueSupDist ((Spec.RL.FiniteStochastic.bellmanPolicy mdp policy)^[k] v) vStar ≤ mdp.discount ^ k * valueSupDist v vStar

    Error bound to a fixed point: iterating the Bellman policy operator reduces sup-distance geometrically (γ^k).
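
    Given the iterate bound, this follows by absorbing the fixed point into the iterate via `Function.iterate_fixed`. A sketch that takes the abstract iterate bound as a hypothesis:

```lean
import Mathlib

-- A fixed point is fixed by every iterate, so the iterate bound
-- specializes to an error bound toward vStar.
example {α : Type*} (d : α → α → ℝ) (γ : ℝ) (T : α → α) (k : ℕ) (v vStar : α)
    (hiter : ∀ (k : ℕ) (x y : α), d (T^[k] x) (T^[k] y) ≤ γ ^ k * d x y)
    (hvStar : T vStar = vStar) :
    d (T^[k] v) vStar ≤ γ ^ k * d v vStar := by
  have h := hiter k v vStar
  rwa [Function.iterate_fixed hvStar k] at h
```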

    theorem Proofs.RL.FiniteStochastic.bellmanOptimality_iterate_contraction {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (k : ℕ) (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    valueSupDist ((Spec.RL.FiniteStochastic.bellmanOptimality mdp)^[k] values₁) ((Spec.RL.FiniteStochastic.bellmanOptimality mdp)^[k] values₂) ≤ mdp.discount ^ k * valueSupDist values₁ values₂

    bellmanOptimality iterates are geometric contractions in valueSupDist.

    theorem Proofs.RL.FiniteStochastic.bellmanOptimality_fixedPoint_unique {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (v w : Spec.RL.ValueFunction nStates) (hv : Spec.RL.FiniteStochastic.bellmanOptimality mdp v = v) (hw : Spec.RL.FiniteStochastic.bellmanOptimality mdp w = w) :
    v = w

    If a discounted Bellman optimality operator has a fixed point, it is unique.

    This is the “contraction has at most one fixed point” argument for T*.

    theorem Proofs.RL.FiniteStochastic.bellmanOptimality_iterate_error_to_fixedPoint {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteStochastic.MDP nStates nActions) (valid : Spec.RL.FiniteStochastic.Valid mdp) (v vStar : Spec.RL.ValueFunction nStates) (hvStar : Spec.RL.FiniteStochastic.bellmanOptimality mdp vStar = vStar) (k : ℕ) :
    valueSupDist ((Spec.RL.FiniteStochastic.bellmanOptimality mdp)^[k] v) vStar ≤ mdp.discount ^ k * valueSupDist v vStar

    Error bound to a fixed point: iterating Bellman optimality reduces sup-distance geometrically.