TorchLean API

NN.Proofs.RL.MDP

Finite-MDP Proofs #

This module proves the first foundational theorems for TorchLean's finite discounted MDP layer: a sup metric on finite value tables, basic Bellman backup identities, monotonicity of the policy and optimality Bellman operators for nonnegative discounts, and γ-contraction bounds for both operators.

The proofs are intentionally about deterministic finite MDPs first. That gives us a clean, trustworthy base to build on before introducing stochastic transitions or richer measure-theoretic machinery.


Sup Metric for Finite Value Tables #

The finite deterministic and finite stochastic developments intentionally use the same metric: the maximum absolute pointwise difference between two value tables.

noncomputable def Proofs.RL.MDP.valueSupDist {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) :
ℝ

Sup distance on finite value functions, using the maximum absolute pointwise difference.

Instances For
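A minimal sketch of how such a metric can be realized, assuming a value table behaves like a function `Fin nStates → ℝ`; TorchLean's actual `Spec.RL.ValueFunction` and `valueAt` may be defined differently, so the names below are illustrative only. The later example snippets assume the same Mathlib context.

import Mathlib

-- Illustrative stand-in for the sup metric: the maximum absolute pointwise
-- difference over all states, well-defined because `Fin nStates` is nonempty.
noncomputable def valueSupDistSketch {nStates : ℕ} [h : Fact (0 < nStates)]
    (values₁ values₂ : Fin nStates → ℝ) : ℝ :=
  haveI : Nonempty (Fin nStates) := ⟨⟨0, h.out⟩⟩
  Finset.univ.sup' Finset.univ_nonempty fun s => |values₁ s - values₂ s|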
    theorem Proofs.RL.MDP.valueSupDist_nonneg {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) :
    0 ≤ valueSupDist values₁ values₂

    The sup distance is nonnegative.

    theorem Proofs.RL.MDP.abs_sub_valueAt_le_valueSupDist {nStates : ℕ} [Fact (0 < nStates)] (values₁ values₂ : Spec.RL.ValueFunction nStates) (state : Fin nStates) :
    |Spec.RL.valueAt values₁ state - Spec.RL.valueAt values₂ state| ≤ valueSupDist values₁ values₂

    Every pointwise absolute difference is bounded by the sup distance.
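    This is the generic fact that a nonempty finite sup dominates each of its inputs; a minimal Mathlib illustration with an explicit nonemptiness witness, independent of the TorchLean definitions:

-- Each value of `f` is bounded by the sup of `f` over all of `Fin n`.
example {n : ℕ} (f : Fin n → ℝ) (a : Fin n) :
    f a ≤ Finset.univ.sup' ⟨a, Finset.mem_univ a⟩ f :=
  Finset.le_sup' f (Finset.mem_univ a)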

    theorem Proofs.RL.MDP.stateActionValue_eq {nStates nActions : ℕ} (mdp : Spec.RL.FiniteMDP nStates nActions) (values : Spec.RL.ValueFunction nStates) (state : Fin nStates) (action : Fin nActions) :
    Spec.RL.stateActionValue mdp values state action = have out := mdp.step state action; out.reward + mdp.discount * Spec.RL.continueMask out.terminated * Spec.RL.valueAt values out.state

    stateActionValue is exactly the Bellman backup on the chosen successor state.
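    For concreteness, here is a hypothetical standalone model of that backup, mirroring the statement above. `DetStepSketch`, `continueMaskSketch`, and `stateActionValueSketch` are illustrative stand-ins, not TorchLean's actual `mdp.step` result type or `continueMask` (which may, for example, use a real-valued terminated flag):

-- Hypothetical shape of a deterministic transition result.
structure DetStepSketch (nStates : ℕ) where
  reward : ℝ
  terminated : Bool
  state : Fin nStates

-- Hypothetical continue mask: 0 after termination, 1 otherwise.
def continueMaskSketch : Bool → ℝ
  | true => 0
  | false => 1

-- Deterministic Bellman backup on the single successor state.
def stateActionValueSketch {nStates : ℕ} (discount : ℝ)
    (out : DetStepSketch nStates) (values : Fin nStates → ℝ) : ℝ :=
  out.reward + discount * continueMaskSketch out.terminated * values out.state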

    theorem Proofs.RL.MDP.valueAt_bellmanPolicy {nStates nActions : ℕ} (mdp : Spec.RL.FiniteMDP nStates nActions) (policy : Spec.RL.Policy nStates nActions) (values : Spec.RL.ValueFunction nStates) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.bellmanPolicy mdp policy values) state = Spec.RL.stateActionValue mdp values state (policy state)

    Policy Bellman operators read back exactly the selected state-action value.

    continueMask is always nonnegative.
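    Under the hypothetical `continueMaskSketch` above (which only takes the values 0 and 1), the corresponding fact is immediate by case analysis on the terminated flag; the real `continueMask` may be defined differently:

-- Nonnegativity of the sketch mask, by cases on the flag.
example (t : Bool) : 0 ≤ continueMaskSketch t := by
  cases t <;> norm_num [continueMaskSketch]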

    theorem Proofs.RL.MDP.stateActionValue_monotone {nStates nActions : ℕ} (mdp : Spec.RL.FiniteMDP nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ : 0 ≤ mdp.discount) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) (action : Fin nActions) :
    Spec.RL.stateActionValue mdp values₁ state action ≤ Spec.RL.stateActionValue mdp values₂ state action

    A Bellman state-action value is monotone in the candidate value function when γ ≥ 0.
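    The scalar core of the monotonicity argument is simply that multiplying an inequality by a nonnegative discount preserves it; a minimal Mathlib illustration:

-- Multiplying both sides of `x ≤ y` by `0 ≤ γ` preserves the inequality.
example (γ x y : ℝ) (hγ : 0 ≤ γ) (hxy : x ≤ y) : γ * x ≤ γ * y :=
  mul_le_mul_of_nonneg_left hxy hγ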

    theorem Proofs.RL.MDP.bellmanPolicy_monotone {nStates nActions : ℕ} (mdp : Spec.RL.FiniteMDP nStates nActions) (policy : Spec.RL.Policy nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ : 0 ≤ mdp.discount) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.bellmanPolicy mdp policy values₁) state ≤ Spec.RL.valueAt (Spec.RL.bellmanPolicy mdp policy values₂) state

    Bellman policy operators are pointwise monotone for nonnegative discounts.

    theorem Proofs.RL.MDP.stateActionValue_le_bellmanOptimality {nStates nActions : ℕ} [Fact (0 < nActions)] (mdp : Spec.RL.FiniteMDP nStates nActions) (values : Spec.RL.ValueFunction nStates) (state : Fin nStates) (action : Fin nActions) :
    Spec.RL.stateActionValue mdp values state action ≤ Spec.RL.valueAt (Spec.RL.bellmanOptimality mdp values) state

    Bellman optimality dominates every particular action.

    theorem Proofs.RL.MDP.bellmanPolicy_le_bellmanOptimality {nStates nActions : ℕ} [Fact (0 < nActions)] (mdp : Spec.RL.FiniteMDP nStates nActions) (policy : Spec.RL.Policy nStates nActions) (values : Spec.RL.ValueFunction nStates) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.bellmanPolicy mdp policy values) state ≤ Spec.RL.valueAt (Spec.RL.bellmanOptimality mdp values) state

    Bellman optimality dominates Bellman evaluation under any deterministic policy.

    theorem Proofs.RL.MDP.bellmanOptimality_monotone {nStates nActions : ℕ} [Fact (0 < nActions)] (mdp : Spec.RL.FiniteMDP nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ : 0 ≤ mdp.discount) (hValues : ∀ (state : Fin nStates), Spec.RL.valueAt values₁ state ≤ Spec.RL.valueAt values₂ state) (state : Fin nStates) :
    Spec.RL.valueAt (Spec.RL.bellmanOptimality mdp values₁) state ≤ Spec.RL.valueAt (Spec.RL.bellmanOptimality mdp values₂) state

    Bellman optimality is pointwise monotone for nonnegative discounts.

    Contraction Guarantees #

    For 0 ≤ γ < 1, deterministic finite Bellman operators are contractions in valueSupDist. This is the same mathematical guarantee used by value iteration in the stochastic layer; the only difference is that the next state is a single successor instead of an expectation over a transition row.
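    The scalar estimate behind these bounds: if the continue mask lies in [0, 1] and the pointwise value gap is at most d, then the backed-up gap is at most γ · d (the rewards cancel). A self-contained Mathlib sketch of that step, with illustrative names:

-- If `m ∈ [0, 1]` and `|v₁ - v₂| ≤ d`, the discounted masked gap is at most `γ * d`.
example (γ m v₁ v₂ d : ℝ) (hγ : 0 ≤ γ) (hm0 : 0 ≤ m) (hm1 : m ≤ 1)
    (hv : |v₁ - v₂| ≤ d) : |γ * m * v₁ - γ * m * v₂| ≤ γ * d := by
  have key : γ * m * v₁ - γ * m * v₂ = γ * (m * (v₁ - v₂)) := by ring
  rw [key, abs_mul, abs_of_nonneg hγ]
  refine mul_le_mul_of_nonneg_left ?_ hγ
  calc |m * (v₁ - v₂)| = m * |v₁ - v₂| := by rw [abs_mul, abs_of_nonneg hm0]
    _ ≤ 1 * d := mul_le_mul hm1 hv (abs_nonneg _) zero_le_one
    _ = d := one_mul d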

    theorem Proofs.RL.MDP.stateActionValue_abs_sub_le {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteMDP nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ₀ : 0 ≤ mdp.discount) (state : Fin nStates) (action : Fin nActions) :
    |Spec.RL.stateActionValue mdp values₁ state action - Spec.RL.stateActionValue mdp values₂ state action| ≤ mdp.discount * valueSupDist values₁ values₂

    Deterministic state-action Bellman values are Lipschitz with constant γ.

    theorem Proofs.RL.MDP.bellmanPolicy_contraction {nStates nActions : ℕ} [Fact (0 < nStates)] (mdp : Spec.RL.FiniteMDP nStates nActions) (policy : Spec.RL.Policy nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ₀ : 0 ≤ mdp.discount) :
    valueSupDist (Spec.RL.bellmanPolicy mdp policy values₁) (Spec.RL.bellmanPolicy mdp policy values₂) ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman evaluation for a deterministic policy is a γ-contraction in the finite sup metric.
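    Passing from the pointwise Lipschitz bound to the sup metric only needs the fact that a nonempty finite sup is at most any common upper bound; assuming the sup metric is a `Finset.sup'` as in the sketch near the top of this page, the relevant Mathlib step is:

-- A nonempty finite sup is bounded by any pointwise upper bound.
example {n : ℕ} (h : (Finset.univ : Finset (Fin n)).Nonempty)
    (f : Fin n → ℝ) (c : ℝ) (hb : ∀ s, f s ≤ c) :
    Finset.univ.sup' h f ≤ c :=
  Finset.sup'_le h f fun s _ => hb s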

    theorem Proofs.RL.MDP.bellmanOptimality_abs_sub_le {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteMDP nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ₀ : 0 ≤ mdp.discount) (state : Fin nStates) :
    |Spec.RL.valueAt (Spec.RL.bellmanOptimality mdp values₁) state - Spec.RL.valueAt (Spec.RL.bellmanOptimality mdp values₂) state| ≤ mdp.discount * valueSupDist values₁ values₂

    At a fixed state, Bellman optimality is Lipschitz with constant γ.

    theorem Proofs.RL.MDP.bellmanOptimality_contraction {nStates nActions : ℕ} [Fact (0 < nStates)] [Fact (0 < nActions)] (mdp : Spec.RL.FiniteMDP nStates nActions) (values₁ values₂ : Spec.RL.ValueFunction nStates) (hγ₀ : 0 ≤ mdp.discount) :
    valueSupDist (Spec.RL.bellmanOptimality mdp values₁) (Spec.RL.bellmanOptimality mdp values₂) ≤ mdp.discount * valueSupDist values₁ values₂

    Bellman optimality is a γ-contraction in the finite sup metric.
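    These contraction bounds are what make value iteration converge: iterating a γ-contraction shrinks distances geometrically. A self-contained sketch over an abstract distance `d` and operator `T`; the theorems above supply the contraction hypothesis for `bellmanPolicy` and `bellmanOptimality` with `d = valueSupDist`:

-- n-fold iteration of a γ-contraction shrinks distances by a factor of γ ^ n.
example {V : Type} (d : V → V → ℝ) (T : V → V) (γ : ℝ) (hγ : 0 ≤ γ)
    (hT : ∀ v w, d (T v) (T w) ≤ γ * d v w) (v w : V) :
    ∀ n, d (T^[n] v) (T^[n] w) ≤ γ ^ n * d v w := by
  intro n
  induction n with
  | zero => simp
  | succ n ih =>
    calc d (T^[n + 1] v) (T^[n + 1] w)
        ≤ γ * d (T^[n] v) (T^[n] w) := by
          rw [Function.iterate_succ_apply' T n v, Function.iterate_succ_apply' T n w]
          exact hT _ _
      _ ≤ γ * (γ ^ n * d v w) := mul_le_mul_of_nonneg_left ih hγ
      _ = γ ^ (n + 1) * d v w := by ring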