GridWorld proof layer #
This module proves a small set of “environment well-formedness” facts for the Lean-native
GridWorld defined in `NN.Spec.RL.Envs.GridWorld`:
- the step function keeps coordinates in bounds,
- rewards are bounded (`reward ∈ [-1, 0]`),
- the induced one-hot `FiniteStochastic.MDP` view satisfies the row-stochastic axioms (`transition_nonneg` and `transition_sums_to_one`).
These lemmas are deliberately lightweight, but they are exactly what lets downstream RL developments state and prove properties without re-opening the environment definition each time.
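For example, combining the reward bound with a discount `γ ∈ [0, 1)` gives the standard geometric bound on any discounted return (a textbook fact, recorded here only as motivation, not as a theorem of this module):

$$-\frac{1}{1 - \gamma} \;\le\; \sum_{t=0}^{\infty} \gamma^t\, r_t \;\le\; 0 \qquad \text{whenever every } r_t \in [-1, 0].$$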
References:
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), gridworld-style examples and dynamic programming chapters: http://incompleteideas.net/book/the-book-2nd.html
Stepping a GridWorld keeps the successor coordinates in range.
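A minimal self-contained sketch of the in-bounds pattern (all names here are hypothetical; the actual `GridWorld` in `NN.Spec.RL.Envs.GridWorld` may represent positions differently): when positions carry their bound proofs, a step function can only produce in-range successors.

```lean
-- Hypothetical, dependency-free illustration of "in bounds by construction".
structure Pos (w h : Nat) where
  x  : Nat
  y  : Nat
  hx : x < w   -- the horizontal bound travels with the coordinate
  hy : y < h   -- likewise for the vertical bound

/-- Move right, staying put at the east wall. Both branches return a
`Pos w h`, so the successor is in range by construction. -/
def stepRight {w h : Nat} (p : Pos w h) : Pos w h :=
  if hx : p.x + 1 < w then ⟨p.x + 1, p.y, hx, p.hy⟩ else p
```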
GridWorld rewards are bounded below by -1.
GridWorld rewards are bounded above by 0.
Combined reward bound for convenience (`reward ∈ [-1, 0]`).
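A sketch of how such a bound is typically discharged. The reward function below is hypothetical (the real one lives in `NN.Spec.RL.Envs.GridWorld` and may be real-valued rather than `Int`); over a finitely-branching definition the combined bound is plain case analysis:

```lean
-- Hypothetical reward: -1 for bumping a wall, 0 otherwise.
def reward (hitWall : Bool) : Int :=
  if hitWall then -1 else 0

/-- Combined bound: `reward ∈ [-1, 0]` for every input. -/
theorem reward_mem (b : Bool) : -1 ≤ reward b ∧ reward b ≤ 0 := by
  cases b <;> decide
```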
Finite-stochastic MDP validity #
The `FiniteStochastic.MDP` view of GridWorld represents deterministic transitions as one-hot rows.
Here we show this satisfies the standard “row-stochastic” assumptions used throughout the
finite-state proof development.
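Concretely, write `next s a` for the deterministic successor (`next` is informal notation here, not an identifier from the module). The one-hot row is the indicator

$$P(s' \mid s, a) \;=\; \mathbf{1}[\, s' = \mathrm{next}(s, a) \,],$$

so each entry is 0 or 1 (nonnegativity), and each row sums to 1 because exactly one state satisfies the indicator.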
The one-hot `FiniteStochastic.MDP` view of GridWorld is valid assuming the discount is in `[0, 1)`.
Deterministic / one-hot equivalence #
The finite-stochastic GridWorld view is not a different environment: it packages the deterministic successor as a one-hot transition row. The next two theorems make that bridge explicit at the Bellman interface: taking an expectation against the one-hot row is the same as reading off the value of the deterministic successor state.
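In symbols, with the same informal `next` notation:

$$\mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V(s')\big] \;=\; \sum_{s'} \mathbf{1}[\, s' = \mathrm{next}(s, a) \,]\, V(s') \;=\; V(\mathrm{next}(s, a)).$$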
Expected next-state value in the one-hot finite-stochastic GridWorld view equals the value of the successor produced by the deterministic finite-MDP view.
The full Bellman state-action value agrees between GridWorld's deterministic finite-MDP view and its one-hot finite-stochastic view.
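Spelled out, with $r(s, a)$ the deterministic reward and $\gamma$ the discount, both views compute the same Bellman backup:

$$Q(s, a) \;=\; r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \;=\; r(s, a) + \gamma\, V(\mathrm{next}(s, a)),$$

where the second equality is exactly the one-hot expectation identity above.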