TorchLean API

NN.Proofs.RL.Envs.GridWorld

GridWorld proof layer #

This module proves a small set of “environment well-formedness” facts for the Lean-native GridWorld defined in NN.Spec.RL.Envs.GridWorld.

These are deliberately lightweight, but they are the kind of lemmas that let downstream RL algorithms state and prove properties without re-opening the environment definition each time.

theorem Proofs.RL.Envs.GridWorld.step_state_in_bounds {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height) (state : Spec.RL.Envs.GridWorld.State width height) (action : Spec.RL.Envs.GridWorld.Action) :
(gw.step state action).state.1 < height ∧ (gw.step state action).state.2 < width

Stepping a GridWorld keeps the successor coordinates in range.

theorem Proofs.RL.Envs.GridWorld.step_reward_ge_neg_one {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height) (state : Spec.RL.Envs.GridWorld.State width height) (action : Spec.RL.Envs.GridWorld.Action) :
-1 ≤ (gw.step state action).reward

GridWorld rewards are bounded below by -1.

theorem Proofs.RL.Envs.GridWorld.step_reward_le_zero {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height) (state : Spec.RL.Envs.GridWorld.State width height) (action : Spec.RL.Envs.GridWorld.Action) :
(gw.step state action).reward ≤ 0

GridWorld rewards are bounded above by 0.

theorem Proofs.RL.Envs.GridWorld.step_reward_bounds {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height) (state : Spec.RL.Envs.GridWorld.State width height) (action : Spec.RL.Envs.GridWorld.Action) :
-1 ≤ (gw.step state action).reward ∧ (gw.step state action).reward ≤ 0

Combined reward bound for convenience (reward ∈ [-1, 0]).
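As a sketch of how a downstream proof might consume the combined bound without unfolding the environment, consider deriving an absolute-value bound. This is illustrative only: the `example` assumes the reward is a real number and that Mathlib's `abs_le` lemma applies; it is not part of the actual development.

```lean
-- Illustrative only: derive |reward| ≤ 1 from `step_reward_bounds`.
-- Assumes the reward lives in ℝ so that `abs_le` applies.
example {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height)
    (state : Spec.RL.Envs.GridWorld.State width height)
    (action : Spec.RL.Envs.GridWorld.Action) :
    |(gw.step state action).reward| ≤ 1 := by
  have h := Proofs.RL.Envs.GridWorld.step_reward_bounds gw state action
  -- lower bound is h.1; upper bound weakens reward ≤ 0 to reward ≤ 1
  exact abs_le.mpr ⟨h.1, h.2.trans (by norm_num)⟩
```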

Finite-stochastic MDP validity #

The FiniteStochastic.MDP view of GridWorld represents deterministic transitions as one-hot rows. Here we show this satisfies the standard “row-stochastic” assumptions used throughout the finite-state proof development.
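The “one-hot row” idea can be modeled in isolation as follows. The names below (`oneHotRow`) are illustrative stand-ins, not the module's actual definitions, and the proof scripts are sketches that rely on Mathlib's big-operator simp lemmas.

```lean
-- Illustrative model: the deterministic successor s' receives
-- probability 1; every other state receives 0.
def oneHotRow (n : ℕ) (s' : Fin n) : Fin n → ℝ :=
  fun t => if t = s' then 1 else 0

-- Entries are nonnegative…
example (n : ℕ) (s' t : Fin n) : 0 ≤ oneHotRow n s' t := by
  simp only [oneHotRow]; split <;> norm_num

-- …and each row sums to 1, i.e. the row is stochastic.
example (n : ℕ) (s' : Fin n) : ∑ t, oneHotRow n s' t = 1 := by
  simp [oneHotRow]
```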

The one-hot FiniteStochastic.MDP view of GridWorld is valid assuming the discount is in [0,1).

Deterministic / One-Hot Equivalence #

The finite-stochastic GridWorld view is not a different environment. It packages the deterministic successor as a one-hot transition row. The next theorem makes that bridge explicit at the Bellman interface: taking an expectation against that one-hot row is the same as reading the value of the deterministic successor state.
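The bridge can be stated generically: an expectation taken against a one-hot row simply reads off the value at the successor. The self-contained `example` below is a sketch of that identity (the names `succ` and `values` are illustrative), with the proof left to Mathlib's simp set.

```lean
-- Illustrative, standalone form of the bridge: expectation of `values`
-- against a one-hot row at `succ` collapses to `values succ`.
example (n : ℕ) (succ : Fin n) (values : Fin n → ℝ) :
    ∑ t, (if t = succ then (1 : ℝ) else 0) * values t = values succ := by
  simp  -- sketch: `ite_mul`/`Finset.sum_ite_eq'` collapse the sum
```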

theorem Proofs.RL.Envs.GridWorld.toFiniteStochasticMDP_expectedNextValue_eq_toFiniteMDP_successor {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height) (values : Spec.RL.ValueFunction (height * width)) (state : Fin (height * width)) (action : Fin 4) :

Expected next-state value in the one-hot finite-stochastic GridWorld view equals the value of the successor produced by the deterministic finite-MDP view.

theorem Proofs.RL.Envs.GridWorld.toFiniteStochasticMDP_actionValue_eq_toFiniteMDP_stateActionValue {width height : ℕ} (gw : Spec.RL.Envs.GridWorld width height) (values : Spec.RL.ValueFunction (height * width)) (state : Fin (height * width)) (action : Fin 4) :

The full Bellman state-action value agrees between GridWorld's deterministic finite-MDP view and its one-hot finite-stochastic view.