GridWorld proof layer #
This module proves a small set of “environment well-formedness” facts for the Lean-native
GridWorld defined in `NN.Spec.RL.Envs.GridWorld`:
- the step function keeps coordinates in bounds,
- rewards are bounded (`reward ∈ [-1, 0]`),
- the induced one-hot `FiniteStochastic.MDP` view satisfies the row-stochastic axioms (`transition_nonneg` and `transition_sums_to_one`).
These lemmas are deliberately lightweight, but they are exactly what lets downstream RL developments state and prove properties without re-opening the environment definition each time.
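For example, combining the reward bound with a discount `γ ∈ [0, 1)` gives the standard geometric bound on any discounted return (a textbook fact, recorded here only as motivation, not as a theorem of this module):

$$-\frac{1}{1 - \gamma} \;\le\; \sum_{t=0}^{\infty} \gamma^t\, r_t \;\le\; 0 \qquad \text{whenever every } r_t \in [-1, 0].$$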
References:
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), gridworld-style examples and dynamic programming chapters: http://incompleteideas.net/book/the-book-2nd.html
Stepping a GridWorld keeps the successor coordinates in range.
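A minimal self-contained sketch of the in-bounds pattern (all names here are hypothetical; the actual `GridWorld` in `NN.Spec.RL.Envs.GridWorld` may represent positions differently): when positions carry their bound proofs, a step function can only produce in-range successors.

```lean
-- Hypothetical, dependency-free illustration of "in bounds by construction".
structure Pos (w h : Nat) where
  x  : Nat
  y  : Nat
  hx : x < w   -- the horizontal bound travels with the coordinate
  hy : y < h   -- likewise for the vertical bound

/-- Move right, staying put at the east wall. Both branches return a
`Pos w h`, so the successor is in range by construction. -/
def stepRight {w h : Nat} (p : Pos w h) : Pos w h :=
  if hx : p.x + 1 < w then ⟨p.x + 1, p.y, hx, p.hy⟩ else p
```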
GridWorld rewards are bounded below by -1.
GridWorld rewards are bounded above by 0.
Combined reward bound for convenience (`reward ∈ [-1, 0]`).
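A sketch of how such a bound is typically discharged. The reward function below is hypothetical (the real one lives in `NN.Spec.RL.Envs.GridWorld` and may be real-valued rather than `Int`); over a finitely-branching definition the combined bound is plain case analysis:

```lean
-- Hypothetical reward: -1 for bumping a wall, 0 otherwise.
def reward (hitWall : Bool) : Int :=
  if hitWall then -1 else 0

/-- Combined bound: `reward ∈ [-1, 0]` for every input. -/
theorem reward_mem (b : Bool) : -1 ≤ reward b ∧ reward b ≤ 0 := by
  cases b <;> decide
```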
Finite-stochastic MDP validity #
The `FiniteStochastic.MDP` view of GridWorld represents deterministic transitions as one-hot rows.
Here we show this satisfies the standard “row-stochastic” assumptions used throughout the
finite-state proof development.
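Concretely, write `next s a` for the deterministic successor (`next` is informal notation here, not an identifier from the module). The one-hot row is the indicator

$$P(s' \mid s, a) \;=\; \mathbf{1}[\, s' = \mathrm{next}(s, a) \,],$$

so each entry is 0 or 1 (nonnegativity), and each row sums to 1 because exactly one state satisfies the indicator.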
The one-hot `FiniteStochastic.MDP` view of GridWorld is valid assuming the discount is in `[0, 1)`.
Deterministic / one-hot equivalence #
The finite-stochastic GridWorld view is not a different environment: it packages the deterministic successor as a one-hot transition row. The next two theorems make that bridge explicit at the Bellman interface: taking an expectation against the one-hot row is the same as reading off the value of the deterministic successor state.
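In symbols, with the same informal `next` notation:

$$\mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V(s')\big] \;=\; \sum_{s'} \mathbf{1}[\, s' = \mathrm{next}(s, a) \,]\, V(s') \;=\; V(\mathrm{next}(s, a)).$$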
Expected next-state value in the one-hot finite-stochastic GridWorld view equals the value of the successor produced by the deterministic finite-MDP view.
The full Bellman state-action value agrees between GridWorld's deterministic finite-MDP view and its one-hot finite-stochastic view.
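Spelled out, with $r(s, a)$ the deterministic reward and $\gamma$ the discount, both views compute the same Bellman backup:

$$Q(s, a) \;=\; r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \;=\; r(s, a) + \gamma\, V(\mathrm{next}(s, a)),$$

where the second equality is exactly the one-hot expectation identity above.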