GridWorld (Lean-native finite RL environment) #
This file defines a small deterministic GridWorld environment in TorchLean’s spec layer, along with two induced “MDP views”:
- a
Spec.RL.Envview with explicit latent state, - a
Spec.RL.FiniteMDPview (deterministic finite-state discounted MDP), - and a
Spec.RL.FiniteStochastic.MDPview where transitions are represented as one-hot row-stochastic kernels.
This is intended as a Lean-native testbed for RL algorithm development and proofs: small enough to keep the transition proof small, but shaped like the objects used in standard RL theory.
References (high-level context only):
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3 (gridworld examples, discounted returns, Bellman operators).
- Puterman, Markov Decision Processes (1994), Chapters 6–7 (finite discounted MDPs).
- Gymnasium and TorchRL are useful API reference points for the environment/rollout shape: https://gymnasium.farama.org/ and https://pytorch.org/rl/
State and Action Types #
We use a coordinate state (row, col) rather than a flattened index. The environment definition
matches the textbook picture, and “stays in bounds” properties can be stated directly.
The FiniteMDP / FiniteStochastic.MDP views flatten (row, col) to Fin (height * width) using
mathlib’s canonical equivalence Fin height × Fin width ≃ Fin (height * width).
A grid position (row, col) in a height × width grid.
Instances For
Discrete actions for a 4-neighborhood grid: 0=up, 1=down, 2=left, 3=right.
Instances For
Grid action that decreases the row coordinate.
Instances For
Grid action that increases the row coordinate.
Instances For
Grid action that decreases the column coordinate.
Instances For
Grid action that increases the column coordinate.
Instances For
Environment Dynamics #
Dynamics are deterministic and border-clamped:
- attempting to move outside the grid keeps the coordinate unchanged.
Reward / termination scheme:
- If the agent is already at the goal, the environment remains terminal (
terminated = true) and yields reward0. - Otherwise, a step yields reward
0iff the successor state is the goal, and reward-1otherwise. Theterminatedflag matches “successor is goal”. truncatedis alwaysfalse(no time-limit semantics in this environment).
GridWorld latent state type (row/col coordinate).
Instances For
GridWorld action type (4-neighborhood moves).
Instances For
The next row when moving one step up (saturating at 0).
Instances For
The next row when moving one step down (clamped at height-1).
Instances For
The next column when moving one step left (saturating at 0).
Instances For
The next column when moving one step right (clamped at width-1).
Instances For
One deterministic step, returning reward and termination flags.
Instances For
Spec.RL.Env view of GridWorld with observations equal to latent states.
Instances For
Finite-state MDP Views #
To connect GridWorld to TorchLean’s finite discounted MDP layer we flatten the coordinate state:
Fin height × Fin width ≃ Fin (height * width).
This is the standard row-major encoding used throughout mathlib.
One-hot transition kernel for a deterministic next state.
Instances For
Finite-stochastic MDP view of GridWorld where P(. | s,a) is a one-hot row.