GridWorld (Lean-native finite RL environment) #
This file defines a small deterministic GridWorld environment in TorchLean’s spec layer, along with an environment view and two induced “MDP views”:
- a Spec.RL.Env view with explicit latent state,
- a Spec.RL.FiniteMDP view (a deterministic finite-state discounted MDP),
- and a Spec.RL.FiniteStochastic.MDP view where transitions are represented as one-hot row-stochastic kernels.
This is intended as a Lean-native testbed for RL algorithm development and proofs: small enough to be pleasant to reason about, but shaped like the objects used in standard RL theory.
References (high-level context only):
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3 (gridworld examples, discounted returns, Bellman operators).
- Puterman, Markov Decision Processes (1994), Chapters 6–7 (finite discounted MDPs).
- Gymnasium and TorchRL are useful API reference points for the environment/rollout shape: https://gymnasium.farama.org/ and https://pytorch.org/rl/
State and Action Types #
We use a coordinate state (row, col) rather than a flattened index. This keeps the
environment’s definition close to the textbook picture and makes “stays in bounds” properties
directly expressible.
The FiniteMDP / FiniteStochastic.MDP views flatten (row, col) to Fin (height * width) using
mathlib’s canonical equivalence Fin height × Fin width ≃ Fin (height * width).
A grid position (row, col) in a height × width grid.
Discrete actions for a 4-neighborhood grid: 0=up, 1=down, 2=left, 3=right.
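As a rough Lean sketch of the shapes described above (the names here are illustrative, not the file’s actual definitions), the coordinate state and the 4-neighborhood action type might look like:

```lean
-- Sketch (hypothetical names): coordinate state and 4-neighborhood actions.
structure GridPos (height width : Nat) where
  row : Fin height
  col : Fin width

-- Discrete actions; the 0/1/2/3 encoding matches up/down/left/right.
inductive GridAction where
  | up | down | left | right

def GridAction.toFin : GridAction → Fin 4
  | .up => 0
  | .down => 1
  | .left => 2
  | .right => 3
```

Using Fin height and Fin width for the coordinates makes “stays in bounds” a typing fact rather than a side condition.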
Environment Dynamics #
Dynamics are deterministic and border-clamped:
- attempting to move outside the grid keeps the coordinate unchanged.
Reward / termination scheme:
- If the agent is already at the goal, the environment remains terminal (terminated = true) and yields reward 0.
- Otherwise, a step yields reward 0 if the successor state is the goal and reward -1 otherwise. The terminated flag matches “successor is goal”.
- truncated is always false (no time-limit semantics in this environment).
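The reward/termination scheme can be sketched as a small pure function (hypothetical name; the environment’s actual step also produces the successor state):

```lean
-- Sketch: reward and terminated flag as a function of goal membership.
-- `atGoal` = current state is the goal; `nextIsGoal` = successor is the goal.
def rewardAndDone (atGoal nextIsGoal : Bool) : Int × Bool :=
  if atGoal then (0, true)           -- absorbing goal: stay terminal, reward 0
  else if nextIsGoal then (0, true)  -- reaching the goal yields reward 0
  else (-1, false)                   -- every other step costs -1
```

The truncated flag is omitted here since it is constantly false.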
GridWorld latent state type (row/col coordinate).
GridWorld action type (4-neighborhood moves).
The next row when moving one step up (saturating at 0).
The next row when moving one step down (clamped at height-1).
The next column when moving one step left (saturating at 0).
The next column when moving one step right (clamped at width-1).
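A minimal sketch of the border-clamped coordinate updates on Fin (hypothetical names; the up/left cases rely on Nat subtraction saturating at 0, the down/right cases on an explicit bounds check):

```lean
-- Saturating move up: Nat subtraction already truncates at 0.
def rowUp {h : Nat} (r : Fin h) : Fin h :=
  ⟨r.val - 1, Nat.lt_of_le_of_lt (Nat.sub_le _ _) r.isLt⟩

-- Clamped move down: stay on the last row, otherwise add one.
def rowDown {h : Nat} (r : Fin h) : Fin h :=
  if hlt : r.val + 1 < h then ⟨r.val + 1, hlt⟩ else r
```

The column updates for left/right are the same constructions with width in place of height.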
One deterministic step, returning reward and termination flags.
Spec.RL.Env view of GridWorld with observations equal to latent states.
Finite-state MDP Views #
To connect GridWorld to TorchLean’s finite discounted MDP layer, we flatten the coordinate state:
Fin height × Fin width ≃ Fin (height * width).
This is the standard row-major encoding used throughout mathlib.
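Concretely, the flattening can go through mathlib’s finProdFinEquiv (the import path shown is indicative and may differ across mathlib versions):

```lean
import Mathlib.Logic.Equiv.Fin.Basic

-- Sketch: flatten a (row, col) coordinate to a single row-major index.
def flatten {h w : Nat} (p : Fin h × Fin w) : Fin (h * w) :=
  finProdFinEquiv p

-- And back, via the inverse direction of the equivalence.
def unflatten {h w : Nat} (i : Fin (h * w)) : Fin h × Fin w :=
  finProdFinEquiv.symm i
```

Because this is an equivalence rather than a pair of ad hoc encode/decode functions, round-tripping lemmas come for free from Equiv.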
One-hot transition kernel for a deterministic next state.
Finite-stochastic MDP view of GridWorld where P(· | s, a) is a one-hot row.
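The one-hot kernel for a deterministic successor can be sketched as follows (hypothetical name; ℚ is used as the value type for concreteness, the file may use a different carrier):

```lean
-- Sketch: the row P(· | s, a) as a one-hot function over successor states.
-- Exactly one entry is 1 and the rest are 0, so each row sums to 1 and the
-- kernel is row-stochastic by construction.
def oneHotRow {n : Nat} (next : Fin n) : Fin n → ℚ :=
  fun s' => if s' = next then 1 else 0
```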