TorchLean API


NN.Spec.RL.Envs.GridWorld

GridWorld (Lean-native finite RL environment)

This file defines a small deterministic GridWorld environment in TorchLean’s spec layer, along with three induced views (an environment view and two MDP views):

a Spec.RL.Env view with explicit latent state,
a Spec.RL.FiniteMDP view (deterministic finite-state discounted MDP),
and a Spec.RL.FiniteStochastic.MDP view in which transitions are represented as one-hot row-stochastic kernels.

This is intended as a Lean-native testbed for RL algorithm development and proofs: small enough to be pleasant to reason about, but shaped like the objects used in standard RL theory.

References (high-level context only):

Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3 (gridworld examples, discounted returns, Bellman operators).
Puterman, Markov Decision Processes (1994), Chapters 6–7 (finite discounted MDPs).
Gymnasium and TorchRL are useful API reference points for the environment/rollout shape: https://gymnasium.farama.org/ and https://pytorch.org/rl/

State and Action Types

We use a coordinate state (row, col) rather than a flattened index. This keeps the environment’s definition close to the textbook picture and makes properties such as “stays in bounds” easy to state directly.

The FiniteMDP / FiniteStochastic.MDP views flatten (row, col) to Fin (height * width) using mathlib’s canonical equivalence Fin height × Fin width ≃ Fin (height * width).
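Because both coordinates are `Fin` values, the in-bounds invariant is carried by the type itself. A minimal illustration in plain Lean 4 (no TorchLean or mathlib imports); the product type below is a stand-in for `GridPos`, which is assumed here to unfold to `Fin height × Fin width`:

```lean
-- Stand-in for `GridPos width height`, assumed to be `Fin height × Fin width`.
-- Every value of this type is in bounds; the proof is just the `Fin` invariants.
example (height width : Nat) (p : Fin height × Fin width) :
    p.1.val < height ∧ p.2.val < width :=
  ⟨p.1.isLt, p.2.isLt⟩
```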

@[reducible, inline]

abbrev Spec.RL.Envs.GridPos (width height : ℕ) :

Type

A grid position (row, col) in a height × width grid.


@[reducible, inline]

abbrev Spec.RL.Envs.GridAction :

Type

Discrete actions for a 4-neighborhood grid: 0=up, 1=down, 2=left, 3=right.


def Spec.RL.Envs.GridAction.up :

GridAction

Move up.


def Spec.RL.Envs.GridAction.down :

GridAction

Move down.


def Spec.RL.Envs.GridAction.left :

GridAction

Move left.


def Spec.RL.Envs.GridAction.right :

GridAction

Move right.


Environment Dynamics

Dynamics are deterministic and border-clamped: attempting to move outside the grid leaves the coordinate unchanged.

Reward / termination scheme:

If the agent is already at the goal, the environment remains terminal (terminated = true) and yields reward 0.
Otherwise, a step yields reward 0 if the successor state is the goal and -1 otherwise; the terminated flag is true exactly when the successor is the goal.
truncated is always false (no time-limit semantics in this environment).
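To make the reward and termination scheme concrete, here is a minimal standalone sketch in plain Lean 4. It assumes the action encoding stated for GridAction (0 = up, 1 = down, 2 = left, 3 = right), uses `Nat` pairs and `Int` rewards instead of `Fin` coordinates and `ℝ`, and introduces the hypothetical names `SketchPos`, `SketchStep`, `sketchMove`, and `sketchStep`; it is not TorchLean’s `step` or `StepResult`.

```lean
-- Hypothetical standalone sketch of the step semantics; names are illustrative.
-- Positions are plain (row, col) pairs of Nats; rewards are Ints.

abbrev SketchPos := Nat × Nat

structure SketchStep where
  next       : SketchPos
  reward     : Int
  terminated : Bool
  truncated  : Bool
  deriving Repr

/-- Border-clamped move on a `height × width` grid.
    Action encoding: 0 = up, 1 = down, 2 = left, 3 = right. -/
def sketchMove (width height : Nat) (p : SketchPos) (a : Fin 4) : SketchPos :=
  match a.val with
  | 0 => (p.1 - 1, p.2)                       -- Nat subtraction stops at row 0
  | 1 => (min (p.1 + 1) (height - 1), p.2)    -- clamp at row height - 1
  | 2 => (p.1, p.2 - 1)                       -- stops at column 0
  | _ => (p.1, min (p.2 + 1) (width - 1))     -- clamp at column width - 1

/-- Reward / termination scheme described in the list above. -/
def sketchStep (width height : Nat) (goal p : SketchPos) (a : Fin 4) : SketchStep :=
  if p = goal then
    -- Already at the goal: stay terminal with reward 0
    -- (remaining on the goal cell is an assumption of this sketch).
    { next := p, reward := 0, terminated := true, truncated := false }
  else
    let q := sketchMove width height p a
    { next := q,
      reward := if q = goal then 0 else -1,
      terminated := decide (q = goal),
      truncated := false }   -- no time-limit semantics

-- 3 × 3 grid with the goal in the bottom-right corner (row 2, col 2).
#eval sketchStep 3 3 (2, 2) (2, 1) 3  -- moves right onto the goal: reward 0, terminated
#eval sketchStep 3 3 (2, 2) (0, 0) 0  -- bumps the top border: stays at (0, 0), reward -1
```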

structure Spec.RL.Envs.GridWorld (width height : ℕ) :

Type

A small deterministic GridWorld with a start cell, goal cell, and discount factor.

start : GridPos width height
Initial state returned by reset.
goal : GridPos width height
Terminal goal cell.
discount : ℝ
Discount factor γ used by induced MDP views.


@[reducible, inline]

abbrev Spec.RL.Envs.GridWorld.State (width height : ℕ) :

Type

GridWorld latent state type (row/col coordinate).


@[reducible, inline]

abbrev Spec.RL.Envs.GridWorld.Action :

Type

GridWorld action type (4-neighborhood moves).


def Spec.RL.Envs.GridWorld.rowUp {height : ℕ} (row : Fin height) :

Fin height

The next row when moving one step up (saturating at 0).


def Spec.RL.Envs.GridWorld.rowDown {height : ℕ} (row : Fin height) :

Fin height

The next row when moving one step down (clamped at height-1).


def Spec.RL.Envs.GridWorld.colLeft {width : ℕ} (col : Fin width) :

Fin width

The next column when moving one step left (saturating at 0).


def Spec.RL.Envs.GridWorld.colRight {width : ℕ} (col : Fin width) :

Fin width

The next column when moving one step right (clamped at width-1).
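The four clamping helpers can be mimicked in plain Lean 4 as follows. This is a hypothetical sketch: the `sketch*` names are illustrative and TorchLean’s actual definitions may be written differently. It uses only core `Fin`/`Nat` lemmas, so no imports are needed.

```lean
-- Hypothetical sketch; names are illustrative, not TorchLean's.

/-- Up: `row - 1`, saturating at 0 because `Nat` subtraction truncates. -/
def sketchRowUp {height : Nat} (row : Fin height) : Fin height :=
  ⟨row.val - 1, Nat.lt_of_le_of_lt (Nat.sub_le row.val 1) row.isLt⟩

/-- Down: `row + 1` if that is still a valid row, otherwise unchanged. -/
def sketchRowDown {height : Nat} (row : Fin height) : Fin height :=
  if h : row.val + 1 < height then ⟨row.val + 1, h⟩ else row

/-- Left: same shape as `sketchRowUp`, on columns. -/
def sketchColLeft {width : Nat} (col : Fin width) : Fin width :=
  ⟨col.val - 1, Nat.lt_of_le_of_lt (Nat.sub_le col.val 1) col.isLt⟩

/-- Right: same shape as `sketchRowDown`, on columns. -/
def sketchColRight {width : Nat} (col : Fin width) : Fin width :=
  if h : col.val + 1 < width then ⟨col.val + 1, h⟩ else col

#eval (sketchRowUp (0 : Fin 3)).val    -- 0 (clamped at the top border)
#eval (sketchRowDown (2 : Fin 3)).val  -- 2 (clamped at the bottom border)
#eval (sketchRowDown (0 : Fin 3)).val  -- 1
```

Returning `Fin height` / `Fin width` means each helper is in bounds by construction, so downstream code needs no separate bounds proof.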


def Spec.RL.Envs.GridWorld.nextState {width height : ℕ} (state : State width height) (action : Action) :

State width height

Deterministic successor state (border-clamped).


def Spec.RL.Envs.GridWorld.step {width height : ℕ} (gw : GridWorld width height) (state : State width height) (action : Action) :

StepResult (State width height) ℝ

One deterministic step, returning the successor state, the reward, and the termination and truncation flags.


def Spec.RL.Envs.GridWorld.toEnv {width height : ℕ} (gw : GridWorld width height) :

Env (State width height) Action (State width height) ℝ

Spec.RL.Env view of GridWorld with observations equal to latent states.


Finite-state MDP Views

To connect GridWorld to TorchLean’s finite discounted MDP layer we flatten the coordinate state:

Fin height × Fin width ≃ Fin (height * width).

This is the standard row-major encoding: cell (row, col) maps to index row * width + col.
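The row-major arithmetic behind `encode` and `decode` can be checked on plain `Nat`s. This sketch assumes the index convention row * width + col; the helper names `sketchEncode` and `sketchDecode` are hypothetical and work on unbounded `Nat` rather than `Fin (height * width)`.

```lean
-- Hypothetical sketch of row-major flattening on plain Nats.
-- The real file works on `Fin`; these helpers only illustrate the arithmetic.

def sketchEncode (width row col : Nat) : Nat := row * width + col

def sketchDecode (width i : Nat) : Nat × Nat := (i / width, i % width)

-- 3 rows × 4 columns: cell (row 1, col 2) ↦ 1 * 4 + 2 = 6
#eval sketchEncode 4 1 2   -- 6
#eval sketchDecode 4 6     -- (1, 2)

-- Round trip over every cell of a 3 × 4 grid.
#eval (List.range 3).all fun r => (List.range 4).all fun c =>
  sketchDecode 4 (sketchEncode 4 r c) == (r, c)   -- true
```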

def Spec.RL.Envs.GridWorld.encode {width height : ℕ} (pos : State width height) :

Fin (height * width)

Flatten a (row, col) grid coordinate into Fin (height * width).


def Spec.RL.Envs.GridWorld.decode {width height : ℕ} (i : Fin (height * width)) :

State width height

Unflatten a Fin (height * width) state index into a (row, col) grid coordinate.


def Spec.RL.Envs.GridWorld.toFiniteMDP {width height : ℕ} (gw : GridWorld width height) :

FiniteMDP ℝ (height * width) 4

Deterministic finite-state discounted MDP view of GridWorld.
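As a worked example of what this view encodes, the optimal values have a closed form, under stated assumptions: the FiniteMDP uses the same transitions and rewards as `step`, the goal g is absorbing with reward 0, γ is `gw.discount` with 0 ≤ γ < 1, and d(s) is the Manhattan distance from s to g. Since the grid has no obstacles and each action changes one coordinate by at most one, an optimal policy follows a shortest path of d(s) moves: the first d(s) - 1 moves land on non-goal cells (reward -1 each) and the last lands on the goal (reward 0).

```latex
d(s) = |s_{\mathrm{row}} - g_{\mathrm{row}}| + |s_{\mathrm{col}} - g_{\mathrm{col}}|

V^*(g) = 0, \qquad
V^*(s) = -\sum_{t=0}^{d(s)-2} \gamma^{t}
       = -\frac{1 - \gamma^{\,d(s)-1}}{1 - \gamma}
\quad \text{for } s \neq g .
```

For d(s) = 1 the sum is empty, so V*(s) = 0; for γ = 0.9 and d(s) = 3 it gives V*(s) = -(1 + 0.9) = -1.9.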


def Spec.RL.Envs.GridWorld.oneHot {width height : ℕ} (next : Fin (height * width)) :

Tensor ℝ (Shape.dim (height * width) Shape.scalar)

One-hot transition row for a deterministic successor: probability 1 at index next and 0 at every other state.


def Spec.RL.Envs.GridWorld.toFiniteStochasticMDP {width height : ℕ} (gw : GridWorld width height) :

FiniteStochastic.MDP (height * width) 4

Finite-stochastic MDP view of GridWorld in which P(· | s, a) is a one-hot row.
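A one-hot transition row can be sketched with a plain `List Float` standing in for `Tensor ℝ (Shape.dim (height * width) Shape.scalar)`. The name `oneHotRowSketch` is hypothetical, and indices are assumed to be the flattened row-major state indices.

```lean
-- Hypothetical sketch: for a deterministic successor `next`, the row
-- P(· | s, a) puts probability 1 on `next` and 0 on every other state.
def oneHotRowSketch (n next : Nat) : List Float :=
  (List.range n).map fun j => if j = next then 1.0 else 0.0

-- Row for a 2 × 2 grid (4 flattened states) whose successor index is 2.
#eval oneHotRowSketch 4 2

-- The row is a probability distribution: its entries sum to 1 when next < n.
#eval (oneHotRowSketch 4 2).foldl (· + ·) (0.0 : Float) == 1.0   -- true
```

Stacking one such row per (s, a) pair yields a row-stochastic kernel, which is why the deterministic dynamics fit the FiniteStochastic.MDP interface without change.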
