TorchLean API

NN.Spec.RL.Environment

Reinforcement-Learning Environments #

This module provides a small, proof-friendly environment interface inspired by Gym/Gymnasium.

Unlike a typical Python RL environment, this interface is purely functional and keeps the hidden state explicit. That makes it much easier to state and prove safety / invariant properties.

structure Spec.RL.StepResult (State : Type u) (Reward : Type v) :
Type (max u v)

Result of stepping an environment from one state with one action.

  • state : State

    Next latent state.

  • reward : Reward

    Immediate reward.

  • terminated : Bool

    Task-defined terminal flag.

  • truncated : Bool

    Time-limit / truncation flag.

Instances For
def Spec.RL.StepResult.done {State : Type u} {Reward : Type z} (r : StepResult State Reward) :
Bool

Gymnasium-style done flag: an episode is done if it is terminated or truncated.

Instances For
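As a sketch (not the verified source), the structure and its done flag would look like this in Lean, with `done` presumably the disjunction of the two flags:

```lean
namespace Spec.RL

/-- Sketch of `StepResult`, mirroring the fields documented above. -/
structure StepResult (State : Type u) (Reward : Type v) : Type (max u v) where
  state      : State
  reward     : Reward
  terminated : Bool
  truncated  : Bool

/-- Gymnasium-style done flag: terminated or truncated (sketch). -/
def StepResult.done {State : Type u} {Reward : Type v}
    (r : StepResult State Reward) : Bool :=
  r.terminated || r.truncated

end Spec.RL
```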
structure Spec.RL.ObservedTransition (Observation : Type u) (Action : Type v) (Reward : Type w) :
Type (max (max u v) w)

Rollout record that stores observations on both sides of a step.

  • observation : Observation

    Observation before the action.

  • action : Action

    Action taken.

  • reward : Reward

    Reward returned by the environment.

  • nextObservation : Observation

    Observation after the step.

  • terminated : Bool

    Task-defined terminal flag.

  • truncated : Bool

    Time-limit / truncation flag.

Instances For
structure Spec.RL.Env (State : Type u) (Action : Type v) (Observation : Type w) (Reward : Type z) :
Type (max (max (max u v) w) z)

Pure Gym-style environment with explicit latent state.

  • initialState : State

    Initial latent state used by reset.

  • observe : State → Observation

    Observation function from latent states.

  • step : State → Action → StepResult State Reward

    Single-step transition function.

Instances For
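As an illustration, a bounded counter fits this interface: the latent state is a `Nat`, observations expose it directly, and each step adds the action to the counter. The environment below is hypothetical, not part of the library:

```lean
open Spec.RL in
/-- Hypothetical example: a counter that terminates once it reaches 10. -/
def counterEnv : Env Nat Nat Nat Nat where
  initialState := 0
  observe := fun s => s
  step := fun s a =>
    { state      := s + a
      reward     := a
      terminated := decide (s + a ≥ 10)
      truncated  := false }
```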
def Spec.RL.reset {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) :
Observation × State

Gym-style reset: return the initial observation and latent state.

Instances For
def Spec.RL.stepGym {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) (state : State) (action : Action) :
Observation × Reward × Bool × Bool × State

Gym-style step result: (nextObservation, reward, terminated, truncated, nextState).

Instances For
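The two wrappers above presumably just thread `observe` through the latent state; a sketch consistent with the documented signatures:

```lean
variable {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z}

/-- Sketch: pair the initial observation with the initial latent state. -/
def reset (env : Env State Action Observation Reward) : Observation × State :=
  (env.observe env.initialState, env.initialState)

/-- Sketch: run one step, then flatten the result into the Gym tuple shape. -/
def stepGym (env : Env State Action Observation Reward) (state : State)
    (action : Action) : Observation × Reward × Bool × Bool × State :=
  let r := env.step state action
  (env.observe r.state, r.reward, r.terminated, r.truncated, r.state)
```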
def Spec.RL.evolveFrom {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) :
State → List Action → State

Final latent state reached after executing a list of actions.

Instances For
def Spec.RL.evolve {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) (actions : List Action) :
State

Final latent state reached from the environment's initial state.

Instances For
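`evolveFrom` is naturally a left fold of the step function over the action list, discarding rewards and flags; a sketch:

```lean
variable {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z}

/-- Sketch: fold the step function over the actions, keeping only the state. -/
def evolveFrom (env : Env State Action Observation Reward)
    (state : State) (actions : List Action) : State :=
  actions.foldl (fun s a => (env.step s a).state) state

/-- Sketch: `evolve` just starts the fold at `initialState`. -/
def evolve (env : Env State Action Observation Reward)
    (actions : List Action) : State :=
  evolveFrom env env.initialState actions
```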
def Spec.RL.statesFrom {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) :
State → List Action → List State

State trace that records the initial state and every successor state.

Instances For
def Spec.RL.states {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) (actions : List Action) :
List State

State trace from the environment's initial state.

Instances For
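Because the trace records the start state as well, it always has one more entry than the action list; a recursive sketch:

```lean
variable {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z}

/-- Sketch: keep every intermediate state, start included, so the result
has `actions.length + 1` entries. -/
def statesFrom (env : Env State Action Observation Reward) :
    State → List Action → List State
  | s, []      => [s]
  | s, a :: as => s :: statesFrom env (env.step s a).state as
```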
def Spec.RL.rolloutFrom {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) :
State → List Action → List (ObservedTransition Observation Action Reward)

Observed transition rollout from an explicit initial state.

Instances For
def Spec.RL.rollout {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : Env State Action Observation Reward) (actions : List Action) :
List (ObservedTransition Observation Action Reward)

Observed transition rollout from the environment's initial state.

Instances For
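Each rollout entry packages both the pre- and post-step observations; a recursive sketch consistent with the `ObservedTransition` fields documented above:

```lean
variable {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z}

/-- Sketch: thread the latent state, emitting one `ObservedTransition`
per action taken. -/
def rolloutFrom (env : Env State Action Observation Reward) :
    State → List Action → List (ObservedTransition Observation Action Reward)
  | _, []      => []
  | s, a :: as =>
    let r := env.step s a
    { observation     := env.observe s
      action          := a
      reward          := r.reward
      nextObservation := env.observe r.state
      terminated      := r.terminated
      truncated       := r.truncated } :: rolloutFrom env r.state as
```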
structure Spec.RL.SafeEnv (State : Type u) (Action : Type v) (Observation : Type w) (Reward : Type z) :
Type (max (max (max u v) w) z)

Environment with an invariant and an action-validity predicate.

  • toEnv : Env State Action Observation Reward

    Underlying environment dynamics.

  • Invariant : State → Prop

    State invariant we want preserved along valid executions.

  • ActionOk : State → Action → Prop

    Legality / admissibility of actions at a state.

  • init_safe : self.Invariant self.toEnv.initialState

    Reset starts in a safe state.

  • step_safe {state : State} {action : Action} : self.Invariant state → self.ActionOk state action → self.Invariant (self.toEnv.step state action).state

    One valid step preserves the invariant.

Instances For
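As a hypothetical instance (not part of the library), consider a counter that must never exceed 100. Actions are valid exactly when they keep the bound, so `step_safe` is immediate:

```lean
open Spec.RL in
/-- Hypothetical example: a counter with invariant `s ≤ 100`. -/
def boundedCounter : SafeEnv Nat Nat Nat Nat where
  toEnv :=
    { initialState := 0
      observe := fun s => s
      step := fun s a =>
        { state := s + a, reward := a, terminated := false, truncated := false } }
  Invariant := fun s => s ≤ 100
  ActionOk  := fun s a => s + a ≤ 100
  init_safe := Nat.zero_le 100
  -- The validity hypothesis `s + a ≤ 100` is definitionally the goal.
  step_safe := fun _ hOk => hOk
```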
def Spec.RL.SafeEnv.actionPathOk {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : SafeEnv State Action Observation Reward) :
State → List Action → Prop

Path validity for a list of actions from a given starting state.

Instances For
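`actionPathOk` presumably requires `ActionOk` at every state along the trace; a recursive sketch:

```lean
variable {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z}

/-- Sketch: each action must be valid at the state where it is taken. -/
def actionPathOk (env : SafeEnv State Action Observation Reward) :
    State → List Action → Prop
  | _, []      => True
  | s, a :: as => env.ActionOk s a ∧ actionPathOk env (env.toEnv.step s a).state as
```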
theorem Spec.RL.SafeEnv.evolveFrom_safe {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : SafeEnv State Action Observation Reward) (state : State) (actions : List Action) :
env.Invariant state → env.actionPathOk state actions → env.Invariant (evolveFrom env.toEnv state actions)

If a path is action-valid and starts from an invariant state, then its final state is invariant.

This is the basic safety theorem users want from SafeEnv: once we specify the one-step invariant, valid rollouts inherit it automatically.
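A proof sketch, assuming `actionPathOk` is a pointwise conjunction of `ActionOk` along the trace: induct on the action list, chaining `step_safe` at each step.

```lean
theorem evolveFrom_safe (env : SafeEnv State Action Observation Reward)
    (state : State) (actions : List Action) :
    env.Invariant state → env.actionPathOk state actions →
    env.Invariant (evolveFrom env.toEnv state actions) := by
  induction actions generalizing state with
  | nil =>
    -- No actions: the final state is the start state.
    exact fun h _ => h
  | cons a as ih =>
    intro hInv hPath
    -- `step_safe` carries the invariant across one valid step,
    -- then the induction hypothesis covers the tail.
    exact ih _ (env.step_safe hInv hPath.1) hPath.2
```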

theorem Spec.RL.SafeEnv.evolve_safe {State : Type u} {Action : Type v} {Observation : Type w} {Reward : Type z} (env : SafeEnv State Action Observation Reward) (actions : List Action) (hPath : env.actionPathOk env.toEnv.initialState actions) :
env.Invariant (evolve env.toEnv actions)

Any valid action path from reset remains safe at the final state.