RL Trust Boundary (External Rollouts) #
TorchLean’s RL update rules (Bellman backups, returns/GAE, PPO objectives, etc.) are defined in Lean and can therefore be reasoned about and proved correct inside Lean. When you collect experience with an external environment (for example Python Gymnasium), you cross a trust boundary:
- we do not (and generally cannot) prove the external environment satisfies Markov / measurability / boundedness assumptions, and
- we do not prove the external runtime returns numerically well-behaved floating-point values.
This module implements a practical middle ground: a strict, explicit contract for externally provided rollouts and a checker that turns “assumptions” into “checked preconditions”.
The boundary emits Spec.RL.ObservedTransition as the validated output type. This
lets downstream training code share one common input type for both:
- Lean-native environments via
Spec.RL.rolloutFrom, and - external rollouts after passing this contract check.
References #
- Gymnasium API (
reset/step,terminatedvstruncated): https://gymnasium.farama.org/ - The original Gym API paper (background on the env interface): https://arxiv.org/abs/1606.01540
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Trust-boundary pattern used elsewhere in TorchLean (e.g. the Arb oracle):
NN.Floats.Arb.
Basic numeric checks #
We keep checks in Bool form and then use Except String wrappers for nice error messages.
This makes it easy to integrate with IO loaders (JSON, sockets, shared-memory, ...).
true iff a Float is neither NaN nor ±Inf.
Instances For
Apply a scalar boolean predicate to every entry of a tensor.
Instances For
true iff every entry of the tensor is finite.
Instances For
true iff every entry of the tensor lies in [lo, hi].
Instances For
Contract for external discrete-action rollouts #
This contract is intentionally narrow:
- it does not try to verify the environment dynamics,
- it checks only syntactic and numeric sanity properties that can be validated locally,
- it is designed to fail fast with a human-readable error.
Contract for validating externally provided Gym-style transitions.
- checkObsFinite : Bool
Check that observations contain no NaNs/Infs.
- checkRewardFinite : Bool
Check that rewards contain no NaNs/Infs.
Optional observation range check
[lo, hi].Optional reward range check
[lo, hi].- requireExclusiveDoneFlags : Bool
Optional strictness: reject steps with both
terminated=trueandtruncated=true.
Instances For
Instances For
Validated transition type for discrete-action rollouts.
Instances For
Gymnasium-style done flag: terminated || truncated.
Instances For
Checkers #
Internal check helpers #
These small routines are factored out so the exported checkers can stay readable and can assemble good error messages.
Check that every tensor entry is finite, producing an error tagged with field.
Instances For
Check that every tensor entry lies in [lo, hi], producing an error tagged with field.
Instances For
Validate one observation tensor against the observation-related part of a Contract.
This is mainly used to validate reset observations coming from external Gymnasium-like bridges.
Instances For
Validate one reward against the reward-related part of a Contract.
Instances For
Validate the Gymnasium-style done flags against a Contract.
Instances For
Validate a transition when the action is already range-checked (Fin nActions).
This avoids re-checking the action range at call sites that already work with Fin actions.
Instances For
Validate one observed transition against a Contract.
Instances For
Proposition-level contract #
The checkers above are executable and return Except String ... for good runtime error messages.
For formal reasoning we also want a Prop-level view that states exactly what was checked.
The main bridge lemma is in NN/Proofs/RL/Boundary.lean:
Proofs.RL.Boundary.contractHolds_of_checkTransitionFin_eq_ok turns a successful runtime check
into ContractHolds.
Proposition-level version of the observation checks performed by checkObservation.
This is intentionally “syntactic”: it only states the locally checkable numeric/shape sanity conditions, not any semantic assumptions about the environment dynamics.
Instances For
Proposition-level version of the reward checks performed by checkReward.
Instances For
Proposition-level version of the done-flag check performed by checkDoneFlags.
terminated and truncated follow Gymnasium’s API: terminated means the environment reached a
terminal state, while truncated indicates an external truncation such as a time-limit. See the
Gymnasium Env.step documentation:
Instances For
ContractHolds c t means the external transition t satisfies the Prop-level trust-boundary
contract induced by c.
- doneFlags : DoneFlagsHolds c t.terminated t.truncated
- reward : RewardHolds c t.reward
- observation : ObservationHolds c t.observation
- nextObservation : ObservationHolds c t.nextObservation
Instances For
Alias: Prop-level validity of a transition under c.