Public RL Facade
This module exposes TorchLean's reinforcement-learning helper surface under the public
NN.API.rl.* namespace.
Design intent:
- keep the public API smaller and easier to browse than the full runtime namespace,
- mirror the existing NN.API.* facade pattern,
- expose typed RL math while keeping environment/trainer integration separate.
References (background and terminology):
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.): http://incompleteideas.net/book/the-book-2nd.html
- Puterman, Markov Decision Processes (finite discounted MDPs): https://doi.org/10.1002/9780470316887
- Gymnasium API docs (reset/step, terminated vs truncated): https://gymnasium.farama.org/
Differentiable policy-gradient losses over TorchLean backend references.
The pure exports above are algebra over concrete spec tensors. These helpers are the training-time counterpart: they build scalar losses from backend refs, so the same formulas can run through eager or compiled autograd.
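To make "scalar loss" concrete, here is a minimal Python sketch of two common policy-gradient surrogates evaluated on plain floats: the REINFORCE-style loss -mean(log_prob * advantage) and the PPO clipped surrogate. The function names, argument layout, and the default clip_eps are assumptions chosen for this illustration, not TorchLean's API. In TorchLean the same arithmetic would be assembled from backend refs so that autograd can differentiate it.

```python
import math


def policy_gradient_loss(log_probs, advantages):
    """REINFORCE-style surrogate: -mean(log_prob * advantage)."""
    if not log_probs or len(log_probs) != len(advantages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(lp * adv for lp, adv in zip(log_probs, advantages))
    return -total / len(log_probs)


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: -mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A))."""
    n = len(advantages)
    if n == 0 or len(new_log_probs) != n or len(old_log_probs) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / n
```

Both functions reduce a batch to a single number, which is the shape an optimizer step needs regardless of whether the graph is run eagerly or compiled.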
Training Logs (Widgets and Examples)
TorchLean does not aim to be a full “trainer framework”, but many executable examples want to:
- evaluate a scalar metric every N updates,
- append it to a curve, and
- write a small JSON file for widgets (#train_log_file_view).
This namespace re-exports the small, stable log types and JSON IO helpers.
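The three steps listed above can be sketched in a few lines of Python. This is only an illustration of the pattern: the function name run_with_log, the metric name, and the JSON layout ("metric" plus a list of "points") are hypothetical and are not the schema that #train_log_file_view reads.

```python
import json
import os
import tempfile


def run_with_log(num_updates, eval_every, evaluate, path):
    """Evaluate a scalar metric every `eval_every` updates and write the curve as JSON."""
    curve = []
    for step in range(1, num_updates + 1):
        # One training update would happen here.
        if step % eval_every == 0:
            curve.append({"step": step, "value": float(evaluate(step))})
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"metric": "mean_return", "points": curve}, f)
    return curve


# Usage: log a toy metric every 5 updates, then read the file back.
log_path = os.path.join(tempfile.mkdtemp(), "train_log.json")
curve = run_with_log(10, 5, lambda step: 0.1 * step, log_path)
with open(log_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

Keeping the log as a flat list of (step, value) points makes it cheap to append during training and easy for a viewer to plot.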
Casting to Other Scalar Backends
The trust-boundary checker (Runtime.RL.Boundary) validates rollouts in terms of host Float
because that is what our lightweight JSON interchange format uses.
Most RL math in TorchLean is scalar-polymorphic ([Context α]), so it is often convenient to
cast a validated Float rollout into the chosen runtime scalar backend:
- Float (fast host execution),
- IEEE32Exec (executable bit-level float32),
- any other backend that supports Runtime.ofFloat.
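The idea of casting a validated rollout element-wise can be sketched in Python as follows. Everything here is an assumption made for illustration: the Transition fields, the helper names, and the use of struct to round a float64 to float32 as a stand-in for a float32 backend. The of_float argument plays the role that Runtime.ofFloat plays in the text; passing the built-in float corresponds to staying on the host Float backend.

```python
import struct
from dataclasses import dataclass
from typing import Callable, List


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


@dataclass
class Transition:
    # Hypothetical field layout, for illustration only.
    obs: List[float]
    action: int
    reward: float
    next_obs: List[float]
    terminated: bool
    truncated: bool


def cast_transition(of_float: Callable[[float], float], t: Transition) -> Transition:
    """Cast every scalar in a transition; discrete fields and flags are copied unchanged."""
    return Transition(
        obs=[of_float(v) for v in t.obs],
        action=t.action,
        reward=of_float(t.reward),
        next_obs=[of_float(v) for v in t.next_obs],
        terminated=t.terminated,
        truncated=t.truncated,
    )


def cast_rollout(of_float: Callable[[float], float], rollout: List[Transition]) -> List[Transition]:
    """Cast a whole rollout by casting each transition in order."""
    return [cast_transition(of_float, t) for t in rollout]


rollout = [Transition([0.1, 0.5], 1, 0.1, [0.25, 0.1], False, True)]
rollout_f32 = cast_rollout(to_float32, rollout)
```

Note that the cast can change values: 0.5 is exactly representable in float32, while 0.1 is rounded to a nearby float32 value.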
Cast a Float observation tensor into a runtime scalar backend α.
Cast a validated Float transition into a runtime scalar backend α.
Cast a whole rollout (array of transitions) into a runtime scalar backend α.
Load a rollout JSON file, validate it with the boundary contract, then cast to scalar α.
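The load, validate, cast order matters: values are checked while they are still host floats, and only then converted. A rough Python sketch of that pipeline is shown below. The JSON field names and the specific checks (finite numbers, fixed observation length, boolean flags) are assumptions for illustration and do not describe the actual contract enforced by Runtime.RL.Boundary. For brevity the sketch parses a JSON string rather than opening a file.

```python
import json
import math


def _is_finite_number(v):
    """True for finite ints/floats; bools are rejected even though they are ints in Python."""
    return isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)


def validate_transition(t, obs_dim):
    """Raise ValueError if a decoded transition violates the (hypothetical) contract."""
    for key in ("obs", "next_obs"):
        vec = t[key]
        if len(vec) != obs_dim:
            raise ValueError(f"{key} has length {len(vec)}, expected {obs_dim}")
        if not all(_is_finite_number(v) for v in vec):
            raise ValueError(f"{key} contains a non-finite or non-numeric entry")
    if not _is_finite_number(t["reward"]):
        raise ValueError("reward must be a finite number")
    for flag in ("terminated", "truncated"):
        if not isinstance(t[flag], bool):
            raise ValueError(f"{flag} must be a boolean")


def load_rollout(text, obs_dim, of_float=float):
    """Parse JSON text, validate each transition, then cast its scalars with `of_float`."""
    out = []
    for t in json.loads(text):
        validate_transition(t, obs_dim)
        out.append({
            "obs": [of_float(v) for v in t["obs"]],
            "action": t["action"],
            "reward": of_float(t["reward"]),
            "next_obs": [of_float(v) for v in t["next_obs"]],
            "terminated": t["terminated"],
            "truncated": t["truncated"],
        })
    return out
```

One reason to validate before casting: Python's json module accepts the non-standard literal NaN by default, so a malformed file can decode without error and must be rejected explicitly.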
Split a concatenated actor-critic parameter pack into (actorParams, criticParams).
PPO examples often bundle actor and critic parameters as actor.params ++ critic.params to update
them with a single optimizer step (ppoActorCriticScalarModuleDef). When we want to run just the
actor for evaluation or action selection, we need to recover the actor slice.
This helper keeps example code from reaching into the long proved TList.splitAppend path.
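Conceptually, the split is the inverse of the ++ that built the pack: given the number of actor parameters, the first slice is the actor and the remainder is the critic. The Python sketch below shows only that idea with plain lists and a runtime length check. The function name and the string placeholders for parameters are hypothetical; the TorchLean helper described above works on typed parameter lists via TList.splitAppend instead.

```python
def split_actor_critic(params, num_actor_params):
    """Split a pack built as actor_params + critic_params back into its two halves."""
    if not 0 <= num_actor_params <= len(params):
        raise ValueError("num_actor_params is out of range for this parameter pack")
    return params[:num_actor_params], params[num_actor_params:]


# Usage: bundle for a single optimizer step, then recover the actor slice for evaluation.
actor_params = ["actor.w1", "actor.b1"]
critic_params = ["critic.w1", "critic.b1", "critic.w2"]
packed = actor_params + critic_params

actor_slice, critic_slice = split_actor_critic(packed, len(actor_params))
```

The round trip holds by construction: splitting at len(actor_params) returns exactly the two lists that were concatenated.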