# Finite Stochastic Discounted MDPs
This module extends TorchLean's finite deterministic MDP layer with finite-state stochastic transitions.
We stay in a deliberately small setting:
- finitely many states and actions,
- real-valued rewards and discount factor,
- row-stochastic transition kernels represented as typed vectors.
This is enough to formalize the Bellman expectation and optimality operators in the standard discounted setting without immediately introducing full measure-theoretic probability.
References:
- Bellman, Dynamic Programming (Princeton University Press, 1957)
- Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming (Wiley, 1994)
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., MIT Press, 2018)
- TorchRL documentation for practical stochastic-environment / rollout APIs: https://pytorch.org/rl/
Naming note:
- The short names in this file live under `Spec.RL.FiniteStochastic`. Thus `MDP`, `Valid`, `actionValue`, and the Bellman operators mean the finite stochastic versions, not the deterministic tensor MDPs from `Spec.RL.MDP` or the Markov-kernel MDPs from `Spec.RL.MarkovMDP`.
- We use the file name `FiniteStochasticMDP.lean` to make the layer clear in imports, but keep the structure name `MDP` inside the namespace. The qualified name `Spec.RL.FiniteStochastic.MDP` is clearer at call sites than repeating the layer name twice.
Well-formedness assumptions for a finite stochastic MDP:
- `transition_nonneg (state : Fin nStates) (action : Fin nActions) (nextState : Fin nStates) : 0 ≤ (mdp.transitionProb state action).vecGet nextState`
  Transition probabilities are nonnegative.
- `transition_sums_to_one (state : Fin nStates) (action : Fin nActions) : ∑ nextState : Fin nStates, (mdp.transitionProb state action).vecGet nextState = 1`
  Each transition row sums to 1.
- The discount factor is nonnegative.
- The discount factor is strictly less than 1.
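As a rough illustration, here is a minimal self-contained Lean sketch of the shape of these definitions, with plain functions standing in for TorchLean's typed vectors and `vecGet`. The names `reward`, `discount`, `discount_nonneg`, and `discount_lt_one` are assumptions for illustration, not the file's actual API:

```lean
import Mathlib

/-- Sketch of the MDP carrier. `reward` and `discount` are assumed field
names; the real file stores rows as TorchLean typed vectors read via `vecGet`. -/
structure MDP (nStates nActions : Nat) where
  reward : Fin nStates → Fin nActions → ℝ
  transitionProb : Fin nStates → Fin nActions → Fin nStates → ℝ
  discount : ℝ

/-- Well-formedness: rows are probability distributions and the discount
factor lies in `[0, 1)`. -/
structure Valid {nStates nActions : Nat} (mdp : MDP nStates nActions) : Prop where
  transition_nonneg : ∀ s a s', 0 ≤ mdp.transitionProb s a s'
  transition_sums_to_one : ∀ s a, ∑ s' : Fin nStates, mdp.transitionProb s a s' = 1
  discount_nonneg : 0 ≤ mdp.discount    -- assumed field name
  discount_lt_one : mdp.discount < 1    -- assumed field name
```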
Expected next-state value under `P(· | s, a)`.
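Under the same simplified representation, this is presumably a finite sum over next states:

```lean
/-- Sketch: E[v(s') | s, a] = ∑ s', P(s' | s, a) * v s'. -/
def expectedNextValue {nStates nActions : Nat} (mdp : MDP nStates nActions)
    (v : Fin nStates → ℝ) (s : Fin nStates) (a : Fin nActions) : ℝ :=
  ∑ s' : Fin nStates, mdp.transitionProb s a s' * v s'
```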
Bellman-style state-action value induced by a candidate value function.
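Continuing the sketch, this is the usual one-step lookahead; the `reward` field name is again an assumption:

```lean
/-- Sketch: Q_v(s, a) = r(s, a) + γ * E[v(s') | s, a]. -/
def actionValue {nStates nActions : Nat} (mdp : MDP nStates nActions)
    (v : Fin nStates → ℝ) (s : Fin nStates) (a : Fin nActions) : ℝ :=
  mdp.reward s a + mdp.discount * expectedNextValue mdp v s a
```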
All state-action values Q_v(s, ·) for a fixed state and candidate value function.
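In the sketch this is just `actionValue` with the action left abstract:

```lean
/-- Sketch: the vector of action values Q_v(s, ·) for a fixed state. -/
def actionValues {nStates nActions : Nat} (mdp : MDP nStates nActions)
    (v : Fin nStates → ℝ) (s : Fin nStates) : Fin nActions → ℝ :=
  fun a => actionValue mdp v s a
```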
Bellman expectation operator for a deterministic policy.
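For a deterministic policy, the expectation operator evaluates Q_v at the policy's chosen action; the operator name below is illustrative:

```lean
/-- Sketch: (T^π v) s = Q_v(s, π s) for a deterministic policy π. -/
def bellmanExpectation {nStates nActions : Nat} (mdp : MDP nStates nActions)
    (policy : Fin nStates → Fin nActions) (v : Fin nStates → ℝ) : Fin nStates → ℝ :=
  fun s => actionValue mdp v s (policy s)
```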
Bellman optimality operator for a finite stochastic MDP.
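In the sketch, the optimality operator takes a maximum over the (nonempty) action set; the positivity hypothesis and the `Finset.sup'` encoding are one possible choice, not necessarily the file's:

```lean
/-- Sketch: (T* v) s = max over a of Q_v(s, a); `hA` makes the max well-defined. -/
def bellmanOptimality {nStates nActions : Nat} (hA : 0 < nActions)
    (mdp : MDP nStates nActions) (v : Fin nStates → ℝ) : Fin nStates → ℝ :=
  fun s => Finset.univ.sup' ⟨⟨0, hA⟩, Finset.mem_univ _⟩ (actionValue mdp v s)
```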