TorchLean API

NN.Runtime.RL.Algorithms.Tabular

Tabular Reinforcement Learning #

This module implements typed, total update rules for classic finite-state / finite-action RL:

- TD(0) updates for state-value tables
- SARSA and Expected SARSA targets and updates
- Q-learning targets and updates
- Double Q-learning targets and updates

The updates operate on shape-indexed vectors / Q-tables, so they fit naturally into the rest of TorchLean's typed tensor surface.
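As a hedged illustration of the shape-indexed style (the `QTable` abbreviation is local to this sketch, and the result type of `actionRow` is assumed to be the row over actions):

```lean
-- Sketch only: `QTable` is a local abbreviation, not part of the module.
abbrev QTable (α : Type) (nStates nActions : ℕ) : Type :=
  Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))

-- Extracting Q[1, :] from a 4-state, 2-action table; a state index outside
-- `Fin 4` would be rejected at elaboration time rather than at run time.
example (q : QTable Float 4 2) : Spec.Tensor Float (Spec.Shape.dim 2 Spec.Shape.scalar) :=
  Runtime.RL.Tabular.actionRow q (1 : Fin 4)
```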

def Runtime.RL.Tabular.actionRow {α : Type} {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) :
    Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)

Extract the action-value row Q[s, :].

Instances For
def Runtime.RL.Tabular.maxActionValue {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) :
    α

Max action value at a state, defaulting to 0 for empty action spaces.

Instances For
def Runtime.RL.Tabular.greedyAction? {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) :
    Option (Fin nActions)

Greedy action at a state, if the action space is nonempty.

Instances For
def Runtime.RL.Tabular.expectedActionValue {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) (policy : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) :
    α

Expected action value at a state under an explicit policy over actions.

Instances For
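In standard notation, `expectedActionValue` is the expectation of the Q-row under the supplied policy (the sum form below is the usual definition, stated here as an assumption about the implementation):

```latex
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[Q(s, a)\bigr] \;=\; \sum_{a} \pi(a \mid s)\, Q(s, a)
```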
def Runtime.RL.Tabular.td0Update {α : Type} [Context α] {nStates : ℕ} (values : Spec.Tensor α (Spec.Shape.dim nStates Spec.Shape.scalar)) (state nextState : Fin nStates) (reward gamma stepSize : α) (done : Bool := false) :
    Spec.Tensor α (Spec.Shape.dim nStates Spec.Shape.scalar)

One TD(0) update for a state-value table.

Instances For
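In standard notation, one TD(0) step is as follows (a sketch of the usual convention, treating `done` as zeroing the bootstrap term, consistent with the terminal handling of the targets below):

```latex
V(s) \;\leftarrow\; V(s) + \alpha \bigl( r + \gamma\, (1 - \mathbf{1}[\mathrm{done}])\, V(s') - V(s) \bigr)
```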
def Runtime.RL.Tabular.sarsaTarget {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (nextState : Fin nStates) (nextAction : Fin nActions) (reward gamma : α) (done : Bool := false) :
    α

SARSA target r + γ Q(s', a').

Instances For
def Runtime.RL.Tabular.expectedSarsaTarget {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (nextState : Fin nStates) (nextPolicy : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (reward gamma : α) (done : Bool := false) :
    α

Expected SARSA target r + γ E_{a' ~ π(·|s')}[Q(s', a')].

Instances For
def Runtime.RL.Tabular.qLearningTarget {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (nextState : Fin nStates) (reward gamma : α) (done : Bool := false) :
    α

Q-learning target r + γ max_a Q(s', a).

Instances For
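The three bootstrapped targets above differ only in how the next state is scored (these are the standard definitions; the `(1 − 1[done])` factor expresses the usual convention of dropping the bootstrap term on terminal transitions):

```latex
\begin{aligned}
y_{\text{SARSA}}    &= r + \gamma\,(1 - \mathbf{1}[\mathrm{done}])\; Q(s', a') \\
y_{\text{Expected}} &= r + \gamma\,(1 - \mathbf{1}[\mathrm{done}])\; \textstyle\sum_{a'} \pi(a' \mid s')\, Q(s', a') \\
y_{\text{Q-learn}}  &= r + \gamma\,(1 - \mathbf{1}[\mathrm{done}])\; \max_{a'} Q(s', a')
\end{aligned}
```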
def Runtime.RL.Tabular.doubleQTarget {α : Type} [Context α] {nStates nActions : ℕ} (selector evaluator : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (nextState : Fin nStates) (reward gamma : α) (done : Bool := false) :
    α

Double Q-learning / Double DQN-style target: choose the greedy action under selector, evaluate it under evaluator.

Instances For
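Decoupling action selection from evaluation in this way reduces the maximization bias of plain Q-learning; in standard notation:

```latex
y_{\text{double}} = r + \gamma\,(1 - \mathbf{1}[\mathrm{done}])\;
  Q_{\text{eval}}\bigl(s',\; \arg\max_{a} Q_{\text{sel}}(s', a)\bigr)
```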
def Runtime.RL.Tabular.sarsaUpdate {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) (action : Fin nActions) (reward : α) (nextState : Fin nStates) (nextAction : Fin nActions) (gamma stepSize : α) (done : Bool := false) :
    Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))

SARSA update on a Q-table, returned functionally rather than mutated in place.

Instances For
def Runtime.RL.Tabular.expectedSarsaUpdate {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) (action : Fin nActions) (reward : α) (nextState : Fin nStates) (nextPolicy : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (gamma stepSize : α) (done : Bool := false) :
    Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))

Expected SARSA update on a Q-table.

Instances For
def Runtime.RL.Tabular.qLearningUpdate {α : Type} [Context α] {nStates nActions : ℕ} (q : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) (action : Fin nActions) (reward : α) (nextState : Fin nStates) (gamma stepSize : α) (done : Bool := false) :
    Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))

Q-learning update on a Q-table.

Instances For
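A hedged usage sketch: folding `qLearningUpdate` over a list of transitions. The `Transition` structure, the hyperparameter values, and the assumption that the update returns the new table (and that a `Context Float` instance is available) are all illustrative, not part of this module.

```lean
-- Illustrative only: `Transition` is a local structure, not part of TorchLean.
structure Transition (nStates nActions : ℕ) where
  state     : Fin nStates
  action    : Fin nActions
  reward    : Float
  nextState : Fin nStates
  done      : Bool

-- Apply one Q-learning update per transition, threading the Q-table through.
def qLearnBatch {nStates nActions : ℕ} [Context Float]
    (q : Spec.Tensor Float (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar)))
    (batch : List (Transition nStates nActions)) :
    Spec.Tensor Float (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar)) :=
  batch.foldl
    (fun acc t =>
      Runtime.RL.Tabular.qLearningUpdate acc t.state t.action t.reward t.nextState
        (gamma := 0.99) (stepSize := 0.1) (done := t.done))
    q
```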
def Runtime.RL.Tabular.doubleQUpdateLeft {α : Type} [Context α] {nStates nActions : ℕ} (qLeft qRight : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) (action : Fin nActions) (reward : α) (nextState : Fin nStates) (gamma stepSize : α) (done : Bool := false) :
    Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))

Update the left table in Double Q-learning.

Instances For
def Runtime.RL.Tabular.doubleQUpdateRight {α : Type} [Context α] {nStates nActions : ℕ} (qLeft qRight : Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))) (state : Fin nStates) (action : Fin nActions) (reward : α) (nextState : Fin nStates) (gamma stepSize : α) (done : Bool := false) :
    Spec.Tensor α (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar))

Update the right table in Double Q-learning.

Instances For
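In the standard Double Q-learning scheme, exactly one of the two tables is updated per step, chosen uniformly at random. A hedged sketch of that driver loop, assuming both updates return the new table; the coin flip is passed in as a `Bool`, and the pair return type is illustrative:

```lean
-- Illustrative only: update one table per step, as in Double Q-learning.
-- `flip` stands in for a uniform coin flip supplied by the caller.
def doubleQStep {nStates nActions : ℕ} [Context Float]
    (qLeft qRight : Spec.Tensor Float (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar)))
    (state : Fin nStates) (action : Fin nActions) (reward : Float)
    (nextState : Fin nStates) (gamma stepSize : Float) (done flip : Bool) :
    Spec.Tensor Float (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar)) ×
      Spec.Tensor Float (Spec.Shape.dim nStates (Spec.Shape.dim nActions Spec.Shape.scalar)) :=
  if flip then
    (Runtime.RL.Tabular.doubleQUpdateLeft qLeft qRight state action reward nextState
       gamma stepSize done, qRight)
  else
    (qLeft, Runtime.RL.Tabular.doubleQUpdateRight qLeft qRight state action reward nextState
       gamma stepSize done)
```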