# Bandit Algorithms
This module implements a small set of classic discrete-action bandit algorithms:
- greedy / epsilon-greedy action selection,
- UCB1-style confidence bonuses,
- incremental sample-average value estimation,
- gradient bandits with a softmax policy over preferences.
Primary references:
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., bandit chapter): http://incompleteideas.net/book/the-book-2nd.html
- Auer, Cesa-Bianchi, and Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (2002): https://doi.org/10.1023/A:1013689704352
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992): https://doi.org/10.1023/A:1022672621406
Value-estimation state for finite-armed bandits.
- counts : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)
Per-action pull counts.
- values : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)
Per-action estimated values.
Preference / policy-gradient state for gradient bandits.
- steps : α
Number of observed rewards so far (tracked as the ambient scalar type).
- preferences : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)
Preference logits over actions.
- averageReward : α
Running average reward baseline.
Zero-initialized action-value state.
Zero-initialized preference state.
Greedy action under the current estimates, if the action space is nonempty.
Epsilon-greedy action selection with explicit exploration draw and fallback action.
The caller supplies:
- epsilon: the exploration probability,
- draw: a pre-sampled uniform value in [0, 1),
- exploreAction: the action to use when the exploration branch is taken.
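A minimal Python sketch of this selection rule (illustrative only; the names and list-based representation are assumptions, and the module itself exposes this through its own tensor types, not Python):

```python
def epsilon_greedy(values, epsilon, draw, explore_action):
    """Return explore_action when the pre-sampled uniform draw falls below
    epsilon; otherwise return the greedy arm (first maximum on ties)."""
    if draw < epsilon:
        return explore_action
    # Greedy branch: index of the largest estimated value.
    return max(range(len(values)), key=lambda a: values[a])

print(epsilon_greedy([0.1, 0.9, 0.5], 0.1, 0.05, 2))  # 2: exploration branch
print(epsilon_greedy([0.1, 0.9, 0.5], 0.1, 0.42, 2))  # 1: greedy branch
```

Passing the uniform draw and the exploration action in explicitly, as described above, keeps the selection function itself deterministic and free of any sampling dependency.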
Incremental sample-average update for one bandit arm.
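The incremental form keeps each estimate equal to the running mean of that arm's rewards without storing the reward history: Q_n = Q_{n-1} + (R - Q_{n-1}) / n. A Python sketch under those assumptions (plain lists stand in for the module's per-action tensors):

```python
def update_arm(counts, values, action, reward):
    """Incremental sample-average update for one arm, in place.
    After n pulls, values[action] equals the mean of the n rewards seen."""
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

counts, values = [0, 0], [0.0, 0.0]
for r in (1.0, 3.0, 5.0):
    update_arm(counts, values, 0, r)
print(counts[0], values[0])  # 3 pulls, mean 3.0
```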
Total number of pulls recorded in a ValueState.
UCB1-style exploration bonus.
We use max(pulls, epsilon) in the denominator so the helper stays total while still giving
very large bonuses to unseen or nearly-unseen actions.
Per-action UCB1 scores.
Best action under UCB1 scores, if the action space is nonempty.
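The three UCB helpers above can be sketched together in Python. The exploration constant `c` and the epsilon floor value are assumptions chosen for illustration; the key point is the max(pulls, epsilon) denominator, which keeps the bonus total while sending it very high for unseen arms:

```python
import math

def ucb_bonus(total_pulls, arm_pulls, c=2.0, eps=1e-9):
    """UCB1-style bonus. max(arm_pulls, eps) keeps the helper total and
    makes unseen or nearly-unseen arms dominate the scores."""
    t = max(total_pulls, 1)
    return c * math.sqrt(math.log(t) / max(arm_pulls, eps))

def ucb_action(counts, values, c=2.0):
    """Best action under per-arm score = estimate + exploration bonus."""
    total = sum(counts)
    scores = [v + ucb_bonus(total, n, c) for v, n in zip(values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# The never-pulled arm 2 receives a huge bonus and is selected.
print(ucb_action([10, 1, 0], [0.5, 0.4, 0.0]))  # 2
```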
Softmax policy used by the gradient-bandit algorithm.
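A short Python sketch of the softmax over preference logits, using the standard max-subtraction trick for numerical stability (the function name is hypothetical):

```python
import math

def softmax(preferences):
    """Softmax policy over preference logits, stabilized by subtracting
    the maximum logit before exponentiating."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([1.0, 1.0, 1.0]))  # uniform: each probability is 1/3
```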
Gradient-bandit preference update with an optional average-reward baseline.
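The update described here follows the standard gradient-bandit rule from Sutton and Barto's bandit chapter: H[a] += alpha * (R - baseline) * (1{a = A} - pi(a)), with the running average reward as the baseline. A hedged Python sketch, with a dict standing in for the steps/preferences/averageReward state fields (all names are illustrative, not this module's API):

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_update(state, action, reward, alpha, use_baseline=True):
    """One gradient-bandit step with an optional average-reward baseline.
    The baseline is updated first, so it includes the current reward and
    the very first increment is exactly zero."""
    state["steps"] += 1
    state["average_reward"] += (reward - state["average_reward"]) / state["steps"]
    baseline = state["average_reward"] if use_baseline else 0.0
    pi = softmax(state["preferences"])
    for a, p in enumerate(pi):
        indicator = 1.0 if a == action else 0.0
        state["preferences"][a] += alpha * (reward - baseline) * (indicator - p)

state = {"steps": 0, "preferences": [0.0, 0.0], "average_reward": 0.0}
gradient_update(state, action=0, reward=1.0, alpha=0.1)  # zero first step
gradient_update(state, action=0, reward=2.0, alpha=0.1)
print(state["preferences"])  # arm 0 preference rises, arm 1 falls
```

Without the baseline (use_baseline=False) the update is still unbiased in expectation, but subtracting the average reward typically reduces the variance of the preference updates.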