# Bandit Algorithms
This module implements a small set of classic discrete-action bandit algorithms:
- greedy / epsilon-greedy action selection,
- UCB1-style confidence bonuses,
- incremental sample-average value estimation,
- gradient bandits with a softmax policy over preferences.
Primary references:
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., bandit chapter): http://incompleteideas.net/book/the-book-2nd.html
- Auer, Cesa-Bianchi, and Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (2002): https://doi.org/10.1023/A:1013689704352
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992): https://doi.org/10.1023/A:1022672621406
Value-estimation state for finite-armed bandits.
- counts : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)
Per-action pull counts.
- values : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)
Per-action estimated values.
Preference / policy-gradient state for gradient bandits.
- steps : α
Number of observed rewards so far (tracked as the ambient scalar type).
- preferences : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)
Preference logits over actions.
- averageReward : α
Running average reward baseline.
Zero-initialized action-value state.
Zero-initialized preference state.
Greedy action under the current estimates, if the action space is nonempty.
Epsilon-greedy action selection with explicit exploration draw and fallback action.
The caller supplies:
- epsilon: the exploration probability,
- draw: a pre-sampled uniform value in [0, 1),
- exploreAction: the action to use when the exploration branch is taken.
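A minimal Python sketch of this selection rule (illustrative only; the names and list-based representation are assumptions, and the module itself exposes this through its own tensor types, not Python):

```python
def epsilon_greedy(values, epsilon, draw, explore_action):
    """Return explore_action when the pre-sampled uniform draw falls below
    epsilon; otherwise return the greedy arm (first maximum on ties)."""
    if draw < epsilon:
        return explore_action
    # Greedy branch: index of the largest estimated value.
    return max(range(len(values)), key=lambda a: values[a])

print(epsilon_greedy([0.1, 0.9, 0.5], 0.1, 0.05, 2))  # 2: exploration branch
print(epsilon_greedy([0.1, 0.9, 0.5], 0.1, 0.42, 2))  # 1: greedy branch
```

Passing the uniform draw and the exploration action in explicitly, as described above, keeps the selection function itself deterministic and free of any sampling dependency.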
Incremental sample-average update for one bandit arm.
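The incremental form keeps each estimate equal to the running mean of that arm's rewards without storing the reward history: Q_n = Q_{n-1} + (R - Q_{n-1}) / n. A Python sketch under those assumptions (plain lists stand in for the module's per-action tensors):

```python
def update_arm(counts, values, action, reward):
    """Incremental sample-average update for one arm, in place.
    After n pulls, values[action] equals the mean of the n rewards seen."""
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

counts, values = [0, 0], [0.0, 0.0]
for r in (1.0, 3.0, 5.0):
    update_arm(counts, values, 0, r)
print(counts[0], values[0])  # 3 pulls, mean 3.0
```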
Total number of pulls recorded in a ValueState.
UCB1-style exploration bonus.
We use max(pulls, epsilon) in the denominator so the helper stays total while still giving
very large bonuses to unseen or nearly-unseen actions.
Per-action UCB1 scores.
Best action under UCB1 scores, if the action space is nonempty.
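The three UCB helpers above can be sketched together in Python. The exploration constant `c` and the epsilon floor value are assumptions chosen for illustration; the key point is the max(pulls, epsilon) denominator, which keeps the bonus total while sending it very high for unseen arms:

```python
import math

def ucb_bonus(total_pulls, arm_pulls, c=2.0, eps=1e-9):
    """UCB1-style bonus. max(arm_pulls, eps) keeps the helper total and
    makes unseen or nearly-unseen arms dominate the scores."""
    t = max(total_pulls, 1)
    return c * math.sqrt(math.log(t) / max(arm_pulls, eps))

def ucb_action(counts, values, c=2.0):
    """Best action under per-arm score = estimate + exploration bonus."""
    total = sum(counts)
    scores = [v + ucb_bonus(total, n, c) for v, n in zip(values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# The never-pulled arm 2 receives a huge bonus and is selected.
print(ucb_action([10, 1, 0], [0.5, 0.4, 0.0]))  # 2
```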
Softmax policy used by the gradient-bandit algorithm.
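A short Python sketch of the softmax over preference logits, using the standard max-subtraction trick for numerical stability (the function name is hypothetical):

```python
import math

def softmax(preferences):
    """Softmax policy over preference logits, stabilized by subtracting
    the maximum logit before exponentiating."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([1.0, 1.0, 1.0]))  # uniform: each probability is 1/3
```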
Gradient-bandit preference update with an optional average-reward baseline.
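The update described here follows the standard gradient-bandit rule from Sutton and Barto's bandit chapter: H[a] += alpha * (R - baseline) * (1{a = A} - pi(a)), with the running average reward as the baseline. A hedged Python sketch, with a dict standing in for the steps/preferences/averageReward state fields (all names are illustrative, not this module's API):

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_update(state, action, reward, alpha, use_baseline=True):
    """One gradient-bandit step with an optional average-reward baseline.
    The baseline is updated first, so it includes the current reward and
    the very first increment is exactly zero."""
    state["steps"] += 1
    state["average_reward"] += (reward - state["average_reward"]) / state["steps"]
    baseline = state["average_reward"] if use_baseline else 0.0
    pi = softmax(state["preferences"])
    for a, p in enumerate(pi):
        indicator = 1.0 if a == action else 0.0
        state["preferences"][a] += alpha * (reward - baseline) * (indicator - p)

state = {"steps": 0, "preferences": [0.0, 0.0], "average_reward": 0.0}
gradient_update(state, action=0, reward=1.0, alpha=0.1)  # zero first step
gradient_update(state, action=0, reward=2.0, alpha=0.1)
print(state["preferences"])  # arm 0 preference rises, arm 1 falls
```

Without the baseline (use_baseline=False) the update is still unbiased in expectation, but subtracting the average reward typically reduces the variance of the preference updates.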