TorchLean API

NN.Runtime.RL.Algorithms.Bandits

Bandit Algorithms #

This module implements a small set of classic discrete-action bandit algorithms: sample-average value estimation with greedy and epsilon-greedy action selection, UCB1, and gradient bandits with an optional average-reward baseline.

structure Runtime.RL.Bandits.ValueState (α : Type) (nActions : ℕ) :
Type

Value-estimation state for finite-armed bandits.

Instances For

structure Runtime.RL.Bandits.PreferenceState (α : Type) (nActions : ℕ) :
Type

Preference / policy-gradient state for gradient bandits.

• steps : α

  Number of observed rewards so far (tracked as the ambient scalar type).

• preferences : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)

  Preference logits over actions.

• averageReward : α

  Running average reward baseline.

Instances For

def Runtime.RL.Bandits.ValueState.init {α : Type} [Context α] {nActions : ℕ} :
ValueState α nActions

Zero-initialized action-value state.

Instances For

def Runtime.RL.Bandits.PreferenceState.init {α : Type} [Context α] {nActions : ℕ} :
PreferenceState α nActions

Zero-initialized preference state.

Instances For
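
A minimal initialization sketch. It assumes a Context Float instance exists (not shown on this page) and fixes the implicit nActions through the ascribed result types:

  open Runtime.RL.Bandits

  -- Zero-initialized states for a 5-armed bandit; the implicit
  -- nActions = 5 is inferred from the type ascriptions.
  def values0 : ValueState Float 5 := ValueState.init
  def prefs0 : PreferenceState Float 5 := PreferenceState.init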

def Runtime.RL.Bandits.greedyAction? {α : Type} [Context α] {nActions : ℕ} (state : ValueState α nActions) :
Option (Fin nActions)

Greedy action under the current estimates, if the action space is nonempty.

Instances For

def Runtime.RL.Bandits.epsilonGreedyAction? {α : Type} [Context α] {nActions : ℕ} (state : ValueState α nActions) (epsilon draw : α) (exploreAction : Fin nActions) :
Option (Fin nActions)

Epsilon-greedy action selection with explicit exploration draw and fallback action.

The caller supplies:

• epsilon: the exploration probability,
• draw: a pre-sampled uniform value in [0,1),
• exploreAction: the action to use when the exploration branch is taken.

A usage sketch follows this entry.

Instances For
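
A usage sketch under the same assumed Context Float instance. The ε value here is illustrative; draw should come from the caller's uniform sampler and explore from a uniform choice over the arms:

  open Runtime.RL.Bandits

  -- One epsilon-greedy selection with ε = 0.1: when draw lands in the
  -- exploration branch, the pre-chosen explore arm is returned;
  -- otherwise the greedy arm (if any).
  def pick (state : ValueState Float 5) (draw : Float) (explore : Fin 5) :
      Option (Fin 5) :=
    epsilonGreedyAction? state 0.1 draw explore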

def Runtime.RL.Bandits.sampleAverageStep {α : Type} [Context α] {nActions : ℕ} (state : ValueState α nActions) (action : Fin nActions) (reward : α) :
ValueState α nActions

Incremental sample-average update for one bandit arm.

Instances For
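
For reference, the textbook incremental form of the sample average, which this step presumably implements (Q_n is the arm's estimate after n − 1 rewards, R_n the n-th reward observed for that arm):

  Q_{n+1} = Q_n + \frac{1}{n} (R_n - Q_n)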

def Runtime.RL.Bandits.totalPulls {α : Type} [Context α] {nActions : ℕ} (state : ValueState α nActions) :
α

Total number of pulls recorded in a ValueState.

Instances For

def Runtime.RL.Bandits.ucb1Bonus {α : Type} [Context α] (exploration totalPulls actionPulls : α) :
α

UCB1-style exploration bonus.

We use max(pulls, epsilon) in the denominator so the helper stays total while still giving very large bonuses to unseen or nearly-unseen actions.

Instances For
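
Read against the classic UCB1 rule, with c = exploration, N = totalPulls, N_a = actionPulls, and ε the small guard mentioned above, the bonus presumably takes the form:

  c \sqrt{ \frac{\ln N}{\max(N_a, \varepsilon)} }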

def Runtime.RL.Bandits.ucb1Scores {α : Type} [Context α] {nActions : ℕ} (state : ValueState α nActions) (exploration : α := Numbers.two) :
Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)

Per-action UCB1 scores.

Instances For

def Runtime.RL.Bandits.ucb1Action? {α : Type} [Context α] {nActions : ℕ} (state : ValueState α nActions) (exploration : α := Numbers.two) :
Option (Fin nActions)

Best action under UCB1 scores, if the action space is nonempty.

Instances For
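
A one-step interaction sketch combining ucb1Action? with sampleAverageStep, again assuming a Context Float instance; pull stands in for the caller's environment:

  open Runtime.RL.Bandits

  -- Select by UCB1 scores (default exploration = Numbers.two), pull the
  -- chosen arm, and fold the reward back into the running estimates.
  def ucbStep (state : ValueState Float 5) (pull : Fin 5 → Float) :
      ValueState Float 5 :=
    match ucb1Action? state with
    | some a => sampleAverageStep state a (pull a)
    | none => state  -- empty action space; cannot happen with 5 arms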

def Runtime.RL.Bandits.gradientPolicy {α : Type} [Context α] {nActions : ℕ} (state : PreferenceState α nActions) :
Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)

Softmax policy used by the gradient-bandit algorithm.

Instances For
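
With preference logits H, this is the usual softmax:

  \pi(a) = \frac{\exp H(a)}{\sum_b \exp H(b)}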

def Runtime.RL.Bandits.gradientBanditStep {α : Type} [Context α] {nActions : ℕ} (state : PreferenceState α nActions) (action : Fin nActions) (reward stepSize : α) (useBaseline : Bool := true) :
PreferenceState α nActions

Gradient-bandit preference update with an optional average-reward baseline.

Instances For
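
The update presumably follows the classic gradient-bandit rule: with selected arm A = action, reward R, step size η = stepSize, policy π = gradientPolicy state, and baseline B equal to averageReward when useBaseline = true (and 0 otherwise),

  H'(a) = H(a) + \eta\, (R - B)\, (\mathbf{1}[a = A] - \pi(a))

for every arm a, where \mathbf{1}[·] is the indicator function.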