TorchLean API

NN.Runtime.RL.Algorithms.ValueLearning

Deep Value-Learning Objectives

This module packages the core scalar objectives and targets behind common deep value-based RL algorithms:

- DQN (bootstrap target, TD residual, squared and Huber losses)
- Double DQN (decoupled action selection and evaluation)
- DDPG (actor objective and critic target)
- TD3 (clipped-double critic target)
- SAC (entropy-regularized soft target and actor objective)

The functions are intentionally small and typed. They expose the textbook math while leaving experience replay, target-network sync, and optimizer orchestration to higher-level code.

def Runtime.RL.ValueLearning.chosenActionValue {α : Type} {nActions : ℕ} (qValues : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) : α

Extract Q(s, a) from a vector of action-values.

def Runtime.RL.ValueLearning.maxQValue {α : Type} [Context α] {nActions : ℕ} (qValues : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Maximum Q-value in a vector, defaulting to 0 when nActions = 0.
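
A minimal list-level sketch of the same convention, for intuition only (the library version folds over a Spec.Tensor; the helper below is hypothetical):

  -- Maximum over the elements; an empty action set falls back to 0,
  -- matching the nActions = 0 case above.
  def maxQValueSketch (qs : List Float) : Float :=
    match qs with
    | [] => 0.0
    | q :: rest => rest.foldl (fun acc x => if acc < x then x else acc) q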

def Runtime.RL.ValueLearning.dqnTarget {α : Type} [Context α] {nActions : ℕ} (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

DQN bootstrap target r + γ max_a Q_target(s', a).
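
A scalar sketch of the formula (assumptions: maxNextQ stands for the maxQValue of nextQTarget, and the done flag drops the bootstrap term, which the signature suggests but the docstring does not state):

  -- Hypothetical scalar version, not the tensor implementation.
  def dqnTargetSketch (r γ maxNextQ : Float) (done : Bool) : Float :=
    if done then r          -- terminal state: no bootstrap
    else r + γ * maxNextQ   -- r + γ max_a Q_target(s', a)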

def Runtime.RL.ValueLearning.doubleDqnTarget {α : Type} [Context α] {nActions : ℕ} (reward gamma : α) (done : Bool) (nextQOnline nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Double DQN target: select with the online network, evaluate with the target network.
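
A list-level sketch of the select/evaluate split (illustrative only; zip-based pairing stands in for tensor indexing): pick the argmax under the online values, then read that slot from the target values.

  -- Argmax over the online net's values, evaluated with the target net.
  def doubleDqnEvalSketch (online target : List Float) : Float :=
    match online.zip target with
    | [] => 0.0
    | p :: rest =>
      (rest.foldl (fun best q => if best.1 < q.1 then q else best) p).2

The bootstrap target is then formed exactly as in dqnTarget, with this value in place of the plain maximum.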

def Runtime.RL.ValueLearning.dqnResidual {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

DQN temporal-difference residual for one sampled action.
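
A scalar sketch, assuming the usual convention residual = target - Q(s, a) (the sign is an assumption; check the source if it matters for your optimizer):

  -- Hypothetical scalar version of the TD residual.
  def dqnResidualSketch (qSA target : Float) : Float :=
    target - qSA

Squaring this residual gives the MSE loss below; passing it through a Huber function gives the robust variant.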

def Runtime.RL.ValueLearning.dqnMSELoss {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Squared TD loss for DQN.

def Runtime.RL.ValueLearning.dqnHuberLoss {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (delta : α := 1) : α

Huber TD loss for DQN.
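
A self-contained sketch of the standard Huber function applied to the residual (the 0.5 scaling and the default delta = 1 follow the common textbook definition; the library's exact scaling is an assumption):

  -- Quadratic near zero, linear in the tails; agrees with half the
  -- squared loss whenever |x| ≤ delta.
  def huberSketch (x : Float) (delta : Float := 1.0) : Float :=
    let a := Float.abs x
    if a ≤ delta then 0.5 * x * x
    else delta * (a - 0.5 * delta)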

def Runtime.RL.ValueLearning.doubleDqnResidual {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQOnline nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Double-DQN temporal-difference residual.

def Runtime.RL.ValueLearning.ddpgActorObjective {α : Type} [Context α] (criticValue : α) : α

Deterministic-policy-gradient actor objective used by DDPG: maximize Q(s, μ(s)), or equivalently minimize -Q(s, μ(s)).
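
The scalar form is a plain negation; the point is that running a descent optimizer on -Q(s, μ(s)) performs gradient ascent on Q(s, μ(s)):

  -- Minimizing the negated critic value maximizes the critic value.
  def ddpgActorObjectiveSketch (criticValue : Float) : Float :=
    -criticValue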

def Runtime.RL.ValueLearning.ddpgCriticTarget {α : Type} [Context α] (reward gamma nextCriticValue : α) (done : Bool := false) : α

DDPG critic target r + γ Q_target(s', μ_target(s')).

def Runtime.RL.ValueLearning.td3Target {α : Type} [Context α] (reward gamma nextCritic1 nextCritic2 : α) (done : Bool := false) : α

TD3 clipped-double target using the minimum of the two target critics.
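
A scalar sketch (terminal-state masking via done is assumed, as in the DQN target):

  -- Clipped double-Q: bootstrap from the smaller (pessimistic) critic
  -- to counteract overestimation bias.
  def td3TargetSketch (r γ q1 q2 : Float) (done : Bool := false) : Float :=
    if done then r
    else r + γ * (if q1 ≤ q2 then q1 else q2)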

def Runtime.RL.ValueLearning.sacTarget {α : Type} [Context α] (reward gamma nextCritic1 nextCritic2 logProb temperature : α) (done : Bool := false) : α

SAC entropy-regularized soft target: r + γ (min(Q1', Q2') - α * log π(a'|s')).
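
A scalar sketch of the soft target (terminal masking via done is assumed; temperature plays the role of α in the formula above):

  -- Soft next-state value: pessimistic critic minimum plus entropy bonus.
  def sacTargetSketch (r γ q1 q2 logPi temp : Float)
      (done : Bool := false) : Float :=
    let softV := (if q1 ≤ q2 then q1 else q2) - temp * logPi
    if done then r else r + γ * softV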

def Runtime.RL.ValueLearning.sacActorObjective {α : Type} [Context α] (critic1 critic2 logProb temperature : α) : α

SAC actor objective: minimize α * log π(a|s) - min(Q1, Q2).
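
A scalar sketch; minimizing this trades entropy (through the log-probability term) against the pessimistic critic value:

  -- Lower is better: favors high min(Q1, Q2) and low log π(a|s),
  -- i.e. high entropy, weighted by the temperature.
  def sacActorObjectiveSketch (q1 q2 logPi temp : Float) : Float :=
    temp * logPi - (if q1 ≤ q2 then q1 else q2)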
