TorchLean API

NN.Runtime.RL.Algorithms.PolicyGradient

Policy-Gradient Objectives #

This module exposes typed helpers for the main categorical-policy objectives that modern policy-gradient code tends to rely on: REINFORCE, advantage actor-critic (A2C/A3C), PPO's clipped surrogate, TRPO-style and KL-penalized surrogates, a categorical soft actor-critic (SAC) actor term, and the categorical sampling utilities they need.

The helpers operate on logits for a finite action space and stay purely functional so they can be used from either eager runtime code or proof-oriented spec code.
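As a minimal sketch of that usage (hypothetical, since this page does not state that Float carries a Context instance), a caller holding a logits tensor can evaluate the helpers below as ordinary pure functions:

    open Runtime.RL.PolicyGradient in
    -- Hypothetical sketch; assumes a `Context Float` instance is available.
    def exampleLogProb {n : ℕ}
        (logits : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar))
        (a : Fin n) : Float :=
      actionLogProbability logits a
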

Softmax policy induced by a vector of logits.

Instances For
def Runtime.RL.PolicyGradient.actionProbability {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (epsilon : α := Numbers.epsilon) :
    α

Probability of a selected action under a categorical policy.

Instances For

def Runtime.RL.PolicyGradient.actionLogProbability {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (epsilon : α := Numbers.epsilon) :
    α

Log-probability of a selected action.

Instances For

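For intuition, the two helpers above should agree up to rounding and the epsilon clamp. A hypothetical consistency check, under the same Context Float assumption as the earlier sketch:

    def logProbConsistency {n : ℕ}
        (logits : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar))
        (a : Fin n) : Float :=
      -- Expected to be ≈ 0, up to rounding and the epsilon clamp.
      Runtime.RL.PolicyGradient.actionLogProbability logits a
        - Float.log (Runtime.RL.PolicyGradient.actionProbability logits a)
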
def Runtime.RL.PolicyGradient.entropyBonus {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (epsilon : α := Numbers.epsilon) :
    α

Entropy bonus for a categorical policy: -Σ_a p(a) log p(a).

Instances For

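A useful sanity bound, and a mathematical fact about categorical entropy rather than a lemma stated by this module: the bonus is maximized by the uniform policy, where it equals log n (up to the epsilon clamp).

    -- Upper bound for entropyBonus over n actions: H(uniform) = log n.
    def maxCategoricalEntropy (n : ℕ) : Float :=
      Float.log (Float.ofNat n)
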
def Runtime.RL.PolicyGradient.reinforceLoss {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (returnOrAdvantage : α) (epsilon : α := Numbers.epsilon) :
    α

REINFORCE loss for one sampled action: -G_t * log π(a_t | s_t).

Instances For

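Minimizing this loss performs stochastic gradient ascent on expected return via the score-function identity ∇_θ E[G] = E[G · ∇_θ log π(a|s)]. A hypothetical episode-level use, assuming each step stores its precomputed return-to-go G_t and that a Context Float instance exists:

    def episodeReinforceLoss {n : ℕ}
        (steps : List (Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar) × Fin n × Float)) :
        Float :=
      steps.foldl (init := 0.0) fun acc step =>
        let (logits, a, returnToGo) := step
        acc + Runtime.RL.PolicyGradient.reinforceLoss logits a returnToGo
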
def Runtime.RL.PolicyGradient.actorLoss {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (advantage : α) (epsilon : α := Numbers.epsilon) :
    α

Advantage actor-critic policy loss: -A_t * log π(a_t | s_t).

Instances For

def Runtime.RL.PolicyGradient.criticLoss {α : Type} [Context α] (valuePrediction valueTarget : α) (valueCoef : α := 1) :
    α

Value-regression loss used by actor-critic and PPO critics.

Instances For

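One common choice of valueTarget is the TD(0) bootstrap r + γ · V(s'). A hypothetical sketch of the plumbing (the page does not specify the regression form, e.g. squared error versus Huber, so only the call shape is shown), again assuming Context Float:

    def tdCriticLoss (v vNext reward gamma : Float) : Float :=
      -- valueCoef is left at its default of 1.
      Runtime.RL.PolicyGradient.criticLoss v (reward + gamma * vNext)
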
def Runtime.RL.PolicyGradient.actorCriticLoss {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (advantage valuePrediction valueTarget : α) (valueCoef : α := 1) (entropyCoef : α := 0) (epsilon : α := Numbers.epsilon) :
    α

Combined advantage actor-critic loss: policy term + value regression - entropy bonus.

Instances For

def Runtime.RL.PolicyGradient.a2cLoss {α : Type} [Context α] {nActions : ℕ} (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (advantage valuePrediction valueTarget : α) (valueCoef : α := 1) (entropyCoef : α := 0) (epsilon : α := Numbers.epsilon) :
    α

Advantage actor-critic loss with an explicit entropy bonus coefficient.

This is the A2C/A3C-shaped single-sample objective: -A_t * log π(a_t | s_t) + c_v * value_loss - c_e * H(π(·|s_t)).

Instances For

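A hypothetical single-transition A2C step that uses the TD error as its advantage estimate, with the coefficient values popularized by common A2C implementations (c_v = 0.5, c_e = 0.01), under the same Context Float assumption:

    def a2cStepLoss {n : ℕ}
        (logits : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar))
        (a : Fin n) (v target : Float) : Float :=
      let advantage := target - v  -- TD-error advantage estimate
      Runtime.RL.PolicyGradient.a2cLoss logits a advantage v target
        (valueCoef := 0.5) (entropyCoef := 0.01)
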
def Runtime.RL.PolicyGradient.importanceRatio {α : Type} [Context α] (newLogProb oldLogProb : α) :
    α

Importance ratio π_new(a|s) / π_old(a|s) computed from log-probabilities as exp(newLogProb - oldLogProb).

Instances For

def Runtime.RL.PolicyGradient.categoricalKL {α : Type} [Context α] {nActions : ℕ} (oldProbs newProbs : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (epsilon : α := Numbers.epsilon) :
    α

Categorical KL divergence KL(old || new) from two probability vectors: Σ_a old(a) * (log old(a) - log new(a)).

Both distributions are clamped into [epsilon, 1-epsilon] before taking logs. This is the scalar penalty used by TRPO-style diagnostics and KL-penalized policy-gradient objectives.

Instances For

def Runtime.RL.PolicyGradient.categoricalKLFromLogits {α : Type} [Context α] {nActions : ℕ} (oldLogits newLogits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (epsilon : α := Numbers.epsilon) :
    α

KL divergence KL(π_old(·|s) || π_new(·|s)) computed from old/new logits.

Instances For

def Runtime.RL.PolicyGradient.trpoSurrogateFromRatio {α : Type} [Context α] (ratio advantage : α) :
    α

TRPO-style surrogate objective from a precomputed importance ratio: ratio * A.

TRPO maximizes this surrogate subject to a KL trust-region constraint. We expose the scalar surrogate separately from the constraint so callers can pair it with a line search, a KL penalty, or plain diagnostics, as their update scheme requires.

Instances For

def Runtime.RL.PolicyGradient.klPenalizedPolicyLoss {α : Type} [Context α] (ratio advantage kl penaltyCoef : α) :
    α

KL-penalized policy-gradient loss: -(ratio * A) + β * KL(old || new).

This is not the full constrained TRPO optimizer; it is the differentiable scalar objective commonly used as a practical surrogate or diagnostic when implementing trust-region updates.

Instances For

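The scalar helpers on this page compose directly. A hypothetical KL-penalized objective assembled only from the functions documented here (assuming Context Float):

    def klPenalizedExample {n : ℕ}
        (oldLogits newLogits : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar))
        (a : Fin n) (advantage beta : Float) : Float :=
      let newLp := Runtime.RL.PolicyGradient.actionLogProbability newLogits a
      let oldLp := Runtime.RL.PolicyGradient.actionLogProbability oldLogits a
      let ratio := Runtime.RL.PolicyGradient.importanceRatio newLp oldLp
      let kl := Runtime.RL.PolicyGradient.categoricalKLFromLogits oldLogits newLogits
      Runtime.RL.PolicyGradient.klPenalizedPolicyLoss ratio advantage kl beta
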
def Runtime.RL.PolicyGradient.sacCategoricalActorLoss {α : Type} [Context α] {nActions : ℕ} (logits qValues : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (temperature : α) (epsilon : α := Numbers.epsilon) :
    α

Soft actor-critic categorical actor objective: temperature * log π(a|s) - Q(s,a).

For continuous SAC the action is reparameterized; for finite actions this scalar is the sampled-action form used inside a categorical policy update.

Instances For

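A hypothetical per-sample use with a fixed temperature; 0.2 is the fixed value used in some of the original SAC experiments, not a default of this module (assuming Context Float):

    def sacActorStepLoss {n : ℕ}
        (logits qValues : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar))
        (a : Fin n) : Float :=
      Runtime.RL.PolicyGradient.sacCategoricalActorLoss logits qValues a 0.2
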
def Runtime.RL.PolicyGradient.ppoClippedObjectiveFromRatio {α : Type} [Context α] (ratio advantage clipEps : α) :
    α

PPO clipped surrogate objective from a precomputed importance ratio:

min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).

This helper is useful when you already have the ratio (e.g. from cached log-probabilities) and want to avoid recomputing it from logits.

Instances For

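A hypothetical use with cached log-probabilities, taking the clip range ε = 0.2 from the PPO paper (assuming Context Float):

    def ppoFromCachedLogProbs (newLogProb oldLogProb advantage : Float) : Float :=
      let r := Runtime.RL.PolicyGradient.importanceRatio newLogProb oldLogProb
      Runtime.RL.PolicyGradient.ppoClippedObjectiveFromRatio r advantage 0.2
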
def Runtime.RL.PolicyGradient.ppoClippedObjective {α : Type} [Context α] {nActions : ℕ} (newLogits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (oldLogProb advantage clipEps : α) (epsilon : α := Numbers.epsilon) :
    α

PPO clipped surrogate objective for one sampled action.

This is the objective to maximize: min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t).

Instances For

def Runtime.RL.PolicyGradient.ppoLoss {α : Type} [Context α] {nActions : ℕ} (newLogits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (oldLogProb advantage valuePrediction valueTarget clipEps : α) (valueCoef : α := 1) (entropyCoef : α := 0) (epsilon : α := Numbers.epsilon) :
    α

PPO loss to minimize: -L_clip + c_v * value_loss - c_e * entropy.

Instances For

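A hypothetical full training loss for one transition, using clip range 0.2 and the commonly used coefficients c_v = 0.5 and c_e = 0.01 (assuming Context Float):

    def ppoStepLoss {n : ℕ}
        (newLogits : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar))
        (a : Fin n) (oldLogProb advantage v target : Float) : Float :=
      Runtime.RL.PolicyGradient.ppoLoss newLogits a oldLogProb advantage v target 0.2
        (valueCoef := 0.5) (entropyCoef := 0.01)
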
def Runtime.RL.PolicyGradient.sampleCategorical {α : Type} [Context α] {nActions : ℕ} [Fact (0 < nActions)] (seed counter : ℕ) (probs : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) :
    ℕ × Fin nActions

Sample from a categorical distribution represented as a probability vector.

seed and counter form an explicit RNG stream identifier. The function returns the incremented counter together with the sampled action index.

Implementation note: this uses the standard cumulative-sum / inverse-CDF sampler.

Instances For

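Because the sampler is pure, drawing several actions means threading the returned counter by hand. A hypothetical sketch, assuming the ℕ reading of the seed/counter types shown above and a Context Float instance:

    def sampleTwo {n : ℕ} [Fact (0 < n)] (seed counter : ℕ)
        (probs : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar)) :
        ℕ × Fin n × Fin n :=
      let (counter₁, a₁) := Runtime.RL.PolicyGradient.sampleCategorical seed counter probs
      let (counter₂, a₂) := Runtime.RL.PolicyGradient.sampleCategorical seed counter₁ probs
      (counter₂, a₁, a₂)
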
def Runtime.RL.PolicyGradient.sampleActionFromLogits {α : Type} [Context α] {nActions : ℕ} [Fact (0 < nActions)] (seed counter : ℕ) (logits : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) :
    ℕ × Fin nActions

Sample an action from logits by applying softmax then sampleCategorical.

Instances For

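A hypothetical rollout step that samples an action and records the log-probability later consumed by the PPO helpers, under the same assumptions:

    def rolloutStep {n : ℕ} [Fact (0 < n)] (seed counter : ℕ)
        (logits : Spec.Tensor Float (Spec.Shape.dim n Spec.Shape.scalar)) :
        ℕ × Fin n × Float :=
      let (counter', a) := Runtime.RL.PolicyGradient.sampleActionFromLogits seed counter logits
      (counter', a, Runtime.RL.PolicyGradient.actionLogProbability logits a)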