TorchLean API

NN.Runtime.RL.PolicyGradient.Autograd

Autograd Policy-Gradient Objectives #

This module provides differentiable policy-gradient / actor-critic helpers expressed in terms of TorchLean's backend-generic Ops interface, so they can run under any backend that implements that interface.

This file lives with the RL runtime, not the TorchLean runtime internals, because these are RL objectives that happen to be differentiable through TorchLean. It is the autograd companion to the pure helpers in NN.Runtime.RL.Algorithms.PolicyGradient.

Action Encoding #

We assume categorical (finite-action) policies parameterized by logits, and we represent the chosen action as a one-hot tensor with the same shape as the logits. This keeps the selected log-probability differentiable with respect to the logits without introducing a separate integer index type into the Ops surface.
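For instance (an illustrative example, not from the module docs), with nActions = 4 and chosen action index 2, the encoded action row is

actionOneHot[i] = [0, 0, 1, 0]

so an elementwise product with the row of log-probabilities followed by a sum over the action axis recovers log π(2 | s_i) using only tensor ops.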

Log-probabilities and entropy (batched, one-hot actions) #

def Runtime.RL.PolicyGradient.Autograd.actionLogProbOneHotBatch {α : Type} [Context α] [DecidableEq Spec.Shape] {m : Type → Type} [Monad m] [Autograd.Torch.Ops m α] {batch nActions : ℕ} [hBatch : Fact (0 < batch)] [hAct : Fact (0 < nActions)] (logits actionOneHot : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch (Spec.Shape.dim nActions Spec.Shape.scalar))) (ε : α := Numbers.epsilon) :

Per-sample log-probability for one-hot actions under a batched categorical policy.

Input shapes:

  • logits : (N × A)
  • actionOneHot : (N × A)

Output shape:

  • logProb : (N) where logProb[i] = log π(a_i | s_i).

Implementation note: this uses log_softmax and a reduce-sum over the action axis.
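In symbols (a sketch of the computation, matching the note above):

logProb[i] = Σ_a actionOneHot[i, a] * log_softmax(logits)[i, a]

Because actionOneHot[i] is zero except at the chosen action a_i, the sum evaluates to log π(a_i | s_i) while staying differentiable in the logits.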

def Runtime.RL.PolicyGradient.Autograd.entropyMean {α : Type} [Context α] [DecidableEq Spec.Shape] {m : Type → Type} [Monad m] [Autograd.Torch.Ops m α] {batch nActions : ℕ} [hBatch : Fact (0 < batch)] [hAct : Fact (0 < nActions)] (logits : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch (Spec.Shape.dim nActions Spec.Shape.scalar))) (ε : α := Numbers.epsilon) :

Mean entropy of a batched categorical policy.

Input shape:

• logits : (N × A)

Output shape:

• scalar entropy mean: mean_i[ -Σ_a p_i(a) log p_i(a) ].
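As a sanity check (a worked example, not from the source): if every row of logits is constant, each p_i is uniform over the A actions, so -Σ_a p_i(a) log p_i(a) = log A in every row and the mean is log A; a near-deterministic policy instead has entropy near 0. This is the quantity the entropy term in ppoLossBatch rewards.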

PPO (batched) #

def Runtime.RL.PolicyGradient.Autograd.ppoClippedObjectiveBatch {α : Type} [Context α] [DecidableEq Spec.Shape] {m : Type → Type} [Monad m] [Autograd.Torch.Ops m α] {batch nActions : ℕ} [hBatch : Fact (0 < batch)] [hAct : Fact (0 < nActions)] (newLogits actionOneHot : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch (Spec.Shape.dim nActions Spec.Shape.scalar))) (oldLogProb advantage : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch Spec.Shape.scalar)) (clipEps : α := 1 / Coe.coe 5) (ε : α := Numbers.epsilon) :

PPO clipped surrogate objective (the quantity to maximize), computed per sample:

L_clip_i = min(r_i * A_i, clip(r_i, 1 - clipEps, 1 + clipEps) * A_i)

where r_i = exp(logπ_new(a_i | s_i) - logπ_old(a_i | s_i)).
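A quick numeric illustration (not from the source): with clipEps = 0.2, A_i = 1, and r_i = 1.5, the unclipped term is 1.5 while clip(1.5, 0.8, 1.2) * 1 = 1.2, so L_clip_i = min(1.5, 1.2) = 1.2 and the objective stops rewarding ratio moves beyond the clip range. With A_i = -1 the min selects the more negative unclipped term -1.5, so harmful moves are never masked by clipping.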

def Runtime.RL.PolicyGradient.Autograd.ppoLossBatch {α : Type} [Context α] [DecidableEq Spec.Shape] {m : Type → Type} [Monad m] [Autograd.Torch.Ops m α] {batch nActions : ℕ} [hBatch : Fact (0 < batch)] [hAct : Fact (0 < nActions)] (newLogits actionOneHot : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch (Spec.Shape.dim nActions Spec.Shape.scalar))) (oldLogProb advantage : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch Spec.Shape.scalar)) (valuePred valueTarget : Autograd.TorchLean.RefTy m α (Spec.Shape.dim batch (Spec.Shape.dim 1 Spec.Shape.scalar))) (clipEps : α := 1 / Coe.coe 5) (valueCoef : α := 1 / Coe.coe 2) (entropyCoef : α := 1 / Coe.coe 100) (ε : α := Numbers.epsilon) :

PPO scalar loss to minimize (mean over batch):

loss = -mean(L_clip) + c_v * MSE(v, v_target) - c_e * mean(entropy)

This is the standard discrete-action PPO loss used in many reference implementations.
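Reading off the default arguments in the signature above (the coercions spell out plain rationals): clipEps = 1 / 5 = 0.2, valueCoef = 1 / 2 = 0.5, and entropyCoef = 1 / 100 = 0.01, i.e.

loss = -mean(L_clip) + 0.5 * MSE(v, v_target) - 0.01 * mean(entropy)

with the clip range [0.8, 1.2].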


PPO module wrapper (two-model actor/critic) #

Bundle an actor and critic into a ScalarModuleDef whose inputs are a PPO minibatch:

• states : (N × stateDim)
• actionsOneHot : (N × A)
• oldLogProb : (N)
• advantages : (N)
• valueTarget : (N × 1)

The parameters are actor.params ++ critic.params, and one optimizer step updates both.
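A minimal sketch of the bundling idea in toy form (the structure and field names below are hypothetical stand-ins, not the real ScalarModuleDef API):

-- Toy stand-in for a module definition; `params` is the only field modeled.
structure ToyModule (α : Type) where
  params : List α

-- Concatenating parameter lists means a single optimizer step that walks
-- `params` updates actor and critic weights together.
def bundle {α : Type} (actor critic : ToyModule α) : ToyModule α :=
  { params := actor.params ++ critic.params }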
