Autograd Policy-Gradient Objectives #
This module provides differentiable policy-gradient / actor-critic helpers expressed in terms of
TorchLean's backend-generic Ops interface, so they can run under:
- the eager runtime backend (imperative autograd, GPU-capable), and
- the compiled backend (graph recording / proof tooling).
This file lives with the RL runtime, not the TorchLean runtime internals, because these are RL
objectives that happen to be differentiable through TorchLean. It is the autograd companion to the
pure helpers in NN.Runtime.RL.Algorithms.PolicyGradient:
- NN.Runtime.RL.PolicyGradient works with concrete spec tensors (Tensor α s).
- NN.Runtime.RL.PolicyGradient.Autograd works with backend refs (RefTy (m := m) (α := α) s) so autograd can differentiate the objectives.
Action Encoding #
We assume categorical (finite-action) policies parameterized by logits, and we represent the
chosen action as a one-hot tensor with the same shape as the logits. This keeps the selected
log-probability differentiable with respect to the logits without introducing a separate integer
index type into the Ops surface.
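The one-hot encoding itself is just a shape convention, independent of TorchLean. A minimal NumPy sketch (the action indices and sizes below are made up for illustration, not part of the library's API):

```python
import numpy as np

# Hypothetical integer action indices for a batch of N = 4 samples,
# with A = 3 discrete actions available.
actions = np.array([2, 0, 1, 2])
num_actions = 3

# One-hot encoding with the same (N x A) shape as the logits.
action_one_hot = np.eye(num_actions)[actions]
# array([[0., 0., 1.],
#        [1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])
```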
Primary References #
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992): https://doi.org/10.1023/A:1022672621406
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Chapters 12–13: http://incompleteideas.net/book/the-book-2nd.html
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
Log-probabilities and entropy (batched, one-hot actions) #
Per-sample log-probability for one-hot actions under a batched categorical policy.
Input shapes:
- logits : (N × A)
- actionOneHot : (N × A)
Output shape:
- logProb : (N), where logProb[i] = log π(a_i | s_i).
Implementation note: this uses log_softmax and a reduce-sum over the action axis.
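A reference NumPy sketch of the same math (this mirrors the formula above, not the actual TorchLean Ops calls; logits and action_one_hot are assumed to be (N × A) arrays):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the action axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def log_prob(logits, action_one_hot):
    # Multiplying by the one-hot mask and reducing over the action axis
    # selects log pi(a_i | s_i) for each sample i; result has shape (N,).
    return (log_softmax(logits) * action_one_hot).sum(axis=1)
```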
Mean entropy of a batched categorical policy.
Input shape:
- logits : (N × A)
Output shape:
- scalar mean entropy: mean_i[ -Σ_a p_i(a) log p_i(a) ].
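The corresponding reference computation, again as a NumPy sketch rather than the Ops interface:

```python
import numpy as np

def entropy_mean(logits):
    # p_i(a) and log p_i(a) via a numerically stable softmax over the action axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_norm
    probs = np.exp(log_probs)
    # Per-sample entropy -sum_a p_i(a) log p_i(a), then the batch mean.
    return (-(probs * log_probs).sum(axis=1)).mean()
```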
PPO (batched) #
PPO clipped surrogate objective (the thing to maximize), computed per sample:
L_clip_i = min(r_i * A_i, clip(r_i, 1-ε, 1+ε) * A_i)
where r_i = exp(logπ_new(a_i|s_i) - logπ_old(a_i|s_i)).
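A reference NumPy version of the per-sample surrogate (argument names and the ε = 0.2 default are illustrative, not the module's signature):

```python
import numpy as np

def ppo_clip_objective(new_log_prob, old_log_prob, advantages, eps=0.2):
    # r_i = exp(log pi_new - log pi_old), the per-sample probability ratio.
    ratio = np.exp(new_log_prob - old_log_prob)
    # Clipped surrogate: take the pessimistic (smaller) of the unclipped
    # and clipped terms for each sample; result has shape (N,).
    return np.minimum(ratio * advantages,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages)
```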
PPO scalar loss to minimize (mean over batch):
loss = -mean(L_clip) + c_v * MSE(v, v_target) - c_e * mean(entropy)
This is the standard discrete-action PPO loss used in many reference implementations.
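A shape-level NumPy sketch of the combined loss (the coefficient defaults c_v = 0.5 and c_e = 0.01 are common choices in reference implementations, not necessarily this module's defaults):

```python
import numpy as np

def ppo_loss(l_clip, values, value_target, entropy, c_v=0.5, c_e=0.01):
    # Negate the surrogate (we minimize), add the value-function MSE term,
    # and subtract the entropy bonus, exactly as in the formula above.
    value_mse = np.mean((values - value_target) ** 2)
    return -np.mean(l_clip) + c_v * value_mse - c_e * np.mean(entropy)
```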
PPO module wrapper (two-model actor/critic) #
Bundle an actor and critic into a ScalarModuleDef whose inputs are a PPO minibatch:
- states : (N × stateDim)
- actionsOneHot : (N × A)
- oldLogProb : (N)
- advantages : (N)
- valueTarget : (N × 1)
The parameters are actor.params ++ critic.params, and one optimizer step updates both.
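A rough, shape-only sketch of what such a minibatch and concatenated parameter list look like (every name and size here is a placeholder, not the wrapper's actual fields):

```python
import numpy as np

# Hypothetical minibatch layout matching the input shapes listed above
# (N = batch size; state_dim and num_actions are illustrative).
N, state_dim, num_actions = 32, 8, 4
minibatch = {
    "states":        np.zeros((N, state_dim)),
    "actionsOneHot": np.zeros((N, num_actions)),
    "oldLogProb":    np.zeros((N,)),
    "advantages":    np.zeros((N,)),
    "valueTarget":   np.zeros((N, 1)),
}

# The wrapper's parameter list is the actor's followed by the critic's,
# so a single optimizer step over this concatenation updates both models.
actor_params = [np.zeros((state_dim, num_actions))]   # placeholder actor weights
critic_params = [np.zeros((state_dim, 1))]            # placeholder critic weights
params = actor_params + critic_params
```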