Autograd Policy-Gradient Objectives #
This module provides differentiable policy-gradient / actor-critic helpers expressed in terms of
TorchLean's backend-generic Ops interface, so they can run under:
- the eager runtime backend (imperative autograd, GPU-capable), and
- the compiled backend (graph recording / proof tooling).
This file lives with the RL runtime, not the TorchLean runtime internals, because these are RL
objectives that happen to be differentiable through TorchLean. It is the autograd companion to the
pure helpers in NN.Runtime.RL.Algorithms.PolicyGradient:
- NN.Runtime.RL.PolicyGradient works with concrete spec tensors (Tensor α s).
- NN.Runtime.RL.PolicyGradient.Autograd works with backend refs (RefTy (m := m) (α := α) s) so autograd can differentiate the objectives.
Action Encoding #
We assume categorical (finite-action) policies parameterized by logits, and we represent the
chosen action as a one-hot tensor with the same shape as the logits. This keeps the selected
log-probability differentiable with respect to the logits without introducing a separate integer
index type into the Ops surface.
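The one-hot encoding itself is just a shape convention, independent of TorchLean. A minimal NumPy sketch (the action indices and sizes below are made up for illustration, not part of the library's API):

```python
import numpy as np

# Hypothetical integer action indices for a batch of N = 4 samples,
# with A = 3 discrete actions available.
actions = np.array([2, 0, 1, 2])
num_actions = 3

# One-hot encoding with the same (N x A) shape as the logits.
action_one_hot = np.eye(num_actions)[actions]
# array([[0., 0., 1.],
#        [1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])
```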
Primary References #
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992): https://doi.org/10.1023/A:1022672621406
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Chapters 12–13: http://incompleteideas.net/book/the-book-2nd.html
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
Log-probabilities and entropy (batched, one-hot actions) #
Per-sample log-probability for one-hot actions under a batched categorical policy.
Input shapes:
- logits : (N × A)
- actionOneHot : (N × A)
Output shape:
- logProb : (N), where logProb[i] = log π(a_i | s_i).
Implementation note: this uses log_softmax and a reduce-sum over the action axis.
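A reference NumPy sketch of the same math (this mirrors the formula above, not the actual TorchLean Ops calls; logits and action_one_hot are assumed to be (N × A) arrays):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the action axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def log_prob(logits, action_one_hot):
    # Multiplying by the one-hot mask and reducing over the action axis
    # selects log pi(a_i | s_i) for each sample i; result has shape (N,).
    return (log_softmax(logits) * action_one_hot).sum(axis=1)
```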
Mean entropy of a batched categorical policy.
Input shape:
- logits : (N × A)
Output shape:
- scalar mean entropy: mean_i[ -Σ_a p_i(a) log p_i(a) ].
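The corresponding reference computation, again as a NumPy sketch rather than the Ops interface:

```python
import numpy as np

def entropy_mean(logits):
    # p_i(a) and log p_i(a) via a numerically stable softmax over the action axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_norm
    probs = np.exp(log_probs)
    # Per-sample entropy -sum_a p_i(a) log p_i(a), then the batch mean.
    return (-(probs * log_probs).sum(axis=1)).mean()
```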
PPO (batched) #
PPO clipped surrogate objective (the thing to maximize), computed per sample:
L_clip_i = min(r_i * A_i, clip(r_i, 1-ε, 1+ε) * A_i)
where r_i = exp(logπ_new(a_i|s_i) - logπ_old(a_i|s_i)).
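A reference NumPy version of the per-sample surrogate (argument names and the ε = 0.2 default are illustrative, not the module's signature):

```python
import numpy as np

def ppo_clip_objective(new_log_prob, old_log_prob, advantages, eps=0.2):
    # r_i = exp(log pi_new - log pi_old), the per-sample probability ratio.
    ratio = np.exp(new_log_prob - old_log_prob)
    # Clipped surrogate: take the pessimistic (smaller) of the unclipped
    # and clipped terms for each sample; result has shape (N,).
    return np.minimum(ratio * advantages,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages)
```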
PPO scalar loss to minimize (mean over batch):
loss = -mean(L_clip) + c_v * MSE(v, v_target) - c_e * mean(entropy)
This is the standard discrete-action PPO loss used in many reference implementations.
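A shape-level NumPy sketch of the combined loss (the coefficient defaults c_v = 0.5 and c_e = 0.01 are common choices in reference implementations, not necessarily this module's defaults):

```python
import numpy as np

def ppo_loss(l_clip, values, value_target, entropy, c_v=0.5, c_e=0.01):
    # Negate the surrogate (we minimize), add the value-function MSE term,
    # and subtract the entropy bonus, exactly as in the formula above.
    value_mse = np.mean((values - value_target) ** 2)
    return -np.mean(l_clip) + c_v * value_mse - c_e * np.mean(entropy)
```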
PPO module wrapper (two-model actor/critic) #
Bundle an actor and critic into a ScalarModuleDef whose inputs are a PPO minibatch:
- states : (N × stateDim)
- actionsOneHot : (N × A)
- oldLogProb : (N)
- advantages : (N)
- valueTarget : (N × 1)
The parameters are actor.params ++ critic.params, and one optimizer step updates both.
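A rough, shape-only sketch of what such a minibatch and concatenated parameter list look like (every name and size here is a placeholder, not the wrapper's actual fields):

```python
import numpy as np

# Hypothetical minibatch layout matching the input shapes listed above
# (N = batch size; state_dim and num_actions are illustrative).
N, state_dim, num_actions = 32, 8, 4
minibatch = {
    "states":        np.zeros((N, state_dim)),
    "actionsOneHot": np.zeros((N, num_actions)),
    "oldLogProb":    np.zeros((N,)),
    "advantages":    np.zeros((N,)),
    "valueTarget":   np.zeros((N, 1)),
}

# The wrapper's parameter list is the actor's followed by the critic's,
# so a single optimizer step over this concatenation updates both models.
actor_params = [np.zeros((state_dim, num_actions))]   # placeholder actor weights
critic_params = [np.zeros((state_dim, 1))]            # placeholder critic weights
params = actor_params + critic_params
```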