Policy-Gradient Objectives #
This module exposes typed helpers for the main categorical-policy objectives that modern policy-gradient code tends to rely on:
- REINFORCE,
- advantage actor-critic,
- trust-region / KL-penalized policy-gradient helpers,
- entropy regularization,
- soft actor-critic policy terms,
- PPO's clipped surrogate.
The helpers operate on logits for a finite action space and stay purely functional so they can be used from either eager runtime code or proof-oriented spec code.
Primary references:
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992): https://doi.org/10.1023/A:1022672621406
- Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (2016): https://arxiv.org/abs/1602.01783
- Schulman et al., "Trust Region Policy Optimization" (2015): https://arxiv.org/abs/1502.05477
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Softmax policy induced by a vector of logits.
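As a point of reference, here is a minimal stand-alone sketch of a numerically stabilized softmax over logits; the name `softmaxSketch` is hypothetical and is not the declaration documented here:

```lean
-- Hypothetical sketch only; the actual declaration's name and signature may differ.
-- Numerically stable softmax: subtract the max logit before exponentiating.
def softmaxSketch (logits : Array Float) : Array Float :=
  let m := logits.foldl (fun acc z => if z > acc then z else acc) (logits.getD 0 0.0)
  let exps := logits.map (fun z => Float.exp (z - m))
  let total := exps.foldl (· + ·) 0.0
  exps.map (fun e => e / total)
```

Subtracting the max logit keeps the exponentials finite for large logits without changing the resulting distribution.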
Instances For
Probability of a selected action under a categorical policy.
Instances For
Log-probability of a selected action.
Instances For
Entropy bonus for a categorical policy:
-Σ p(a) log p(a).
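For illustration, a hedged sketch over an explicit probability vector; the `1e-8` floor inside the log is an assumption (mirroring the clamping described for the KL helper below), not necessarily what this module does:

```lean
-- Hypothetical sketch: entropy -Σ p(a) log p(a) of a probability vector.
-- The 1e-8 floor inside the log is an assumption to keep log 0 out of the sum.
def entropySketch (probs : Array Float) : Float :=
  probs.foldl (fun acc p =>
    let q := if p < 1e-8 then 1e-8 else p
    acc - p * Float.log q) 0.0
```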
Instances For
REINFORCE loss for one sampled action:
-G_t * log π(a_t | s_t).
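A sketch of the sampled-action form, reusing the hypothetical `softmaxSketch` above; the probability floor is again an assumption:

```lean
-- Hypothetical sketch: -G_t * log π(a_t | s_t) for one sampled action.
def reinforceLossSketch (logits : Array Float) (action : Nat) (ret : Float) : Float :=
  let p := (softmaxSketch logits).getD action 1e-8
  -(ret * Float.log p)
```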
Instances For
Advantage actor-critic policy loss:
-A_t * log π(a_t | s_t).
Instances For
Value-regression loss used by actor-critic and PPO critics.
Instances For
Combined advantage actor-critic loss: policy term + value regression - entropy bonus.
Instances For
Advantage actor-critic loss with an explicit entropy bonus coefficient.
This is the A2C/A3C-shaped single-sample objective:
-A_t log π(a_t|s_t) + c_v value_loss - c_e H(π(.|s_t)).
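A sketch of how the terms combine, reusing the hypothetical `softmaxSketch` and `entropySketch` above; the parameter names and the plain squared-error value loss are assumptions:

```lean
-- Hypothetical sketch of the single-sample A2C/A3C objective:
--   -A_t log π(a_t|s_t) + c_v * value_loss - c_e * H(π(.|s_t)).
def a2cLossSketch (logits : Array Float) (action : Nat)
    (advantage value valueTarget cV cE : Float) : Float :=
  let probs := softmaxSketch logits
  let logP := Float.log (probs.getD action 1e-8)
  let d := value - valueTarget
  -(advantage * logP) + cV * (d * d) - cE * entropySketch probs
```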
Instances For
Importance ratio π_new(a|s) / π_old(a|s) computed from log-probabilities.
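In log space this is a one-liner; a sketch with hypothetical names:

```lean
-- Hypothetical sketch: π_new(a|s) / π_old(a|s) = exp(log π_new - log π_old).
def importanceRatioSketch (logProbNew logProbOld : Float) : Float :=
  Float.exp (logProbNew - logProbOld)
```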
Instances For
Categorical KL divergence KL(old || new) from two probability vectors:
Σ_a old(a) * (log old(a) - log new(a)).
Both distributions are clamped into [epsilon, 1-epsilon] before taking logs. This is the scalar
penalty used by TRPO-style diagnostics and KL-penalized policy-gradient objectives.
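A sketch of the clamped sum, assuming the two vectors have equal length; the name `klSketch` and the default epsilon are illustrative, not this module's actual choices:

```lean
-- Hypothetical sketch: KL(old || new) = Σ_a old(a) * (log old(a) - log new(a)),
-- with both probabilities clamped into [eps, 1 - eps] before the logs.
def klSketch (oldP newP : Array Float) (eps : Float := 1e-8) : Float :=
  let clamp := fun (p : Float) =>
    if p < eps then eps else if p > 1.0 - eps then 1.0 - eps else p
  (oldP.zip newP).foldl (fun acc pq =>
    let p := clamp pq.1
    let q := clamp pq.2
    acc + p * (Float.log p - Float.log q)) 0.0
```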
Instances For
KL divergence KL(π_old(.|s) || π_new(.|s)) from old/new logits.
Instances For
TRPO-style surrogate objective from a precomputed importance ratio:
ratio * A.
TRPO maximizes this surrogate subject to a KL trust-region constraint. The scalar surrogate is exposed separately from the constraint so callers can decide how to enforce it: a line search, a KL penalty, or diagnostics only.
Instances For
KL-penalized policy-gradient loss:
-(ratio * A) + β * KL(old || new).
This is not the full constrained TRPO optimizer; it is the differentiable scalar objective commonly used as a practical surrogate or diagnostic when implementing trust-region updates.
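A sketch of how the surrogate and the penalty combine, reusing the hypothetical `klSketch` above; `beta` is the caller-chosen penalty coefficient:

```lean
-- Hypothetical sketch: -(ratio * A) + β * KL(old || new).
def klPenalizedLossSketch (ratio advantage beta : Float)
    (oldP newP : Array Float) : Float :=
  -(ratio * advantage) + beta * klSketch oldP newP
```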
Instances For
Soft actor-critic categorical actor objective:
temperature * log π(a|s) - Q(s,a).
For continuous SAC the action is reparameterized; for finite actions this scalar is the sampled-action form used inside a categorical policy update.
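A sketch of the sampled-action scalar, with the policy taken from the hypothetical `softmaxSketch` above and Q given as a per-action vector; all names are assumptions:

```lean
-- Hypothetical sketch: temperature * log π(a|s) - Q(s,a) for one sampled action.
def sacActorLossSketch (logits qValues : Array Float) (action : Nat)
    (temperature : Float) : Float :=
  let logP := Float.log ((softmaxSketch logits).getD action 1e-8)
  temperature * logP - qValues.getD action 0.0
```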
Instances For
PPO clipped surrogate objective from a precomputed importance ratio:
min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).
This helper is useful when you already have the ratio (e.g. from cached log-probabilities) and want to avoid recomputing it from logits.
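A sketch of the clipped surrogate from a precomputed ratio; explicit `if` branches stand in for `clip` and `min`, and the names are assumptions:

```lean
-- Hypothetical sketch: min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).
def ppoClippedSurrogateSketch (ratio advantage eps : Float) : Float :=
  let clipped :=
    if ratio < 1.0 - eps then 1.0 - eps
    else if ratio > 1.0 + eps then 1.0 + eps
    else ratio
  let unclippedTerm := ratio * advantage
  let clippedTerm := clipped * advantage
  if unclippedTerm < clippedTerm then unclippedTerm else clippedTerm
```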
Instances For
PPO clipped surrogate objective for one sampled action.
This is the objective to maximize:
min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t).
Instances For
PPO loss to minimize:
-L_clip + c_v * value_loss - c_e * entropy.
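A sketch of the minimized scalar, taking the clipped surrogate, value loss, and entropy as precomputed inputs; the coefficient names are assumptions:

```lean
-- Hypothetical sketch: -L_clip + c_v * value_loss - c_e * entropy.
def ppoLossSketch (clipSurrogate valueLoss entropy cV cE : Float) : Float :=
  -clipSurrogate + cV * valueLoss - cE * entropy
```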
Instances For
Sample from a categorical distribution represented as a probability vector.
`seed` and `counter` form an explicit RNG stream identifier. The function returns the incremented counter together with the sampled action index.
Implementation note: this uses the standard cumulative-sum / inverse-CDF sampler.
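A sketch of the cumulative-sum walk; the way a uniform draw is derived from `seed` and `counter` here is a deliberately crude stand-in, not this module's RNG, and the returned pair order (incremented counter, action index) is an assumption:

```lean
-- Hypothetical sketch of the inverse-CDF sampler. The hash of (seed, counter)
-- into a uniform draw is illustrative only; the real RNG stream differs.
def sampleCategoricalSketch (probs : Array Float) (seed counter : Nat) : Nat × Nat :=
  let mixed := (seed * 2654435761 + counter * 40503 + 12345) % 1000003
  let u := Float.ofNat mixed / 1000003.0
  -- walk the CDF; the first index whose cumulative mass exceeds u is chosen
  let step := fun (st : Float × Nat × Nat) (p : Float) =>
    let (acc, chosen, i) := st
    let acc' := acc + p
    if chosen < probs.size then (acc', chosen, i + 1)   -- already picked
    else if u < acc' then (acc', i, i + 1)              -- pick this index
    else (acc', chosen, i + 1)
  let (_, chosen, _) := probs.foldl step (0.0, probs.size, 0)
  let action := if chosen < probs.size then chosen else probs.size - 1
  (counter + 1, action)
```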
Instances For
Sample an action from logits by applying softmax and then `sampleCategorical`.
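End to end, the two sketches above compose as follows; the `#eval` only shows the calling convention under the assumed return order:

```lean
-- Hypothetical sketch: softmax the logits, then defer to the categorical sampler.
def sampleFromLogitsSketch (logits : Array Float) (seed counter : Nat) : Nat × Nat :=
  sampleCategoricalSketch (softmaxSketch logits) seed counter

-- Example call: three actions, one draw from the stream (seed := 7, counter := 0).
#eval sampleFromLogitsSketch #[0.5, 1.5, -0.25] 7 0
```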