Policy-Gradient Objectives #
This module exposes typed helpers for the main categorical-policy objectives that modern policy-gradient code tends to rely on:
- REINFORCE,
- advantage actor-critic,
- trust-region / KL-penalized policy-gradient helpers,
- entropy regularization,
- soft actor-critic policy terms,
- PPO's clipped surrogate.
The helpers operate on logits for a finite action space and stay purely functional so they can be used from either eager runtime code or proof-oriented spec code.
Primary references:
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992): https://doi.org/10.1023/A:1022672621406
- Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (2016): https://arxiv.org/abs/1602.01783
- Schulman et al., "Trust Region Policy Optimization" (2015): https://arxiv.org/abs/1502.05477
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347
- Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2015): https://arxiv.org/abs/1506.02438
Softmax policy induced by a vector of logits.
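As a point of reference, here is a minimal stand-alone sketch of a numerically stabilized softmax over logits; the name `softmaxSketch` is hypothetical and is not the declaration documented here:

```lean
-- Hypothetical sketch only; the actual declaration's name and signature may differ.
-- Numerically stable softmax: subtract the max logit before exponentiating.
def softmaxSketch (logits : Array Float) : Array Float :=
  let m := logits.foldl (fun acc z => if z > acc then z else acc) (logits.getD 0 0.0)
  let exps := logits.map (fun z => Float.exp (z - m))
  let total := exps.foldl (· + ·) 0.0
  exps.map (fun e => e / total)
```

Subtracting the max logit keeps the exponentials finite for large logits without changing the resulting distribution.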
Instances For
Probability of a selected action under a categorical policy.
Instances For
Log-probability of a selected action.
Instances For
Entropy bonus for a categorical policy:
-Σ p(a) log p(a).
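For illustration, a hedged sketch over an explicit probability vector; the `1e-8` floor inside the log is an assumption (mirroring the clamping described for the KL helper below), not necessarily what this module does:

```lean
-- Hypothetical sketch: entropy -Σ p(a) log p(a) of a probability vector.
-- The 1e-8 floor inside the log is an assumption to keep log 0 out of the sum.
def entropySketch (probs : Array Float) : Float :=
  probs.foldl (fun acc p =>
    let q := if p < 1e-8 then 1e-8 else p
    acc - p * Float.log q) 0.0
```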
Instances For
REINFORCE loss for one sampled action:
-G_t * log π(a_t | s_t).
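A sketch of the sampled-action form, reusing the hypothetical `softmaxSketch` above; the probability floor is again an assumption:

```lean
-- Hypothetical sketch: -G_t * log π(a_t | s_t) for one sampled action.
def reinforceLossSketch (logits : Array Float) (action : Nat) (ret : Float) : Float :=
  let p := (softmaxSketch logits).getD action 1e-8
  -(ret * Float.log p)
```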
Instances For
Advantage actor-critic policy loss:
-A_t * log π(a_t | s_t).
Instances For
Value-regression loss used by actor-critic and PPO critics.
Instances For
Combined advantage actor-critic loss: policy term + value regression - entropy bonus.
Instances For
Advantage actor-critic loss with an explicit entropy bonus coefficient.
This is the A2C/A3C-shaped single-sample objective:
-A_t log π(a_t|s_t) + c_v value_loss - c_e H(π(.|s_t)).
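A sketch of how the terms combine, reusing the hypothetical `softmaxSketch` and `entropySketch` above; the parameter names and the plain squared-error value loss are assumptions:

```lean
-- Hypothetical sketch of the single-sample A2C/A3C objective:
--   -A_t log π(a_t|s_t) + c_v * value_loss - c_e * H(π(.|s_t)).
def a2cLossSketch (logits : Array Float) (action : Nat)
    (advantage value valueTarget cV cE : Float) : Float :=
  let probs := softmaxSketch logits
  let logP := Float.log (probs.getD action 1e-8)
  let d := value - valueTarget
  -(advantage * logP) + cV * (d * d) - cE * entropySketch probs
```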
Instances For
Importance ratio π_new(a|s) / π_old(a|s) computed from log-probabilities.
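In log space this is a one-liner; a sketch with hypothetical names:

```lean
-- Hypothetical sketch: π_new(a|s) / π_old(a|s) = exp(log π_new - log π_old).
def importanceRatioSketch (logProbNew logProbOld : Float) : Float :=
  Float.exp (logProbNew - logProbOld)
```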
Instances For
Categorical KL divergence KL(old || new) from two probability vectors:
Σ_a old(a) * (log old(a) - log new(a)).
Both distributions are clamped into [epsilon, 1-epsilon] before taking logs. This is the scalar
penalty used by TRPO-style diagnostics and KL-penalized policy-gradient objectives.
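A sketch of the clamped sum, assuming the two vectors have equal length; the name `klSketch` and the default epsilon are illustrative, not this module's actual choices:

```lean
-- Hypothetical sketch: KL(old || new) = Σ_a old(a) * (log old(a) - log new(a)),
-- with both probabilities clamped into [eps, 1 - eps] before the logs.
def klSketch (oldP newP : Array Float) (eps : Float := 1e-8) : Float :=
  let clamp := fun (p : Float) =>
    if p < eps then eps else if p > 1.0 - eps then 1.0 - eps else p
  (oldP.zip newP).foldl (fun acc pq =>
    let p := clamp pq.1
    let q := clamp pq.2
    acc + p * (Float.log p - Float.log q)) 0.0
```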
Instances For
KL divergence KL(π_old(.|s) || π_new(.|s)) from old/new logits.
Instances For
TRPO-style surrogate objective from a precomputed importance ratio:
ratio * A.
TRPO maximizes this surrogate subject to a KL trust-region constraint. The scalar surrogate is exposed separately from the constraint so callers can decide how to enforce it: a line search, a KL penalty, or diagnostics only.
Instances For
KL-penalized policy-gradient loss:
-(ratio * A) + β * KL(old || new).
This is not the full constrained TRPO optimizer; it is the differentiable scalar objective commonly used as a practical surrogate or diagnostic when implementing trust-region updates.
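A sketch of how the surrogate and the penalty combine, reusing the hypothetical `klSketch` above; `beta` is the caller-chosen penalty coefficient:

```lean
-- Hypothetical sketch: -(ratio * A) + β * KL(old || new).
def klPenalizedLossSketch (ratio advantage beta : Float)
    (oldP newP : Array Float) : Float :=
  -(ratio * advantage) + beta * klSketch oldP newP
```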
Instances For
Soft actor-critic categorical actor objective:
temperature * log π(a|s) - Q(s,a).
For continuous SAC the action is reparameterized; for finite actions this scalar is the sampled-action form used inside a categorical policy update.
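A sketch of the sampled-action scalar, with the policy taken from the hypothetical `softmaxSketch` above and Q given as a per-action vector; all names are assumptions:

```lean
-- Hypothetical sketch: temperature * log π(a|s) - Q(s,a) for one sampled action.
def sacActorLossSketch (logits qValues : Array Float) (action : Nat)
    (temperature : Float) : Float :=
  let logP := Float.log ((softmaxSketch logits).getD action 1e-8)
  temperature * logP - qValues.getD action 0.0
```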
Instances For
PPO clipped surrogate objective from a precomputed importance ratio:
min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).
This helper is useful when you already have the ratio (e.g. from cached log-probabilities) and want to avoid recomputing it from logits.
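A sketch of the clipped surrogate from a precomputed ratio; explicit `if` branches stand in for `clip` and `min`, and the names are assumptions:

```lean
-- Hypothetical sketch: min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).
def ppoClippedSurrogateSketch (ratio advantage eps : Float) : Float :=
  let clipped :=
    if ratio < 1.0 - eps then 1.0 - eps
    else if ratio > 1.0 + eps then 1.0 + eps
    else ratio
  let unclippedTerm := ratio * advantage
  let clippedTerm := clipped * advantage
  if unclippedTerm < clippedTerm then unclippedTerm else clippedTerm
```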
Instances For
PPO clipped surrogate objective for one sampled action.
This is the objective to maximize:
min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t).
Instances For
PPO loss to minimize:
-L_clip + c_v * value_loss - c_e * entropy.
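A sketch of the minimized scalar, taking the clipped surrogate, value loss, and entropy as precomputed inputs; the coefficient names are assumptions:

```lean
-- Hypothetical sketch: -L_clip + c_v * value_loss - c_e * entropy.
def ppoLossSketch (clipSurrogate valueLoss entropy cV cE : Float) : Float :=
  -clipSurrogate + cV * valueLoss - cE * entropy
```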
Instances For
Sample from a categorical distribution represented as a probability vector.
`seed` and `counter` form an explicit RNG stream identifier. The function returns the incremented counter together with the sampled action index.
Implementation note: this uses the standard cumulative-sum / inverse-CDF sampler.
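A sketch of the cumulative-sum walk; the way a uniform draw is derived from `seed` and `counter` here is a deliberately crude stand-in, not this module's RNG, and the returned pair order (incremented counter, action index) is an assumption:

```lean
-- Hypothetical sketch of the inverse-CDF sampler. The hash of (seed, counter)
-- into a uniform draw is illustrative only; the real RNG stream differs.
def sampleCategoricalSketch (probs : Array Float) (seed counter : Nat) : Nat × Nat :=
  let mixed := (seed * 2654435761 + counter * 40503 + 12345) % 1000003
  let u := Float.ofNat mixed / 1000003.0
  -- walk the CDF; the first index whose cumulative mass exceeds u is chosen
  let step := fun (st : Float × Nat × Nat) (p : Float) =>
    let (acc, chosen, i) := st
    let acc' := acc + p
    if chosen < probs.size then (acc', chosen, i + 1)   -- already picked
    else if u < acc' then (acc', i, i + 1)              -- pick this index
    else (acc', chosen, i + 1)
  let (_, chosen, _) := probs.foldl step (0.0, probs.size, 0)
  let action := if chosen < probs.size then chosen else probs.size - 1
  (counter + 1, action)
```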
Instances For
Sample an action from logits by applying softmax and then `sampleCategorical`.
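End to end, the two sketches above compose as follows; the `#eval` only shows the calling convention under the assumed return order:

```lean
-- Hypothetical sketch: softmax the logits, then defer to the categorical sampler.
def sampleFromLogitsSketch (logits : Array Float) (seed counter : Nat) : Nat × Nat :=
  sampleCategoricalSketch (softmaxSketch logits) seed counter

-- Example call: three actions, one draw from the stream (seed := 7, counter := 0).
#eval sampleFromLogitsSketch #[0.5, 1.5, -0.25] 7 0
```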