TorchLean API

NN.Runtime.RL.Algorithms.ValueLearning

Deep Value-Learning Objectives

This module packages the core scalar objectives and targets behind common deep value-based RL algorithms:

- DQN (bootstrap target, TD residual, squared and Huber losses)
- Double DQN (decoupled action selection and evaluation)
- DDPG (actor objective and critic target)
- TD3 (clipped-double critic target)
- SAC (entropy-regularized soft target and actor objective)

The functions are intentionally small and typed. They expose the textbook math while leaving experience replay, target-network sync, and optimizer orchestration to higher-level code.

def Runtime.RL.ValueLearning.chosenActionValue {α : Type} {nActions : ℕ} (qValues : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) : α

Extract Q(s, a) from a vector of action-values.

def Runtime.RL.ValueLearning.maxQValue {α : Type} [Context α] {nActions : ℕ} (qValues : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Maximum Q-value in a vector, defaulting to 0 when nActions = 0.
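
A minimal list-level sketch of the same convention, for intuition only (the library version folds over a Spec.Tensor; the helper below is hypothetical):

  -- Maximum over the elements; an empty action set falls back to 0,
  -- matching the nActions = 0 case above.
  def maxQValueSketch (qs : List Float) : Float :=
    match qs with
    | [] => 0.0
    | q :: rest => rest.foldl (fun acc x => if acc < x then x else acc) q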

def Runtime.RL.ValueLearning.dqnTarget {α : Type} [Context α] {nActions : ℕ} (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

DQN bootstrap target r + γ max_a Q_target(s', a).
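
A scalar sketch of the formula (assumptions: maxNextQ stands for the maxQValue of nextQTarget, and the done flag drops the bootstrap term, which the signature suggests but the docstring does not state):

  -- Hypothetical scalar version, not the tensor implementation.
  def dqnTargetSketch (r γ maxNextQ : Float) (done : Bool) : Float :=
    if done then r          -- terminal state: no bootstrap
    else r + γ * maxNextQ   -- r + γ max_a Q_target(s', a)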

def Runtime.RL.ValueLearning.doubleDqnTarget {α : Type} [Context α] {nActions : ℕ} (reward gamma : α) (done : Bool) (nextQOnline nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Double DQN target: select with the online network, evaluate with the target network.
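
A list-level sketch of the select/evaluate split (illustrative only; zip-based pairing stands in for tensor indexing): pick the argmax under the online values, then read that slot from the target values.

  -- Argmax over the online net's values, evaluated with the target net.
  def doubleDqnEvalSketch (online target : List Float) : Float :=
    match online.zip target with
    | [] => 0.0
    | p :: rest =>
      (rest.foldl (fun best q => if best.1 < q.1 then q else best) p).2

The bootstrap target is then formed exactly as in dqnTarget, with this value in place of the plain maximum.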

def Runtime.RL.ValueLearning.dqnResidual {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

DQN temporal-difference residual for one sampled action.
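
A scalar sketch, assuming the usual convention residual = target - Q(s, a) (the sign is an assumption; check the source if it matters for your optimizer):

  -- Hypothetical scalar version of the TD residual.
  def dqnResidualSketch (qSA target : Float) : Float :=
    target - qSA

Squaring this residual gives the MSE loss below; passing it through a Huber function gives the robust variant.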

def Runtime.RL.ValueLearning.dqnMSELoss {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Squared TD loss for DQN.

def Runtime.RL.ValueLearning.dqnHuberLoss {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (delta : α := 1) : α

Huber TD loss for DQN.
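
A self-contained sketch of the standard Huber function applied to the residual (the 0.5 scaling and the default delta = 1 follow the common textbook definition; the library's exact scaling is an assumption):

  -- Quadratic near zero, linear in the tails; agrees with half the
  -- squared loss whenever |x| ≤ delta.
  def huberSketch (x : Float) (delta : Float := 1.0) : Float :=
    let a := Float.abs x
    if a ≤ delta then 0.5 * x * x
    else delta * (a - 0.5 * delta)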

def Runtime.RL.ValueLearning.doubleDqnResidual {α : Type} [Context α] {nActions : ℕ} (qPred : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) (action : Fin nActions) (reward gamma : α) (done : Bool) (nextQOnline nextQTarget : Spec.Tensor α (Spec.Shape.dim nActions Spec.Shape.scalar)) : α

Double-DQN temporal-difference residual.

def Runtime.RL.ValueLearning.ddpgActorObjective {α : Type} [Context α] (criticValue : α) : α

Deterministic-policy-gradient actor objective used by DDPG: maximize Q(s, μ(s)), or equivalently minimize -Q(s, μ(s)).
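
The scalar form is a plain negation; the point is that running a descent optimizer on -Q(s, μ(s)) performs gradient ascent on Q(s, μ(s)):

  -- Minimizing the negated critic value maximizes the critic value.
  def ddpgActorObjectiveSketch (criticValue : Float) : Float :=
    -criticValue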

def Runtime.RL.ValueLearning.ddpgCriticTarget {α : Type} [Context α] (reward gamma nextCriticValue : α) (done : Bool := false) : α

DDPG critic target r + γ Q_target(s', μ_target(s')).

def Runtime.RL.ValueLearning.td3Target {α : Type} [Context α] (reward gamma nextCritic1 nextCritic2 : α) (done : Bool := false) : α

TD3 clipped-double target using the minimum of the two target critics.
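
A scalar sketch (terminal-state masking via done is assumed, as in the DQN target):

  -- Clipped double-Q: bootstrap from the smaller (pessimistic) critic
  -- to counteract overestimation bias.
  def td3TargetSketch (r γ q1 q2 : Float) (done : Bool := false) : Float :=
    if done then r
    else r + γ * (if q1 ≤ q2 then q1 else q2)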

def Runtime.RL.ValueLearning.sacTarget {α : Type} [Context α] (reward gamma nextCritic1 nextCritic2 logProb temperature : α) (done : Bool := false) : α

SAC entropy-regularized soft target: r + γ (min(Q1', Q2') - α * log π(a'|s')).
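
A scalar sketch of the soft target (terminal masking via done is assumed; temperature plays the role of α in the formula above):

  -- Soft next-state value: pessimistic critic minimum plus entropy bonus.
  def sacTargetSketch (r γ q1 q2 logPi temp : Float)
      (done : Bool := false) : Float :=
    let softV := (if q1 ≤ q2 then q1 else q2) - temp * logPi
    if done then r else r + γ * softV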

def Runtime.RL.ValueLearning.sacActorObjective {α : Type} [Context α] (critic1 critic2 logProb temperature : α) : α

SAC actor objective: minimize α * log π(a|s) - min(Q1, Q2).
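
A scalar sketch; minimizing this trades entropy (through the log-probability term) against the pessimistic critic value:

  -- Lower is better: favors high min(Q1, Q2) and low log π(a|s),
  -- i.e. high entropy, weighted by the temperature.
  def sacActorObjectiveSketch (q1 q2 logPi temp : Float) : Float :=
    temp * logPi - (if q1 ≤ q2 then q1 else q2)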
