Deep Value-Learning Objectives #
This module packages the core scalar objectives and bootstrap targets behind common deep RL algorithms:
- DQN and Double DQN,
- DDPG-style actor / critic objectives,
- TD3 clipped double critics,
- SAC entropy-regularized targets and actor objectives.
The functions are intentionally small and typed. They expose the textbook math while leaving experience replay, target-network sync, and optimizer orchestration to higher-level code.
Primary references:
- Mnih et al., "Human-level control through deep reinforcement learning" (2015): https://doi.org/10.1038/nature14236
- van Hasselt, Guez, and Silver, "Deep Reinforcement Learning with Double Q-learning" (2016): https://arxiv.org/abs/1509.06461
- Lillicrap et al., "Continuous Control with Deep Reinforcement Learning" (2015): https://arxiv.org/abs/1509.02971
- Fujimoto et al., "Addressing Function Approximation Error in Actor-Critic Methods" (2018): https://arxiv.org/abs/1802.09477
- Haarnoja et al., "Soft Actor-Critic" (2018): https://arxiv.org/abs/1801.01290
Extract Q(s, a) from a vector of action-values.
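A minimal Lean sketch (the name `qValue` and the `Fin n → Float` encoding of the action-value vector are assumptions, not the module's actual signature): extraction is plain function application.

```lean
/-- Hypothetical sketch: with action-values encoded as `Fin n → Float`,
    `Q(s, a)` is just application at index `a`. -/
def qValue {n : Nat} (q : Fin n → Float) (a : Fin n) : Float :=
  q a
```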
Maximum Q-value in a vector, defaulting to 0 when nActions = 0.
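One way this could look in Lean (names and encoding again assumed): fold `max` over the vector, seeding with the first entry, and return 0 only in the empty case.

```lean
/-- Hypothetical sketch: maximum over all action-values, 0 when `n = 0`. -/
def maxQ {n : Nat} (q : Fin n → Float) : Float :=
  if h : n = 0 then 0
  else (Array.ofFn q).foldl max (q ⟨0, Nat.pos_of_ne_zero h⟩)
```

Seeding the fold with `q 0` rather than `0` matters in the nonempty case: initializing at 0 would silently clamp an all-negative value vector to 0.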
DQN bootstrap target r + γ max_a Q_target(s', a).
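Given `maxQ` from the previous sketch, the target is one line. This is a sketch; terminal-state masking (e.g. zeroing the bootstrap term when the episode is done) is left to the caller, as the module intro suggests.

```lean
/-- Hypothetical sketch of the DQN target `r + γ max_a Q_target(s', a)`. -/
def dqnTarget {n : Nat} (r γ : Float) (qTarget : Fin n → Float) : Float :=
  r + γ * maxQ qTarget
```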
Double DQN target: select with the online network, evaluate with the target network.
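A possible Lean rendering (all names assumed): pick the argmax action under the online network, then read that action's value from the target network. Decoupling selection from evaluation is the Double Q-learning remedy for the overestimation bias of the plain max.

```lean
/-- Hypothetical sketch: select with `qOnline`, evaluate with `qTarget`. -/
def doubleDqnTarget {n : Nat} (r γ : Float)
    (qOnline qTarget : Fin n → Float) : Float :=
  if h : n = 0 then r  -- mirror the `maxQ = 0` convention for empty vectors
  else
    let a0 : Fin n := ⟨0, Nat.pos_of_ne_zero h⟩
    -- Greedy action under the online network …
    let aStar := (Array.ofFn (fun a : Fin n => a)).foldl
      (fun best a => if qOnline a > qOnline best then a else best) a0
    -- … evaluated under the target network.
    r + γ * qTarget aStar
```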
DQN temporal-difference residual for one sampled action.
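As a sketch (hypothetical names), the residual is the signed gap between the bootstrap target and the predicted value:

```lean
/-- Hypothetical sketch: TD residual `δ = y - Q(s, a)` for one sampled action,
    where `y` is a DQN or Double DQN target as above. -/
def tdResidual (target qsa : Float) : Float :=
  target - qsa
```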
Squared TD loss for DQN.
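A sketch of the corresponding loss (whether the module uses `δ²` or `½δ²` is not shown here; the constant only rescales gradients):

```lean
/-- Hypothetical sketch: squared TD loss. -/
def squaredTdLoss (δ : Float) : Float :=
  δ * δ
```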
Huber TD loss for DQN.
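The Huber loss is quadratic near zero and linear in the tails, which keeps gradients bounded on outlier transitions. A sketch with an assumed threshold parameter `κ` (DQN's original error clipping corresponds to `κ = 1`):

```lean
/-- Hypothetical sketch: Huber loss, quadratic for `|δ| ≤ κ`, linear beyond,
    with value and slope matching at the threshold. -/
def huberTdLoss (δ : Float) (κ : Float := 1.0) : Float :=
  let a := Float.abs δ
  if a ≤ κ then 0.5 * δ * δ else κ * (a - 0.5 * κ)

#eval huberTdLoss 2.0  -- 1.5: past the threshold, so κ * (|δ| - κ/2)
```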
Double DQN temporal-difference residual.
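Composed from the earlier sketches (same caveat on names):

```lean
/-- Hypothetical sketch: TD residual against the Double DQN target. -/
def doubleDqnResidual {n : Nat} (r γ qsa : Float)
    (qOnline qTarget : Fin n → Float) : Float :=
  doubleDqnTarget r γ qOnline qTarget - qsa
```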
Deterministic-policy-gradient actor objective used by DDPG:
maximize Q(s, μ(s)), or equivalently minimize -Q(s, μ(s)).
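Since maximization is expressed as minimizing the negation, the per-sample loss is a one-liner; the caller supplies the critic's value at the deterministic action μ(s). A sketch with an assumed name:

```lean
/-- Hypothetical sketch: DDPG actor loss `-Q(s, μ(s))`. -/
def ddpgActorLoss (qAtMu : Float) : Float :=
  -qAtMu
```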
SAC actor objective:
minimize α * log π(a|s) - min(Q1(s, a), Q2(s, a)) for actions sampled from the current policy.
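A per-sample sketch (names assumed; batching and the expectation over sampled actions are left to the caller, as with the other objectives):

```lean
/-- Hypothetical sketch: SAC actor loss `α * log π(a|s) - min(Q₁, Q₂)`,
    with the clipped double-critic minimum taken at the sampled action. -/
def sacActorLoss (α logPi q1 q2 : Float) : Float :=
  α * logPi - min q1 q2
```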