Checked Float32 PPO Objective Helpers #
PPO is usually run with ordinary host floats, but these helpers make the scalar objective pieces
executable under the explicit IEEE32Exec model and reject non-finite intermediates. They are useful
for regression tests, debugging numerically fragile runs, and connecting runtime checks to proof-side
finite hypotheses.
Reference: Schulman et al., "Proximal Policy Optimization Algorithms" (2017).
Checked importance ratio exp(newLogProb - oldLogProb), specialized to IEEE32Exec.
This is the float32-semantics variant of Runtime.RL.PolicyGradient.importanceRatio.
Instances For
Checked PPO clipped surrogate objective from a precomputed importance ratio:
min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).
This avoids re-doing the softmax/log-prob computation when you already have ratios.
Reference:
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017): https://arxiv.org/abs/1707.06347