Checked Float32 PPO Objective Helpers

PPO is usually run with ordinary host floats, but these helpers make the scalar objective pieces executable under the explicit IEEE32Exec model and reject non-finite intermediates. They are useful for regression tests, debugging numerically fragile runs, and connecting runtime checks to proof-side finite hypotheses.

Reference: Schulman et al., "Proximal Policy Optimization Algorithms" (2017).

def Runtime.RL.Numerics.Float32.importanceRatioIEEE32ExecChecked (newLogProb oldLogProb : Float32Exec) : Except String Float32Exec

Checked importance ratio exp(newLogProb - oldLogProb), specialized to IEEE32Exec.

This is the float32-semantics variant of Runtime.RL.PolicyGradient.importanceRatio.
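The check-then-propagate pattern described above can be sketched in plain Lean 4. This is a minimal illustration, not the library's implementation: it uses core `Float` (binary64) as a stand-in for the float32 execution type, and the names `checkFinite` and `importanceRatioChecked` are hypothetical.

```lean
-- Hypothetical stand-in: core `Float` (binary64) replaces the library's float32 type.
-- Return the value if it is finite, otherwise an error naming the offending step.
def checkFinite (label : String) (x : Float) : Except String Float :=
  if x.isFinite then .ok x else .error s!"{label} is not finite: {x}"

-- Checked exp(newLogProb - oldLogProb): both the difference and the
-- exponential are validated, so NaN or infinity surfaces as an error.
def importanceRatioChecked (newLogProb oldLogProb : Float) : Except String Float := do
  let diff ← checkFinite "log-prob difference" (newLogProb - oldLogProb)
  checkFinite "importance ratio" (Float.exp diff)

#eval importanceRatioChecked (-1.0) (-1.0)  -- equal log-probs give a ratio of 1
#eval importanceRatioChecked 1000.0 0.0     -- exp overflows, so this is an error
```

Returning `Except String` rather than a raw float lets a caller tell which intermediate became non-finite, which is the property the checked helpers are meant to provide.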

def Runtime.RL.Numerics.Float32.ppoClippedObjectiveFromRatioIEEE32ExecChecked (ratio advantage clipEps : Float32Exec) : Except String Float32Exec

Checked PPO clipped surrogate objective from a precomputed importance ratio:

min(ratio * A, clip(ratio, 1-ε, 1+ε) * A),

where A is advantage and ε is clipEps.

This avoids re-doing the softmax/log-prob computation when you already have ratios.
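As a rough sketch of how the clipped objective can be evaluated with finiteness checks on each intermediate, the following plain Lean 4 code may help. It is an assumption-laden illustration rather than the library's code: core `Float` (binary64) stands in for the float32 execution type, and `ensureFinite`, `clampFloat`, and `ppoClippedFromRatioChecked` are hypothetical names.

```lean
-- Hypothetical stand-in: core `Float` (binary64) replaces the library's float32 type.
def ensureFinite (label : String) (x : Float) : Except String Float :=
  if x.isFinite then .ok x else .error s!"{label} is not finite: {x}"

-- Clamp x to [lo, hi]; assumes lo ≤ hi.
def clampFloat (x lo hi : Float) : Float :=
  if x < lo then lo else if x > hi then hi else x

-- min(ratio * A, clip(ratio, 1 - ε, 1 + ε) * A) with every product checked.
def ppoClippedFromRatioChecked (ratio advantage clipEps : Float) : Except String Float := do
  let eps ← ensureFinite "clipEps" clipEps
  let unclipped ← ensureFinite "ratio * advantage" (ratio * advantage)
  let clippedRatio := clampFloat ratio (1.0 - eps) (1.0 + eps)
  let clipped ← ensureFinite "clipped ratio * advantage" (clippedRatio * advantage)
  ensureFinite "objective" (if unclipped ≤ clipped then unclipped else clipped)

#eval ppoClippedFromRatioChecked 1.5 2.0 0.2     -- positive advantage: the clipped term is smaller
#eval ppoClippedFromRatioChecked 0.5 (-1.0) 0.2  -- negative advantage: the clipped term is again the minimum
```

Taking the minimum of the two terms means the objective never rewards moving the ratio further outside [1-ε, 1+ε] in the direction the advantage favors, which is the pessimistic bound described in the PPO paper.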

Reference: Schulman et al., "Proximal Policy Optimization Algorithms" (2017), https://arxiv.org/abs/1707.06347

TorchLean API

NN.Runtime.RL.Numerics.Float32.PPO
