TorchLean API

NN.Runtime.RL.Numerics.Float32.PPO

Checked Float32 PPO Objective Helpers #

PPO is usually run with ordinary host floats, but these helpers make the scalar objective pieces executable under the explicit IEEE32Exec model and reject non-finite intermediates. They are useful for regression tests, debugging numerically fragile runs, and connecting runtime checks to proof-side finite hypotheses.

Reference: Schulman et al., "Proximal Policy Optimization Algorithms" (2017).

Checked importance ratio exp(newLogProb - oldLogProb), specialized to IEEE32Exec.

This is the float32-semantics variant of Runtime.RL.PolicyGradient.importanceRatio.

Instances For

    Checked PPO clipped surrogate objective from a precomputed importance ratio:

    min(ratio * A, clip(ratio, 1-ε, 1+ε) * A).

    This avoids re-doing the softmax/log-prob computation when you already have ratios.

    Reference:

    Instances For