NN.Spec.RL.Core

Core Reinforcement-Learning Definitions #

This module collects the small mathematical definitions that sit underneath TorchLean's RL development.

These definitions are intentionally spec-level rather than runtime-level: they keep the actual RL mathematics in a proof-friendly namespace and avoid duplicating it inside runtime/trainer code.

Why Lists (Not Tensors)? #

Several helpers here deliberately operate on List α rather than Tensor α (.dim n .scalar): trajectories are naturally variable-length, and structural recursion over lists is easy to state and reason about in proofs.

When you do have a fixed horizon n, it is reasonable to use Fin n → α or a vector tensor and define specialized “returns/GAE” helpers on top. We keep the core definitions here compact and general, and add fixed-horizon variants where they meaningfully improve downstream code.

structure Spec.RL.AdvantageStep (α : Type) : Type

Small record used by generalized-advantage-estimation helpers.

  • reward : α

    Immediate reward r_t.

  • value : α

    Baseline / critic value estimate V(s_t).

  • nextValue : α

    Bootstrap value V(s_{t+1}).

  • done : Bool

    Episode termination flag.

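For illustration, a single Float-valued transition could be packaged like this (the field names are the ones listed above; the numeric values are made up):

  -- A hypothetical transition; the numbers are arbitrary.
  def exampleStep : Spec.RL.AdvantageStep Float :=
    { reward := 1.0, value := 0.5, nextValue := 0.7, done := false }
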
def Spec.RL.continueMask {α : Type} [Zero α] [One α] (done : Bool) : α

Convert a terminal flag into a multiplicative continuation mask (1 for continue, 0 for stop).

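The docstring pins the behavior down completely, so a faithful sketch is easy to give. This is a plausible spelling, not necessarily the library's exact definition; it assumes the usual Zero/One-to-OfNat bridging (as in Mathlib) so that the literals 0 and 1 elaborate at α:

  def continueMask {α : Type} [Zero α] [One α] (done : Bool) : α :=
    -- 1 means "continue", 0 means "stop": a multiplicative mask.
    if done then 0 else 1
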
def Spec.RL.discountedBackup {α : Type} [Zero α] [One α] [Add α] [Mul α] (reward gamma bootstrap : α) (done : Bool) : α

Bellman-style one-step backup: r + γ * (1 - done) * bootstrap.

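A sketch that matches the stated formula, phrased with continueMask so the Bool never has to be coerced into α (written as if inside the Spec.RL namespace):

  def discountedBackup {α : Type} [Zero α] [One α] [Add α] [Mul α]
      (reward gamma bootstrap : α) (done : Bool) : α :=
    -- r + γ * (1 - done) * bootstrap, with (1 - done) as a 0/1 mask.
    reward + gamma * continueMask done * bootstrap
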
def Spec.RL.tdTarget {α : Type} [Zero α] [One α] [Add α] [Mul α] (reward gamma nextValue : α) (done : Bool) : α

One-step TD target r + γ * (1 - done) * nextValue for state-value or action-value updates.

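Under these definitions, the TD target is exactly a discounted backup that bootstraps from V(s_{t+1}); a plausible sketch (again as if inside the Spec.RL namespace):

  def tdTarget {α : Type} [Zero α] [One α] [Add α] [Mul α]
      (reward gamma nextValue : α) (done : Bool) : α :=
    discountedBackup reward gamma nextValue done
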
def Spec.RL.tdResidual {α : Type} [Zero α] [One α] [Add α] [Mul α] [Sub α] (value reward gamma nextValue : α) (done : Bool) : α

TD residual / Bellman error: r + γ * (1 - done) * nextValue - value.

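Correspondingly, the residual is plausibly just the target minus the current estimate:

  def tdResidual {α : Type} [Zero α] [One α] [Add α] [Mul α] [Sub α]
      (value reward gamma nextValue : α) (done : Bool) : α :=
    -- r + γ * (1 - done) * nextValue - value
    tdTarget reward gamma nextValue done - value
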
def Spec.RL.discountedReturnsFrom {α : Type} [Zero α] [Add α] [Mul α] (gamma : α) (rewards : List α) (bootstrap : α := 0) : List α

Discounted returns with a bootstrap value on the far right: G_t = r_t + γ G_{t+1}, seeded with G_n = bootstrap.

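One way to realize the recurrence is a single right fold that threads G_{t+1} while accumulating the forward-order result (a sketch, not necessarily the library's definition):

  def discountedReturnsFrom {α : Type} [Zero α] [Add α] [Mul α]
      (gamma : α) (rewards : List α) (bootstrap : α := 0) : List α :=
    -- State: (G_{t+1}, returns computed so far, in forward order).
    (rewards.foldr
      (fun r (st : α × List α) =>
        let g := r + gamma * st.1  -- G_t = r_t + γ G_{t+1}
        (g, g :: st.2))
      (bootstrap, [])).2
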
def Spec.RL.discountedReturns {α : Type} [Zero α] [Add α] [Mul α] (gamma : α) (rewards : List α) : List α

Discounted returns for a terminal trajectory (bootstrap defaults to 0).

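For example, with γ = 0.5 and three unit rewards, G_2 = 1, G_1 = 1 + 0.5 · 1 = 1.5, and G_0 = 1 + 0.5 · 1.5 = 1.75 (assuming the needed numeric instances for Float are in scope):

  #eval Spec.RL.discountedReturns (α := Float) 0.5 [1.0, 1.0, 1.0]
  -- expected: [1.75, 1.5, 1.0] (up to Float printing)
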
def Spec.RL.discountedReturnsDone {α : Type} [Zero α] [One α] [Add α] [Mul α] (gamma : α) (rewards : List α) (dones : List Bool) (bootstrap : α := 0) : List α

Discounted returns with explicit termination markers.

When done = true at step t, the future return is reset to 0 before it is discounted into G_t, so rewards never leak across episode boundaries.

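The reset behavior corresponds to masking the future return with continueMask before it is discounted; a sketch over the zipped (reward, done) pairs:

  def discountedReturnsDone {α : Type} [Zero α] [One α] [Add α] [Mul α]
      (gamma : α) (rewards : List α) (dones : List Bool) (bootstrap : α := 0) : List α :=
    ((rewards.zip dones).foldr
      (fun (rd : α × Bool) (st : α × List α) =>
        -- G_t = r_t + γ * (1 - done_t) * G_{t+1}
        let g := rd.1 + gamma * continueMask rd.2 * st.1
        (g, g :: st.2))
      (bootstrap, [])).2
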
def Spec.RL.generalizedAdvantageEstimation {α : Type} [Zero α] [One α] [Add α] [Mul α] [Sub α] (gamma lam : α) (steps : List (AdvantageStep α)) : List α

Generalized Advantage Estimation (GAE).

Each input step provides r_t, V(s_t), V(s_{t+1}), and done_t. Advantages follow the standard backward recurrence δ_t = r_t + γ (1 - done_t) V(s_{t+1}) - V(s_t), A_t = δ_t + γ λ (1 - done_t) A_{t+1}, and the resulting list contains them in forward time order.

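A sketch of the backward recursion, matching the recurrence above (written as if inside the Spec.RL namespace; the advantage after the last step is seeded with 0):

  def generalizedAdvantageEstimation {α : Type} [Zero α] [One α] [Add α] [Mul α] [Sub α]
      (gamma lam : α) (steps : List (AdvantageStep α)) : List α :=
    (steps.foldr
      (fun s (st : α × List α) =>
        let mask := continueMask (α := α) s.done
        -- δ_t = r_t + γ (1 - done_t) V(s_{t+1}) - V(s_t)
        let delta := s.reward + gamma * mask * s.nextValue - s.value
        -- A_t = δ_t + γ λ (1 - done_t) A_{t+1}
        let adv := delta + gamma * lam * mask * st.1
        (adv, adv :: st.2))
      (0, [])).2
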
def Spec.RL.returnsFromAdvantages {α : Type} [Add α] : List α → List α → List α

Recover lambda-returns from advantages and baseline values via R_t = A_t + V(s_t).

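Since the recovery is elementwise, the whole definition is plausibly a zipWith; a sketch, together with how it would compose with the GAE helper (the pipeline in the comments is illustrative):

  def returnsFromAdvantages {α : Type} [Add α] : List α → List α → List α :=
    List.zipWith (· + ·)

  -- Illustrative pipeline: advantages from GAE, then λ-returns as critic targets.
  -- let advs := generalizedAdvantageEstimation gamma lam steps
  -- let rets := returnsFromAdvantages advs (steps.map (·.value))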