# RL Core Proofs
Small structural theorems about TorchLean's pure RL helper functions in `NN.Spec.RL.Core`.
The emphasis here is on shape/structure properties (mostly list lengths and truncation behaviour). These facts are often used to justify that derived quantities (discounted returns, GAE advantages, etc.) align with the rollout data they came from.
These are intentionally modest but useful:
- discounted-return helpers preserve list length (a sketch follows this list),
- GAE preserves list length,
- `returnsFromAdvantages` truncates to the shorter of its two input lists.
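To make the first item concrete, here is a minimal Lean sketch of what a done-aware discounted-return helper can look like. The definition is a hypothetical stand-in written for this page, not TorchLean's actual `NN.Spec.RL.Core` code; it only exhibits the one-output-per-reward structure the lemmas are about.

```lean
/-- Hypothetical sketch of a done-aware discounted-return helper.
`done = true` masks the recursion at an episode boundary; the step past
the last reward is seeded from `bootstrap` (e.g. a value estimate). -/
def discountedReturnsDoneSketch {α : Type} [Zero α] [One α] [Add α] [Mul α]
    (gamma : α) : List α → List Bool → α → List α
  | [], _, _ => []
  | _ :: _, [], _ => []  -- ragged inputs truncate; ruled out by an equal-length hypothesis
  | r :: rs, d :: ds, bootstrap =>
    let rest := discountedReturnsDoneSketch gamma rs ds bootstrap
    -- the "next return" is the head of the tail's returns, or the bootstrap at the end
    let next := rest.headD bootstrap
    (r + gamma * (if d then 0 else 1) * next) :: rest
```

Each cons on the input produces exactly one cons on the output, which is the shape fact the length lemmas below record.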
References:
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Sections 3-6 (returns, discounted backups, TD ideas): http://incompleteideas.net/book/the-book-2nd.html
- Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (2016): https://arxiv.org/abs/1506.02438
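For orientation, these are the standard recursions from the references above, written in the backward form such helpers typically implement. The $(1 - d_t)$ masking is the usual way a done list enters the recursion and is an assumption about the spec, not a quote of it:

$$G_t = r_t + \gamma\,(1 - d_t)\,G_{t+1}$$

with the step past the last reward seeded by the bootstrap value, and for GAE

$$\delta_t = r_t + \gamma\,(1 - d_t)\,V_{t+1} - V_t, \qquad \hat{A}_t = \delta_t + \gamma\lambda\,(1 - d_t)\,\hat{A}_{t+1}.$$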
```lean
theorem Proofs.RL.Core.discountedReturnsDone_length_of_eqLength
    {α : Type} [Zero α] [One α] [Add α] [Mul α]
    (gamma : α) (rewards : List α) (dones : List Bool) (bootstrap : α)
    (h : rewards.length = dones.length) :
    (discountedReturnsDone gamma rewards dones bootstrap).length = rewards.length
```

When the reward and done lists have the same length, `discountedReturnsDone` preserves that length.
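As a usage sketch: given the statement above and standard `List` lemmas (`List.length_zip`, `Nat.min_self`), the lemma lets downstream code zip returns back onto rewards without losing elements. Names such as `discountedReturnsDone` are assumed to be in scope.

```lean
example {α : Type} [Zero α] [One α] [Add α] [Mul α]
    (gamma bootstrap : α) (rewards : List α) (dones : List Bool)
    (h : rewards.length = dones.length) :
    (rewards.zip (discountedReturnsDone gamma rewards dones bootstrap)).length
      = rewards.length := by
  -- `zip` keeps the `min` of the two lengths; the length lemma collapses the `min`.
  rw [List.length_zip,
    Proofs.RL.Core.discountedReturnsDone_length_of_eqLength gamma rewards dones bootstrap h,
    Nat.min_self]
```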
```lean
theorem Proofs.RL.Core.generalizedAdvantageEstimation_length
    {α : Type} [Zero α] [One α] [Add α] [Mul α] [Sub α]
    (gamma lam : α) (steps : List (Spec.RL.AdvantageStep α)) :
    (generalizedAdvantageEstimation gamma lam steps).length = steps.length
```

Generalized advantage estimation returns one advantage per input step.
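Similarly, a sketch of how the GAE length lemma is used: any index that is in bounds for the steps is in bounds for the advantages, so per-step lookups line up (again assuming the statement above and that the names are in scope).

```lean
example {α : Type} [Zero α] [One α] [Add α] [Mul α] [Sub α]
    (gamma lam : α) (steps : List (Spec.RL.AdvantageStep α))
    (i : Nat) (hi : i < steps.length) :
    i < (generalizedAdvantageEstimation gamma lam steps).length := by
  -- Rewrite the advantages' length to the steps' length, then reuse `hi`.
  simpa [Proofs.RL.Core.generalizedAdvantageEstimation_length] using hi
```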