TorchLean

3.3. Runtime Internals and Artifacts🔗

TorchLean's runtime layer is where the specs become an executable program. It does two jobs: it supports demos and training loops inside Lean, and it records computations into trace/graph artifacts that later chapters can inspect and verify.

If you have not yet run a small model all the way through, Training From Scratch is the best first stop; it grounds the runtime layer in something concrete.

The most useful way to read this material is to keep PyTorch's training loop in mind and then ask how TorchLean makes each step explicit.

3.3.1. PyTorch Mental Model🔗

PyTorch's default workflow is roughly:

  1. create parameters and modules,

  2. run a forward pass,

  3. let autograd record a dynamic computation graph for the ops run during the forward pass,

  4. call loss.backward() to accumulate gradients into tensor .grad fields,

  5. call optimizer.step() to mutate the parameters,

  6. optionally zero the gradients and repeat.

That is the mental model we use here, because TorchLean deliberately mirrors the shape of that workflow while changing the semantics of the bookkeeping.
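
As a concrete reference point, here is that workflow in ordinary PyTorch (standard API only); the numbered comments match the steps above.

import torch

model = torch.nn.Linear(2, 1)                        # 1. create parameters and modules
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.randn(16, 2), torch.randn(16, 1)

for _ in range(10):
    pred = model(x)                                  # 2. forward pass; 3. autograd records the graph
    loss = torch.nn.functional.mse_loss(pred, y)
    loss.backward()                                  # 4. accumulate gradients into .grad
    opt.step()                                       # 5. mutate the parameters
    opt.zero_grad()                                  # 6. zero the gradients and repeat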

TorchLean mapping:

  • PyTorch parameters and modules become explicit typed values and parameter bundles.

  • PyTorch's dynamic autograd tape corresponds to NN.Runtime.Autograd.Engine.Core.Tape.

  • PyTorch .grad accumulation becomes explicit gradient results returned by reverse mode.

  • PyTorch optimizer.step() corresponds to train.step, train.stepper, or session level update helpers.

  • PyTorch eager execution corresponds to TorchLean eager mode.

  • PyTorch torch.compile and graph lowering intuition correspond to TorchLean compiled mode and the shared IR.

The key difference is that TorchLean keeps the graph and the shapes visible in the types instead of burying them in mutable runtime objects.

Two more layers are worth keeping in mind for the rest of the runtime chapter:

  • NN.Runtime.Autograd.TorchLean.Module helps cast Float literals into the active runtime scalar and package a scalar-loss model in a way that is easier to instantiate repeatedly.

  • NN.Runtime.Autograd.TorchLean.Session is the explicit training environment: it owns leaves, tapes, RNG state, and backend selection. The workflow feels familiar from PyTorch, but parameters and gradients stay explicit.

Those are the pieces that make TorchLean's runtime feel familiar without hiding the bookkeeping.

3.3.2. Runtime Design Decisions🔗

The runtime layer is deliberately not a clone of PyTorch internals. It borrows the workflow people know, then changes the representation so the computation remains inspectable from Lean.

  1. Gradients are data, not hidden tensor fields. The reverse pass returns gradient values rather than mutating .grad slots behind the scenes.

  2. Tapes are explicit. Eager execution records values, parent ids, and VJP closures in a Lean data structure that widgets and proof bridges can inspect.

  3. Compiled execution is separate from eager debugging. Compiled mode gives a stable SSA/DAG-like artifact for repeated evaluation and proof alignment, while eager mode stays close to step by step debugging.

  4. The verifier uses an IR with operation tags, not arbitrary closures. Runtime closures are great for execution, but verification needs explicit operation tags; that is why NN.IR.Graph exists.

  5. CUDA is a backend choice, not a change in model meaning. Device buffers accelerate supported float32 operations, while the Lean spec and IR semantics remain the reference meaning.

This is the main tradeoff of TorchLean's runtime: it is more explicit than a Python training loop, but that explicitness is what lets examples, widgets, proof statements, and verification tools talk about the same computation.
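
Point 1 above has a close PyTorch-side analogue: the functional API in torch.func (recent PyTorch releases) also returns gradients as values rather than writing them into hidden .grad fields. A minimal sketch, using only standard torch.func calls:

import torch
from torch.func import grad

def loss(w, x, y):
    return ((x @ w - y) ** 2).mean()

w = torch.randn(3)
x, y = torch.randn(8, 3), torch.randn(8)
g = grad(loss)(w, x, y)   # gradient w.r.t. the first argument, returned as a plain tensor
print(g.shape)            # torch.Size([3])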

3.3.2.1. PyTorch Similarity🔗

The easiest way to understand TorchLean autograd is to compare it with a standard PyTorch loop.

PyTorch:

import torch

x = torch.tensor([0.5, -1.2], requires_grad=True)
y = (x * x).mean()
y.backward()
print(x.grad)
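# expected: x.grad == x here, since d/dx mean(x*x) = 2*x/2 = x, i.e. tensor([0.5000, -1.2000])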

TorchLean:

import NN

open Spec
open Tensor
open NN.Tensor
open NN.API

def sumsq : autograd.fn1.Fn (Shape.Vec 2) Shape.scalar :=
  fun x => do
    let y ← nn.functional.square x
    nn.functional.mean y

def demo : IO Unit := do
  let x : Tensor Float (Shape.Vec 2) := tensorND! [2] [0.5, -1.2]
  let g ← autograd.fn1.grad (α := Float) sumsq x
  IO.println s!"grad = {Spec.pretty g}"
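-- Mathematically, the gradient of mean (square x) is x itself (2 * x / 2), matching x.grad in the PyTorch snippet above.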

The model is the same, but TorchLean makes two things explicit:

  • the gradient is returned, not hidden in a mutable field;

  • the tensor shape is checked at the type level.

The runtime does not replace the spec layer. It sits downstream of that denotational layer and reuses the same model definitions.

For device execution, build flags, cuBLAS/cuFFT, deterministic reductions, or the fused FNO spectral-convolution path, continue with GPU and CUDA. The runtime abstraction is separate from the native backend, and that separation is part of the trust model.

For runnable examples that sit close to this runtime layer, see:

For the API files behind those helpers, start from:

3.3.3. What the Runtime Adds🔗

The runtime is where TorchLean stops being only a collection of definitions and theorems and starts behaving like something that can be run and inspected. That matters for two rather different activities:

  • debugging a model, loss, or training loop,

  • and connecting executable artifacts to a soundness theorem.

The key design principle is that the runtime is not a second source of truth. It is a controlled execution layer whose artifacts should continue to align with the spec and IR semantics described elsewhere in the manual.

This is why TorchLean splits the system into layers:

  • the spec layer says what the model means;

  • the runtime layer says how that meaning runs and how gradients are produced;

  • the compiled/IR layer gives a replayable artifact that verification can inspect;

  • the proof layer states theorems about the same underlying object, not a separately exported copy.

That is different from the usual Python stack, where the executed object, the exported object, and the verified object often live in different toolchains.

3.3.4. Two Execution Modes🔗

TorchLean exposes one front-end with two execution backends:

  • Eager: tape recording and reverse-mode backprop in the style familiar from PyTorch.

  • Compiled: record a graph-like artifact and run it as a replayable, optimizer-friendly execution.

Many of the curated demos accept --backend eager|compiled. Schematically, the pipeline from an NN.API program to verification artifacts is:

\text{NN.API program} \;\to\; \{\text{eager tape},\ \text{compiled GraphData}\} \;\to\; \text{NN.IR.Graph} \;\to\; \{\text{IBP},\ \text{CROWN},\ \text{certificates}\}

3.3.5. What "eager vs compiled" means (in concrete terms)🔗

Both backends execute the same public model and loss definitions from the NN.API source tree. The difference lies in the runtime representation of the computation that was executed:

  • Eager builds a Tape of runtime nodes as it executes.

    • Each node stores a value plus a backward closure that knows how to send gradients to parents.

    • This is the closest analogue TorchLean has to everyday PyTorch eager mode.

  • Compiled builds a typed SSA/DAG representation (GraphData) and evaluates it in one shot.

    • Nodes are typed records containing forward, jvp, and vjp.

    • This is the "compile once, run many times" backend.

    • It is also the backend that is easiest to connect to autograd correctness proofs, because the graph representation is typed and structurally recursive.

Typical usage split:

  • eager for stepping through model code, gradients, or training loops.

  • compiled for repeated evaluation speed and a stable graph artifact, within the supported operator set of the compiled backend.

The operational difference shows up by running the same demo twice:

lake env lean --run NN/Examples/Quickstart/SimpleMlpTrain.lean -- --steps 50 --dtype float --backend eager
lake env lean --run NN/Examples/Quickstart/SimpleMlpTrain.lean -- --steps 50 --dtype float --backend compiled

With matching seeds and supported operators, the forward mathematics agree; what changes is mostly how much structure can be inspected step by step:

  • eager is easier to step through,

  • compiled is easier to replay and connect to proof artifacts.
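
For readers mapping this split onto PyTorch habits, the closest everyday analogue is calling a function directly versus through torch.compile, which captures a reusable graph artifact for repeated execution. This is an analogy only, not a claim about how TorchLean's compiled backend is implemented:

import torch

def f(x):
    return torch.relu(x @ x.mT).sum()

f_compiled = torch.compile(f)   # capture the graph once, reuse the compiled artifact
x = torch.randn(8, 8)
print(torch.allclose(f(x), f_compiled(x), atol=1e-5))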

3.3.6. Runtime Contexts And Named Values🔗

Spec tensors are dependently typed by shape (Tensor α s), but realistic training loops need registries:

  • parameter maps ("w1" ↦ weights),

  • gradient maps ("w1" ↦ dL/dw1),

  • "named values" for debugging.

TorchLean therefore uses an existential wrapper:

Runtime.AnyTensor α

which pairs a Shape with a corresponding tensor, allowing heterogeneous registries.

See:

This design preserves the strong typing of the spec layer while still supporting runtime-style tooling.
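
As a rough analogy in plain Python (not TorchLean code; the class name here is made up for illustration), the same idea is to pair every stored tensor with its shape so one registry can hold values of different shapes without losing that information:

from dataclasses import dataclass
import torch

@dataclass
class AnyTensor:          # illustrative stand-in for the existential wrapper
    shape: tuple
    value: torch.Tensor

params = {
    "w1": AnyTensor((4, 3), torch.randn(4, 3)),
    "b1": AnyTensor((4,),   torch.zeros(4)),
}
grads = {name: AnyTensor(p.shape, torch.zeros_like(p.value)) for name, p in params.items()}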

3.3.7. A Small Runtime Walkthrough🔗

For a single training step, the objects move like this:

  1. Build a model and parameter bundle through NN.API.

  2. Run a forward pass with a chosen backend (eager, compiled, or CUDA-backed eager where supported).

  3. Produce a scalar loss.

  4. Run reverse mode to obtain explicit gradients.

  5. Pass parameters and gradients to an optimizer update.

  6. Log metrics and, when needed, inspect the tape or compiled graph.

None of those steps changes what the model means. They only change which runtime artifact is produced and how much structure is available for inspection.

3.3.8. Autograd Tape (Eager Mode)🔗

The eager engine records operations into a Tape:

  • nodes store forward values and a local VJP rule (backward closure),

  • reverse-mode traverses ids backwards and accumulates gradients.

In PyTorch terms, this is the part of the system that most closely resembles the autograd engine behind loss.backward(), except that TorchLean makes both the tape and the gradient flow explicit. There is no hidden .grad mutation on tensor objects; the reverse pass produces the gradients directly.

The tape structure is useful because it mirrors the conceptual graph you already see in PyTorch, but it keeps the graph available for debugging and verification instead of making it an implementation detail.
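
For comparison, PyTorch keeps the corresponding graph implicit behind grad_fn pointers; the snippet below walks it from Python, which is roughly the information #tape_view surfaces directly in Lean.

import torch

x = torch.tensor([0.5, -1.2], requires_grad=True)
y = (x * x).mean()
print(y.grad_fn)                  # the MeanBackward0 node for this output
print(y.grad_fn.next_functions)   # its parent: the MulBackward0 node recorded for x * x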

Core implementation:

3.3.8.1. A Minimal "autograd.grad" Example (Scalar Output)🔗

For readers who know PyTorch, the closest mental model for this section is:

  • y = f(x)

  • dy/dx = torch.autograd.grad(y, x)

In TorchLean, the API facade for "single-input tensor functions" is API.autograd.fn1. Here is a small self-contained example:

import NN

open Spec
open Tensor
open NN.Tensor
open NN.API

-- A scalar-valued function: mean(square x).
def sumsq : autograd.fn1.Fn (Shape.Vec 2) Shape.scalar :=
  fun x => do
    let y ← nn.functional.square x
    nn.functional.mean y

def demo : IO Unit := do
  let x : Tensor Float (Shape.Vec 2) := tensorND! [2] [0.5, -1.2]
  let g ← autograd.fn1.grad (α := Float) sumsq x
  IO.println s!"x    = {Spec.pretty x}"
  IO.println s!"grad = {Spec.pretty g}"

The curated version of this small autodiff tour, with Jacobians, HVPs, detach, and the rest, is:

3.3.8.2. Inspecting Tapes (Debugging Like PyTorch)🔗

The tape is intentionally debuggable by construction:

  • Each runtime node carries:

    • an optional name,

    • parent ids,

    • a stored forward value,

    • a backward closure.

The widgets example includes a tiny tape and its gradients.

In particular:

  • #tape_view t renders the graph (nodes, parents, values).

  • #tape_grads_view t, outId runs scalar backprop and shows which nodes got gradients.

  • #tape_trace_view t, outId shows the reverse traversal step by step, including each node's parent contributions.

An informal correctness statement, capturing the shape of the theorem one would cite, is:

For a tape whose nodes satisfy the local VJP laws, the global reverse pass computes the adjoint of the Fréchet derivative of the tape denotation.

Equivalently, backpropagation recovers the correct gradients when the local VJP laws match the Lean proofs and the tape satisfies the expected structural conditions.

Concrete theorem names behind the informal autograd description:

#check Proofs.Autograd.Algebra.backprop_correct
#check NN.Proofs.Autograd.Runtime.compileAux_ctx_eq_eval
#check NN.Proofs.Autograd.Runtime.backwardDenseFrom_compileAux_eq_backpropAllCtx

The theorem shape is:

\left(\forall\,\text{node} \in G,\;\mathrm{localVJP}(\text{node}) = D(\mathrm{localForward}(\text{node}))^{\ast}\right) \quad\Longrightarrow\quad \mathrm{backprop}(G) = D(\mathrm{denote}(G))^{\ast}

That is why the local op files matter. A wrong VJP for softmax, attention, layer norm, or a recurrent cell would not merely be a local typo; it would break the global reverse mode statement.

Other proof landmarks that explain why TorchLean's runtime and tensor stack are trustworthy:

  • scaled dot product attention backprop: NN.Proofs.Autograd.Tape.Ops.Attention.ScaledDotProduct.backprop_eq_adjoint_fderiv_scaledDotProduct

  • attention weight normalization: NN.Proofs.Models.Attention.Weights.scaledDotProductAttention_weights_row_sum_one

  • NN.Proofs.Autograd.Core.RealCorrectness.* for the real-semantics side of autodiff correctness

  • NN.Proofs.Autograd.Core.SemiringCorrectness.* for the algebraic side of the same proof layer

  • NN.Verification.ProofBackedCertificates.ibpCertificateSoundness

  • NN.Verification.ProofBackedCertificates.crownCertificateSoundness

  • NN.Verification.ProofBackedCertificates.runIBPSoundness

If you want the floating-point enclosure layer specifically, the most relevant theorem surfaces are:

A useful order through the proof layer is:

  • backprop_correct states backend-agnostic reverse-mode correctness for the algebraic tape model.

  • compileAux_ctx_eq_eval links compiled runtime graph data to the corresponding forward evaluator.

  • backwardDenseFrom_compileAux_eq_backpropAllCtx shows that the executable runtime backpropagation matches the tape-level theorem proved in the algebraic layer.

3.3.9. Compiled Graphs🔗

The compiled runtime uses a typed SSA/DAG representation:

Proofs.Autograd.Algebra.GraphData

At this level, each node bundles:

  • forward (compute a value),

  • jvp (forward-mode pushforward),

  • vjp (reverse-mode pullback).

These are precisely the ingredients needed to derive global correctness from local adjointness laws.
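
For intuition about the jvp/vjp pair, PyTorch's functional helpers compute the same two linear maps. This is a PyTorch-side illustration, not TorchLean code:

import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    return x ** 2

x = torch.tensor([0.5, -1.2])
tangent = torch.tensor([1.0, 0.0])
_, push = jvp(f, (x,), (tangent,))              # forward-mode pushforward: J(x) @ tangent
_, pull = vjp(f, x, torch.tensor([1.0, 1.0]))   # reverse-mode pullback:    J(x)^T @ v
print(push, pull)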

The reason to keep this compiled form separate from the eager tape is practical:

  • eager mode is best for debugging and notebook-style iteration;

  • compiled mode is best for a stable, replayable artifact that proof code can reason about;

  • the same model definition can feed both paths, so the proof layer and the execution layer do not diverge.

TorchLean does not rely on a hidden global autograd engine. The current execution mode is part of the API, not an ambient effect.

3.3.9.1. Session / Parameter Similarity🔗

PyTorch often uses mutable modules plus an optimizer object:

opt = torch.optim.SGD(model.parameters(), lr=0.05)

TorchLean keeps the same basic idea, but packages it explicitly in the runtime session/module layer:

-- conceptual shape, not exact syntax:
-- session owns parameters, inputs, tape, and backend choice
-- train.step applies one forward/backward/update step

This is what NN.Runtime.Autograd.TorchLean.Session is doing under the hood: it is the explicit version of the same “model + state + optimizer” workflow, with the backend selection made visible.

API starting points:

3.3.9.2. IR Execution Bridge🔗

For a stable, first-order IR that can also be consumed by verification passes, TorchLean standardizes on NN.IR.Graph, the operation-tagged DAG described in Graphs and IR.

To connect that IR to the compiled runtime backend, TorchLean provides the IR execution compiler.

This compiles an operation-tagged IR graph (NN.IR.Graph) together with an IR payload store (NN.IR.Payload) into an executable SSA graph (ExecGraphData) that can be evaluated repeatedly.

In words:

execGraphOfIR produces a GraphData whose forward evaluation agrees with the IR evaluator (NN.IR.Graph.denoteAll) on the same payload and input.

The proof references behind this bridge are:

  • common lemmas and per op chunks in the IRExec correctness API;

  • the larger proof files that assemble those local facts into the full compiled execution theorem.

3.3.10. Why Runtime and Verification Stay Separate🔗

The eager tape nodes are closures rather than symbolic op tags. That is excellent for execution, but much less suitable for verification passes such as IBP or CROWN, which require an explicit operator vocabulary.

TorchLean therefore keeps a first-order IR with operation tags (NN.IR.Graph) and treats it as the canonical verification target. The compilation and reification path is covered in Graphs and IR.

3.3.11. Autograd Correctness Theorems🔗

TorchLean's autograd correctness is not a single monolithic theorem; it is layered into reusable pieces:

  1. Local correctness of primitive ops. The real autograd correctness API and semiring autograd correctness API state that each primitive VJP/JVP rule is the intended derivative or adjoint rule.

  2. Global reverse-mode correctness over a tape/DAG. The tape algebra soundness API proves that composing locally correct nodes yields a globally correct reverse pass.

  3. Link from the executable runtime tape to the proved tape model. The runtime autograd link API connects the executable tape representation to the proof model.

For a proof tour that parallels familiar PyTorch autograd concepts, start here:

3.3.11.1. What "adjoint of the Fréchet derivative" means here🔗

That phrase can sound more abstract than it really is. In this setting it means:

  • the forward semantics of the tape defines a differentiable map,

  • the reverse pass computes the linear map that sends an output cotangent backward to input cotangents,

  • and that linear map is the adjoint of the derivative of the forward map.

In informal terms, that means:

if the local derivative rules are right, then the whole reverse pass computes the right gradient.

That is the exact theorem shape that makes the loss.backward() mental model rigorous.
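
A quick numeric sanity check of that statement on the PyTorch side: the vector returned by reverse mode is exactly J(x)^T v, the adjoint of the Jacobian applied to the output cotangent.

import torch
from torch.autograd.functional import jacobian, vjp

def f(x):
    return torch.stack([x[0] * x[1], x[0].sin(), (x ** 2).sum()])

x = torch.tensor([0.7, -0.3])
v = torch.tensor([1.0, 2.0, 3.0])          # output cotangent

J = jacobian(f, x)                         # 3x2 Jacobian of f at x
_, pullback = vjp(f, x, v)                 # what the reverse pass actually computes
print(torch.allclose(pullback, J.T @ v))   # True: reverse mode = adjoint of the derivative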

3.3.12. Debugging and Teaching🔗

The runtime layer is where widgets pay off:

  • #tape_view and #tape_grads_view show recorded graphs and gradients.

  • #tape_trace_view explains the reverse pass step by step (upstream cotangents and parent contributions).

  • #runtime_ctx_view shows the value/gradient registries.

The Widgets guide and editor demo show these views in practice:

3.3.13. Training Demos (Where To Look)🔗

TorchLean includes several PyTorch style training tutorials with selectable backends:

3.3.13.1. Simple MLP regression🔗

See Example Walkthroughs for the canonical commands and recommended flags.

3.3.13.2. Simple CNN and minibatch MLP🔗

3.3.13.3. Loader-based demos (CSV/NPY) and CIFAR10 NPY CNN🔗

These typically default to the compiled backend for speed; see each file's header comment for the recommended flags (some use --dtype float32 explicitly).

3.3.14. Optional CUDA / GPU execution🔗

Most of the semantic and verification layer in TorchLean is still pure Lean: specs, NN.IR.Graph, bound propagation, and theorems do not require a GPU. For training and large forward passes, the project can be linked against CUDA native code so float32 work runs on device buffers instead of only on host-side Float arrays.

This section is intentionally short. The full CUDA map is in GPU and CUDA.

3.3.14.1. Building with CUDA🔗

From the repository root, a typical GPU-enabled build is:

lake build -R -K cuda=true

Optional: -K cuda_home=/path/to/cuda if the toolkit is not under /usr/local/cuda (see the root lakefile.lean for the exact contract). The build links libcudart and libcublas and compiles the native CUDA sources in csrc/cuda. Without cuda=true, the same external symbols are provided by CPU stubs so lake build and CPU demos keep working everywhere.

3.3.14.2. What runs on the GPU🔗

The CUDA path implements pieces of the float32 buffer runtime and selected kernels (tensor primitives, conv/pool, BLAS, attention-related kernels, etc., depending on the build). Eager tapes, compiled GraphData, and the operation-tagged IR used by verifiers are still Lean objects: GPU work accelerates the numeric substrate where the runtime lowers to Cuda.Buffer and friends.

Mental model: think of GPU mode as "fast Float tensor buffers + kernels", not "the proofs run on the GPU." Certificate checks and NN.IR denotations are still reasoned about in Lean.

3.3.14.3. CLI device flags (examples and torchlean)🔗

Many curated demos accept --cpu or --cuda (and forward them through the torchlean runner). After a CUDA build, try for example:

lake exe torchlean mlp --cuda --steps 20
lake exe torchlean gpt2 --cuda --steps 1
lake exe torchlean mamba --cuda --tiny-shakespeare --steps 25

Most curated commands print their flags through the torchlean runner, and Example Walkthroughs keeps the common recipes together. If a run fails, first check whether the failure is "CUDA not built" (rebuild with -K cuda=true) or an unsupported op on the compiled path for that demo.

3.3.14.4. Determinism and environment🔗

Some CUDA reductions use atomicAdd, which can be order-dependent. TorchLean exposes a deterministic reductions mode (tunable from Lean and with environment-variable parity in the native layer) to trade some performance for reproducible accumulations. This matters when comparing training curves run to run.
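
For comparison, PyTorch's analogous switch is torch.use_deterministic_algorithms, which trades speed for order-independent results where deterministic kernels exist; TorchLean's deterministic-reductions mode plays the same role here.

import torch

# Prefer deterministic kernels; ops without a deterministic implementation
# will raise an error instead of silently using a nondeterministic one.
torch.use_deterministic_algorithms(True)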

The authoritative list of trust boundary declarations (opaque types, external symbols, and oracle interfaces) is NN.Trust in the library.

3.3.15. Runtime Notes🔗

3.3.15.1. Performance expectations🔗

TorchLean is not a replacement for PyTorch on raw throughput, but the optional GPU backend narrows the gap for float32 tensor work while keeping Lean in the loop for semantics and inspection.

  • The runtime is designed for semantic clarity and debuggability first.

  • The compiled CPU backend is still the main path for graph-shaped execution without CUDA.

  • With CUDA, heavy elementwise, conv, and matmul work can use native kernels; Lean remains the orchestration and typing layer for supported programs.

Rules of thumb:

  • Eager is perfect for stepping through small examples and inspecting grads.

  • Compiled is better for "many steps over the same graph" (training loops, repeated eval).

  • For maximum-throughput production training, many teams will still train in PyTorch and use TorchLean for the verification, round-trip, and IR workflows (PyTorch Round-Trip, Two-Stage Workflows). GPU support in TorchLean is there to make Lean-side research demos and experiments feasible, not to compete with PyTorch on every benchmark.

3.3.16. Common Pitfalls🔗

  • Do not assume eager and compiled backends are interchangeable in every corner case. They aim to agree on supported programs, but they expose different debugging surfaces.

  • Do not treat runtime tensors as proof objects. Runtime values are for execution and inspection; proof statements live in the spec/proof layers.

  • Do not forget that heavy correctness proofs can be slow to elaborate. For everyday work, the executable checkers and widgets are usually the faster feedback loop.

To read the runtime layer in dependency order, start with eager tensors and tapes, then compiled graph construction, then the IR execution bridge, and finally the curated model and quickstart examples. That sequence usually preserves the runtime mental model better than jumping straight into the heaviest correctness proofs.

3.3.17. References🔗

  • TorchLean paper (runtime + verification architecture): https://arxiv.org/abs/2602.22631

  • Baydin et al., "Automatic differentiation in machine learning: a survey" (JMLR 2018): https://arxiv.org/abs/1502.05767

  • Griewank and Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, the standard reference for local derivative rules composing into global reverse-mode correctness.

  • PyTorch autograd docs (mental model + terminology): https://pytorch.org/docs/stable/autograd.html

  • PyTorch autograd tutorial (gentle walkthrough): https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html

  • PyTorch FX docs (graph capture mental model): https://pytorch.org/docs/stable/fx.html

  • Runtime widgets: autograd widgets and runtime context widgets.

3.3.17.1. Large Runtime Proofs🔗

Some correctness proofs are large because they connect many operation families at once. The guide separates them from the everyday runtime tutorial, but they remain part of the formal account of the compiled execution path.

The main example is the complete forward correctness development for the IR to compiled bridge, covered by the IRExec correctness API.

For everyday runtime inspection, the faster feedback loop is usually:

  • the executable checkers (Graph.checkWellFormed, Graph.checkShapes, Graph.checkInferredShapes),

  • the widgets (#ir_exec_trace_view, #shape_infer_view),

  • and the smaller correctness lemmas around each supported operation family.

Related manual pages: Widgets (debugging UI), Graphs and IR (shared IR and proved forward fragment), Verification (IBP/CROWN and certificates). Export-focused material continues in PyTorch Round-Trip.