TorchLean API

NN.Examples.Models.Sequence.GptAdder

minGPT-Style Addition Demo

This file is a TorchLean-native version of the spirit of Karpathy's minGPT/projects/adder experiment. The original minGPT adder trains a compact GPT to complete digit strings of the form

digits(a) ++ digits(b) ++ reverseDigits(a+b).

For example, in the one-digit setting 8 + 7 = 15 is represented as the digit sequence 8 7 5 1. At inference time the model sees 8 7 and greedily generates the two result digits 5 1, which we reverse back to 15.
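As a sketch of this encoding (the helper names here are illustrative, not this module's actual definitions):

```lean
-- Sketch of the minGPT adder encoding; `digitsMSB` and `renderExample`
-- are illustrative names, not this module's actual definitions.
def digitsMSB (width n : Nat) : List Nat :=
  (List.range width).reverse.map fun i => (n / 10 ^ i) % 10

def renderExample (ndigit a b : Nat) : List Nat :=
  digitsMSB ndigit a ++ digitsMSB ndigit b ++
    (digitsMSB (ndigit + 1) (a + b)).reverse

#eval renderExample 1 8 7  -- [8, 7, 5, 1]
```

The sum gets one extra digit of width (`ndigit + 1`) to leave room for the carry, and is stored least-significant digit first.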

This is intentionally not a text chatbot. It is a controlled walkthrough of a single question: does the CUDA GPT training loop actually learn an algorithmic task?

Performance note: this uses the eager CUDA runtime, not a persistent CUDA graph. The heavy tensor operations run on the GPU, including fused attention when --fast-kernels is on, but each step still records a fresh autograd tape and synchronizes parameter refs through the scalar-trainer API. This is the correctness-facing example; full PyTorch-style throughput requires persistent device parameters plus compiled/fused graph execution.

The GPT-shaped architecture is constructed using the shared API helper in NN.API.Models.Gpt2 (nn.models.causalTransformerOneHot), so this file can stay focused on the adder task mechanics.

Reference: https://github.com/karpathy/minGPT/tree/master/projects/adder.

Runner subcommand name.

    Number of input digits per operand.

    We start with the one-digit curriculum because it is small enough to train quickly in the eager CUDA runtime while still including carry examples such as 8 + 7 = 15.

      Digit-only vocabulary, matching minGPT's adder task (0..9).

        Full one-digit table batch size.

        This is intentionally 100, not 1: very small scalar workloads underutilize GPUs. In all-pairs mode one optimizer step sees every one-digit addition problem, and evaluation completes the whole table with two batched greedy forward passes.
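The all-pairs table can be sketched as follows (`allPairs` is an illustrative name, not this module's definition):

```lean
-- All 100 one-digit addition problems in row-major order
-- (`a` varies slowest). Illustrative, not the module's definition.
def allPairs : List (Nat × Nat) :=
  (List.range 10).bind fun a => (List.range 10).map fun b => (a, b)

#eval allPairs.length  -- 100
```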

          Karpathy's adder uses a held-out split; for one digit this is 80 train / 20 test.

            Held-out one-digit examples when --train-split is enabled.

              GPT block size: a, b, and all but the final reversed result digit.

              For ndigit = 1, full rendered examples have length 1 + 1 + 2 = 4; model inputs have length 3, exactly as in minGPT's get_block_size = 3 * ndigit + 1 - 1.
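The same arithmetic, transcribed directly (the function name is illustrative):

```lean
-- minGPT's get_block_size: rendered length (3 * ndigit + 1)
-- minus the final target digit. For ndigit = 1 this is 3.
def blockSize (ndigit : Nat) : Nat := 3 * ndigit + 1 - 1

#eval blockSize 1  -- 3
```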

                Number of attention heads.

                Karpathy's minGPT default for the adder is gpt-nano (3 heads, width 48). TorchLean's eager CUDA trainer is tape-based, so we use a middle-sized model that is substantially larger than the original compact setup (1,050 params) while keeping torchlean gpt_adder practical to run.

                  Transformer embedding width.

                    Feed-forward hidden width (4 * dModel, matching the common GPT MLP ratio).

                      Number of Transformer blocks.

                        Number of positions per row that contribute to the minGPT adder loss.

                          Number of non-ignored next-token targets in the training batch.

                          The adder loss below masks ignored prefix positions to all-zero targets, then computes summed one-hot cross-entropy divided by this count. That matches minGPT's ignore_index=-1 normalization: average over active next-token labels, not over every (batch, position, vocab) entry.

                            Count scalar entries across a list of parameter shapes.

                                  Compact GPT-style causal Transformer for digit addition.

                                    Number of trainable scalar parameters in the current compile-time adder model.

                                      Number of parameter tensors in the current compile-time adder model.

                                        Cross-entropy summed over non-ignored adder targets, normalized like minGPT ignore_index.

                                          Adder-specific scalar loss.

                                          Ignored prefix positions are encoded by all-zero one-hot rows (maskAdderTargets), so they contribute exactly zero to one-hot cross entropy. We divide the summed loss by the number of active target positions, matching minGPT's ignore_index-style normalization rather than averaging over ignored prefix rows.

                                            Render n as exactly width base-10 digits, most-significant first.

                                              minGPT adder rendering.

                                              For ndigit = 1, a = 8, b = 7 becomes [8, 7, 5, 1], i.e. the sum 15 is stored reversed as 5, 1. Reversing the output digits makes carry propagation local in left-to-right generation.

                                                Karpathy/minGPT masks the loss on the operand-prefix positions.

                                                In projects/adder/adder.py, the target vector y is shifted by one token and then y[:ndigit*2-1] = -1, where -1 is PyTorch's "ignore index" for cross entropy. TorchLean's current one-hot cross entropy does not have an ignore-index target, so we represent the same idea by using an all-zero one-hot vector on ignored positions. Because the loss is -sum(y * log p), these positions contribute exactly zero gradient.
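In symbols (a restatement of the same idea, not new behavior): with shifted one-hot target rows y, masked to all-zero on the operand prefix, and softmax probabilities p, the loss is

```latex
\mathcal{L}
  = -\frac{1}{N_{\text{active}}}
    \sum_{r}\sum_{t}\sum_{v} y_{r,t,v}\,\log p_{r,t,v}
```

Rows with y_{r,t} = 0 contribute zero to both the sum and its gradient, and N_active counts only the non-ignored next-token targets, which reproduces PyTorch's ignore_index = -1 averaging.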

                                                  Apply the minGPT adder loss mask to a shifted one-hot target matrix.

                                                    Build one supervised next-digit sample from an addition problem.

                                                      Deterministic exhaustive one-digit dataset order.

                                                        Training row assignment. In split mode, rows repeat the first 80 train examples.

                                                          Parse a+b into a probe pair; returns none for non-one-digit prompts.

                                                            Comma-separated list of one-digit a+b probes.

                                                              Build a batched supervised sample with one row per one-digit addition problem.

                                                                Decode reversed generated result digits back into a natural number.
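Because the result digits are stored least-significant first, decoding is a right fold (the function name is illustrative, not this module's definition):

```lean
-- Decode least-significant-first digits back into a number.
-- `decodeReversed` is an illustrative name.
def decodeReversed (digits : List Nat) : Nat :=
  digits.foldr (fun d acc => acc * 10 + d) 0

#eval decodeReversed [5, 1]  -- 15
```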

                                                                  Argmax token id at sequence position pos.

                                                                    Argmax token id at a sequence position for a chosen batch row.

                                                                      Build a model input tensor from the current generated digit prefix.

                                                                        Build a batched model input from one digit prefix per row.

                                                                          Run a model forward through the eager runtime and return logits.

                                                                          We keep this helper local instead of importing the larger GPT-2 example module. That keeps the adder executable focused on the task-specific data and evaluation path.

                                                                            Greedily complete ndigit + 1 result digits from the operand digits.

                                                                            The key detail is that when the current prefix has length k, the next-token prediction lives at position k - 1, not always at the final padded position.
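A shape-level sketch of that loop, where `argmaxAt prefix pos` stands in for this module's real forward-and-argmax helpers:

```lean
-- Greedy completion sketch. `argmaxAt prompt (prompt.length - 1)`
-- reads the next-token prediction at the last real (unpadded)
-- position, not at the final padded position.
def greedyComplete (argmaxAt : List Nat → Nat → Nat) :
    List Nat → Nat → List Nat
  | prompt, 0 => prompt
  | prompt, steps + 1 =>
      let next := argmaxAt prompt (prompt.length - 1)
      greedyComplete argmaxAt (prompt ++ [next]) steps
```

For the one-digit adder, `steps` is `ndigit + 1 = 2`, so the loop appends the ones digit and then the carry/tens digit.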

                                                                              Predict a + b by greedy decoding and reversing the minGPT result digits.

• trainCorrect :
• testCorrect :
• allCorrect :

                                                                                  Evaluate all 100 additions with batched greedy decoding.

                                                                                  For one-digit operands, generation needs two result digits. We first predict the ones digit from rows [a,b], append it, and then predict the carry/tens digit from rows [a,b,pred₀].

                                                                                    Print one addition probe in the same digit convention used for training.

                                                                                      Optimizer choice for this addition run.

                                                                                        Local options for the adder runner.

• steps : ℕ

                                                                                          Number of optimizer steps.

• logEvery : ℕ

                                                                                          Print loss every logEvery steps.

                                                                                        • logPath : System.FilePath

                                                                                          JSON training log artifact path.

                                                                                        • optim : OptimKind

                                                                                          Optimizer.

                                                                                          adamw is closest to minGPT's adder recipe. adam and sgd are kept for debugging and comparisons.

                                                                                        • lr : Float

                                                                                          Learning rate.

• a : ℕ

                                                                                          Probe operand a.

• b : ℕ

                                                                                          Probe operand b.

• probes : List (ℕ × ℕ)

                                                                                          Extra comma-separated prompt probes, e.g. 0+0,4+5,9+9.

                                                                                        • trainSplit : Bool

                                                                                          Train on an 80/20 train/test split instead of all 100 one-digit additions.

                                                                                        • overfitProbe : Bool

                                                                                          Train only the probe pair, useful for verifying the CUDA GPT can overfit one addition.

                                                                                        • interactive : Bool

                                                                                          Keep the trained CUDA model alive and read a+b prompts from stdin.

                                                                                          Force CUDA and fused kernels, because this example is meant to exercise the GPU path.

                                                                                            Train the minGPT-style adder from scratch and report exact addition accuracy.
