minGPT-Style Addition Demo
This file is a TorchLean-native version, in spirit, of Karpathy's minGPT `projects/adder` experiment. The original minGPT adder trains a compact GPT to complete digit strings of the form `digits(a) ++ digits(b) ++ reverseDigits(a+b)`.
For example, in the one-digit setting `8 + 7 = 15` is represented as the digit sequence `8 7 5 1`. At inference time the model sees `8 7` and greedily generates the two result digits `5 1`, which we reverse back to `15`.
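To make the convention concrete, here is a minimal standalone Lean sketch of the encoding; `digitsMSB` and `encodeOneDigit` are illustrative names, not this file's actual helpers.

```lean
-- Most-significant-first base-10 digits of `n`, padded to `width` digits.
def digitsMSB (width n : Nat) : List Nat :=
  (List.range width).reverse.map fun i => n / 10 ^ i % 10

-- Encode `a + b` as digits(a) ++ digits(b) ++ reverse (digits (a + b)).
def encodeOneDigit (a b : Nat) : List Nat :=
  digitsMSB 1 a ++ digitsMSB 1 b ++ (digitsMSB 2 (a + b)).reverse

#eval encodeOneDigit 8 7  -- [8, 7, 5, 1]
```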
This is intentionally not a text chatbot. It is a controlled "does the CUDA GPT training loop actually learn an algorithmic task?" walkthrough:
- synthetic data is generated in Lean,
- the model is a GPT-style causal Transformer built from TorchLean layers,
- training is CUDA-only by default,
- optimizer choices are deliberately close to minGPT (`adamw`, `adam`, or `sgd`),
- evaluation greedily completes every one-digit addition problem.
Performance note: this uses the eager CUDA runtime, not a persistent CUDA graph. The heavy tensor operations run on the GPU, including fused attention when `--fast-kernels` is on, but each step still records a fresh autograd tape and synchronizes parameter refs through the scalar-trainer API. This is the correctness-facing example; full PyTorch-style throughput requires persistent device parameters plus compiled/fused graph execution.
The GPT-shaped architecture is constructed using the shared API helper in `NN.API.Models.Gpt2` (`nn.models.causalTransformerOneHot`), so this file can stay focused on the adder task mechanics.
Reference: https://github.com/karpathy/minGPT/tree/master/projects/adder.
Number of input digits per operand.
We start with the one-digit curriculum because it is small enough to train quickly in the
eager CUDA runtime while still including carry examples such as 8 + 7 = 15.
Digit-only vocabulary, matching minGPT's adder task (0..9).
Full one-digit table batch size.
This is intentionally 100, not 1: very small scalar workloads underutilize GPUs. In all-pairs mode one
optimizer step sees every one-digit addition problem, and evaluation completes the whole table with
two batched greedy forward passes.
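As a quick sanity check on that count, the all-pairs table can be enumerated in a few lines of plain Lean (the name `allOneDigitPairs` is illustrative, not this file's API):

```lean
-- Every one-digit addition problem, one (a, b) pair per batch row.
def allOneDigitPairs : List (Nat × Nat) :=
  (List.range 10).flatMap fun a => (List.range 10).map fun b => (a, b)

#eval allOneDigitPairs.length  -- 100
```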
Karpathy's adder uses a held-out split; for one digit this is 80 train / 20 test.
Held-out one-digit examples when `--train-split` is enabled.
Number of attention heads.
Karpathy's minGPT default for the adder is gpt-nano (3 heads, width 48). TorchLean's eager
CUDA trainer is tape-based, so we use a mid-sized model that is substantially larger than the original compact setup (1,050 params) while keeping `torchlean gpt_adder` practical to run.
Feed-forward hidden width (`4 * dModel`, matching the common GPT MLP ratio).
Number of positions per row that contribute to the minGPT adder loss.
Number of non-ignored next-token targets in the training batch.
The adder loss below masks ignored prefix positions to all-zero targets, then computes summed
one-hot cross-entropy divided by this count. That matches minGPT's `ignore_index=-1` normalization: average over active next-token labels, not over every (batch, position, vocab) entry.
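A worked version of that count, assuming the row layout `digits(a) ++ digits(b) ++ reversed-sum` described earlier (so a row has `3*ndigit + 1` tokens and `3*ndigit` shifted targets):

```lean
-- Of the 3*ndigit shifted targets per row, the first 2*ndigit - 1 are
-- masked, leaving ndigit + 1 active targets.
def activeTargetsPerRow (ndigit : Nat) : Nat :=
  3 * ndigit - (2 * ndigit - 1)

#eval activeTargetsPerRow 1        -- 2 active targets per row
#eval 100 * activeTargetsPerRow 1  -- 200 across the 100-row all-pairs batch
```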
Count scalar entries across a list of parameter shapes.
Compact GPT-style causal Transformer for digit addition.
Number of trainable scalar parameters in the current compile-time adder model.
Number of parameter tensors in the current compile-time adder model.
Cross-entropy summed over non-ignored adder targets, normalized like minGPT `ignore_index`.
Adder-specific scalar loss.
Ignored prefix positions are encoded by all-zero one-hot rows (`maskAdderTargets`), so they contribute exactly zero to one-hot cross entropy. We divide the summed loss by the number of active target positions, matching minGPT's `ignore_index`-style normalization rather than averaging over ignored prefix rows.
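A list-level sketch of this normalization, away from TorchLean tensors (the function names here are hypothetical, not this file's API):

```lean
-- -sum(y * log p) for one row; an all-zero target row contributes exactly 0.
def rowLoss (target logProb : List Float) : Float :=
  -((List.zipWith (· * ·) target logProb).foldl (· + ·) 0.0)

-- Summed loss over all rows, divided by the number of *active* targets,
-- mirroring minGPT's ignore_index averaging.
def maskedCrossEntropy (targets logProbs : List (List Float))
    (activeTargets : Nat) : Float :=
  ((List.zipWith rowLoss targets logProbs).foldl (· + ·) 0.0)
    / activeTargets.toFloat
```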
Render `n` as exactly `width` base-10 digits, most-significant first.
Karpathy/minGPT masks the loss on the operand-prefix positions.
In `projects/adder/adder.py`, the target vector `y` is shifted by one token and then `y[:ndigit*2-1] = -1`, where `-1` is PyTorch's "ignore index" for cross entropy. TorchLean's current one-hot cross entropy does not have an ignore-index target, so we represent the same idea by using an all-zero one-hot vector on ignored positions. Because the loss is `-sum(y * log p)`, these positions contribute exactly zero gradient.
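A list-level sketch of that masking idea; the real `maskAdderTargets` works on tensors, so this `List (List Float)` version is illustration only:

```lean
-- Zero out the first 2*ndigit - 1 shifted-target rows so they contribute
-- exactly 0 to -sum(y * log p).
def maskTargetRows (ndigit : Nat) (targets : List (List Float)) :
    List (List Float) :=
  let ignored := 2 * ndigit - 1
  (targets.take ignored).map (fun row => row.map fun _ => 0.0)
    ++ targets.drop ignored
```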
Apply the minGPT adder loss mask to a shifted one-hot target matrix.
Build one supervised next-digit sample from an addition problem.
Build a batched supervised sample with one row per one-digit addition problem.
Decode reversed generated result digits back into a natural number.
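In plain Lean, that decoding step might look like this (illustrative name):

```lean
-- Reversed result digits back to a number: [5, 1] decodes to 15.
def decodeReversed (digits : List Nat) : Nat :=
  digits.reverse.foldl (fun acc d => acc * 10 + d) 0

#eval decodeReversed [5, 1]  -- 15
```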
Argmax token id at sequence position `pos`.
Argmax token id at a sequence position for a chosen batch row.
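A plain-Lean stand-in for these argmax helpers, over one row of logits rather than a tensor (the list-based signature is an assumption for illustration):

```lean
-- Index of the largest logit in a row; ties keep the earliest index.
def argmaxRow (row : List Float) : Nat :=
  let rec go (xs : List Float) (i besti : Nat) (bestx : Float) : Nat :=
    match xs with
    | [] => besti
    | x :: rest =>
      if x > bestx then go rest (i + 1) i x else go rest (i + 1) besti bestx
  match row with
  | [] => 0
  | x :: rest => go rest 1 0 x
```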
Build a model input tensor from the current generated digit prefix.
Build a batched model input from one digit prefix per row.
Run a model forward through the eager runtime and return logits.
We keep this helper local instead of importing the larger GPT-2 example module. That keeps the adder executable focused on the task-specific data and evaluation path.
Greedily complete `ndigit + 1` result digits from the operand digits.
The key detail is that when the current prefix has length `k`, the next-token prediction lives at position `k - 1`, not always at the final padded position.
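A sketch of that loop under assumed helper shapes; `forwardLogits` and `argmaxAt` stand in for this file's actual forward and argmax helpers:

```lean
-- Greedy completion: with a prefix of length k, read the next token from the
-- logits at position k - 1, then append it and repeat.
def greedyComplete
    (forwardLogits : List Nat → IO (List (List Float)))
    (argmaxAt : List (List Float) → Nat → Nat)
    (seed : List Nat) (resultDigits : Nat) : IO (List Nat) := do
  let mut toks := seed
  for _ in [0:resultDigits] do
    let logits ← forwardLogits toks
    toks := toks ++ [argmaxAt logits (toks.length - 1)]
  return toks
```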
Predict `a + b` by greedy decoding and reversing the minGPT result digits.
Evaluate all 100 one-digit additions.
Evaluate all 100 additions with batched greedy decoding.
For one-digit operands, generation needs two result digits. We first predict the ones digit from rows `[a, b]`, append it, and then predict the carry/tens digit from rows `[a, b, pred₀]`.
Print one addition probe in the same digit convention used for training.
Local options for the adder runner (see the structure sketch below).
- `steps : ℕ`
  Number of optimizer steps.
- `logEvery : ℕ`
  Print loss every `logEvery` steps.
- `logPath : System.FilePath`
  JSON training log artifact path.
- `optim : OptimKind`
  Optimizer choice (`adamw`, `adam`, or `sgd`).
- `lr : Float`
  Learning rate.
- `a : ℕ`
  Probe operand `a`.
- `b : ℕ`
  Probe operand `b`.
- Extra comma-separated prompt probes, e.g. `0+0,4+5,9+9`.
- `trainSplit : Bool`
  Train on an 80/20 train/test split instead of all 100 one-digit additions.
- `overfitProbe : Bool`
  Train only the probe pair, useful for verifying the CUDA GPT can overfit one addition.
- `interactive : Bool`
  Keep the trained CUDA model alive and read `a+b` prompts from stdin.
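A structure-shaped sketch of those options; `AdderOptions` is a hypothetical name, `OptimKind` is reconstructed from the optimizer list above, and the probes field is omitted because its actual name is not shown in this documentation:

```lean
inductive OptimKind
  | adamw | adam | sgd

structure AdderOptions where
  steps : Nat                -- optimizer steps
  logEvery : Nat             -- loss-print interval
  logPath : System.FilePath  -- JSON training log artifact path
  optim : OptimKind          -- adamw / adam / sgd
  lr : Float                 -- learning rate
  a : Nat                    -- probe operand a
  b : Nat                    -- probe operand b
  trainSplit : Bool          -- 80/20 split instead of all 100 problems
  overfitProbe : Bool        -- train only the probe pair
  interactive : Bool         -- REPL over the trained model
```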
Parse adder-specific CLI options.
Simple terminal REPL for the trained CUDA model.
Train the minGPT-style adder from scratch and report exact addition accuracy.