minGPT-Style Addition Example
This is a TorchLean-native example in the spirit of Karpathy's minGPT/projects/adder
experiment. The original minGPT adder trains a compact GPT to complete digit strings of the form
digits(a) ++ digits(b) ++ reverseDigits(a+b).
For example, in the one-digit setting 8 + 7 = 15 is represented as the digit sequence
8 7 5 1. At inference time the model sees 8 7 and greedily generates the two result digits
5 1, which we reverse back to 15.
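The encoding above can be sketched in a few lines of plain Python. This is only an illustration of the sequence layout, not TorchLean code; the helper names `render_digits` and `encode_addition` are hypothetical.

```python
def render_digits(n: int, width: int) -> list[int]:
    """Render n as exactly `width` base-10 digits, most-significant first."""
    if n < 0 or n >= 10 ** width:
        raise ValueError(f"{n} does not fit in {width} digits")
    return [int(ch) for ch in str(n).zfill(width)]


def encode_addition(a: int, b: int, ndigit: int = 1) -> list[int]:
    """digits(a) ++ digits(b) ++ reverseDigits(a + b), as a flat token list."""
    result = render_digits(a + b, ndigit + 1)  # the sum needs one extra digit
    return render_digits(a, ndigit) + render_digits(b, ndigit) + result[::-1]


print(encode_addition(8, 7))  # [8, 7, 5, 1]
```

Storing the result least-significant digit first lets the model emit the ones digit before it has to commit to the carry.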
This is not a text chatbot. It is a controlled algorithmic sequence task for the CUDA GPT training loop:
- synthetic data is generated in Lean,
- the model is a GPT-style causal Transformer built from TorchLean layers,
- training is CUDA-only by default,
- optimizer choices follow the minGPT-style setup (adamw, adam, or sgd),
- evaluation greedily completes every one-digit addition problem.
Performance note: this uses the eager CUDA runtime, not a persistent CUDA graph.
The heavy tensor operations run on the GPU, including fused attention when --fast-kernels is on,
but each step still records a fresh autograd tape and synchronizes parameter refs through the
current scalar training bridge. This is the correctness-facing example; full PyTorch-style
throughput requires persistent device parameters plus compiled/fused graph execution.
The GPT-shaped architecture is constructed through the public TorchLean model constructor
nn.models.CausalTransformerOneHot, so the example can stay focused on the adder task mechanics.
Reference: https://github.com/karpathy/minGPT/tree/master/projects/adder.
CLI subcommand label used by the shared model runner.
Default JSON loss-curve path for this command.
Number of input digits per operand.
We start with the one-digit curriculum because it trains directly in the eager CUDA runtime while
still including carry examples such as 8 + 7 = 15.
Digit-only vocabulary, matching minGPT's adder task (0..9).
Full one-digit table batch size.
This is 100, not 1: scalar-sized GPU workloads underutilize the device. In all-pairs mode one
optimizer step sees every one-digit addition problem, and evaluation completes the whole table with
two batched greedy forward passes.
Karpathy's adder uses a held-out split; for one digit this is 80 train / 20 test.
Held-out one-digit examples when --train-split is enabled.
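A minimal sketch of how the 100-problem table could be split 80/20. The seeded shuffle and the function name `split_one_digit_table` are assumptions for illustration; the runner's actual permutation may differ.

```python
import random


def split_one_digit_table(test_fraction: float = 0.2, seed: int = 1337):
    """Return (train, test) lists of (a, b) pairs covering all one-digit problems."""
    pairs = [(a, b) for a in range(10) for b in range(10)]  # all 100 problems
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    rng.shuffle(pairs)
    num_test = int(len(pairs) * test_fraction)
    return pairs[num_test:], pairs[:num_test]


train, test = split_one_digit_table()
print(len(train), len(test))  # 80 20
```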
Number of attention heads.
Karpathy's minGPT default for the adder is gpt-nano (3 heads, width 48). TorchLean's eager
CUDA trainer is tape-based, so we use a middle-sized model that is substantially larger than the
original compact setup (1,050 params) while keeping torchlean gpt_adder practical to run.
Feed-forward hidden width (4 * dModel, matching the common GPT MLP ratio).
Number of positions per row that contribute to the minGPT adder loss.
Number of non-ignored next-token targets in the training batch.
The adder loss below masks ignored prefix positions to all-zero targets, then computes summed
one-hot cross-entropy divided by this count. That matches minGPT's ignore_index=-1 normalization:
average over active next-token labels, not over every (batch, position, vocab) entry.
Count scalar entries across a list of parameter shapes.
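As a rough illustration, counting scalar entries amounts to summing the product of each shape's dimensions. The shapes below are made up for the example and do not describe the actual model.

```python
from math import prod


def count_params(shapes):
    """Sum the number of scalar entries over a list of tensor shapes."""
    return sum(prod(shape) for shape in shapes)


# Hypothetical shapes for one linear layer: a 48x10 weight and a length-10 bias.
print(count_params([(48, 10), (10,)]))  # 490
```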
GPT configuration shared by the typed shapes and model constructor.
Input shape: batched one-hot digit sequences.
Output shape: one digit-logit row per input position.
Compact GPT-style causal Transformer for digit addition.
Cross-entropy summed over non-ignored adder targets, normalized like minGPT ignore_index.
Adder-specific scalar loss.
Ignored prefix positions are encoded by all-zero one-hot rows (maskAdderTargets), so they
contribute exactly zero to one-hot cross entropy. We divide the summed loss by the number of
active target positions, matching minGPT's ignore_index-style normalization rather than averaging
over ignored prefix rows.
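The normalization described above can be sketched in plain Python. This is a reference-style illustration rather than the TorchLean implementation; `adder_loss` and `log_softmax` are hypothetical names, and rows are plain lists of floats.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def adder_loss(logits_rows, target_rows):
    """Summed one-hot cross-entropy divided by the number of active target rows."""
    total = 0.0
    active = 0
    for logits, target in zip(logits_rows, target_rows):
        log_p = log_softmax(logits)
        # An all-zero target row adds exactly 0 here, so it acts as "ignored".
        total -= sum(t * lp for t, lp in zip(target, log_p))
        active += 1 if any(target) else 0
    if active == 0:
        raise ValueError("no active target positions")
    return total / active


vocab = 10
uniform = [0.0] * vocab
ignored = [0.0] * vocab
digit5 = [1.0 if v == 5 else 0.0 for v in range(vocab)]
digit1 = [1.0 if v == 1 else 0.0 for v in range(vocab)]
loss = adder_loss([uniform] * 3, [ignored, digit5, digit1])
print(abs(loss - math.log(vocab)) < 1e-12)  # True
```

With uniform logits each active row costs log(10), and dividing by the two active rows (not all three) gives log(10) back, which is the ignore_index-style average.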
Render n as exactly width base-10 digits, most-significant first.
Karpathy/minGPT masks the loss on the operand-prefix positions.
In projects/adder/adder.py, the target vector y is shifted by one token and then
y[:ndigit*2-1] = -1, where -1 is PyTorch's "ignore index" for cross entropy. TorchLean's
current one-hot cross entropy does not have an ignore-index target, so we represent the same idea by
using an all-zero one-hot vector on ignored positions. Because the loss is -sum(y * log p), these
positions contribute exactly zero gradient.
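A small sketch of the masking step, following the y[:ndigit*2-1] = -1 rule quoted above and then converting to one-hot rows. The helper names are hypothetical, and -1 is used only as an intermediate marker, as in the Python original.

```python
def shifted_targets(tokens, ndigit=1):
    """Next-token targets with the operand-prefix positions marked as ignored (-1)."""
    y = list(tokens[1:])  # shift by one token
    for i in range(ndigit * 2 - 1):
        y[i] = -1
    return y


def one_hot_targets(y, vocab=10):
    """One-hot rows; an ignored (-1) target matches no digit, giving an all-zero row."""
    return [[1.0 if tok == v else 0.0 for v in range(vocab)] for tok in y]


y = shifted_targets([8, 7, 5, 1])
print(y)  # [-1, 5, 1]
rows = one_hot_targets(y)
print([sum(row) for row in rows])  # [0.0, 1.0, 1.0]
```

For one-digit operands this leaves ndigit + 1 = 2 active positions per row, one for each result digit.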
Apply the minGPT adder loss mask to a shifted one-hot target matrix.
Build one unbatched one-hot causal-LM sample for an addition row, then apply the minGPT-style ignored-prefix mask to its target matrix.
Build one supervised next-digit sample from an addition problem.
Build a batched supervised sample with one row per one-digit addition problem.
Decode reversed generated result digits back into a natural number.
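For illustration, decoding is a base-10 fold over the digits read from last to first; the function name here is hypothetical.

```python
def decode_reversed_digits(digits):
    """Turn least-significant-first digits, e.g. [5, 1], back into a number."""
    value = 0
    for d in reversed(digits):  # walk from the most-significant digit
        value = value * 10 + d
    return value


print(decode_reversed_digits([5, 1]))  # 15
```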
Argmax token id at sequence position pos.
Argmax token id at a sequence position for a chosen batch row.
Build a model input tensor from the current generated digit prefix.
Build a batched model input from one digit prefix per row.
Fitted adder predictor returned by the public trainer handle.
Greedily complete ndigit + 1 result digits from the operand digits.
The key detail is that when the current prefix has length k, the next-token prediction lives at
position k - 1, not always at the final padded position.
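The position detail can be made concrete with a sketch for the one-digit case. The model here is a hypothetical oracle that stands in for a trained network, and the zero padding token and fixed input length of 3 are assumptions of this example.

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def oracle_logits(padded_tokens, vocab=10):
    """Hypothetical stand-in for a trained causal model: one logit row per position."""
    a, b = padded_tokens[0], padded_tokens[1]
    answers = {1: (a + b) % 10, 2: (a + b) // 10}  # ones digit, then tens digit
    rows = []
    for pos in range(len(padded_tokens)):
        row = [0.0] * vocab
        row[answers.get(pos, 0)] = 1.0
        rows.append(row)
    return rows


def greedy_complete(model, a, b, seq_len=3):
    """Generate the two result digits for one-digit operands and decode the sum."""
    prefix = [a, b]
    for _ in range(2):
        k = len(prefix)
        padded = prefix + [0] * (seq_len - k)  # pad to the fixed input length
        rows = model(padded)
        prefix.append(argmax(rows[k - 1]))  # read at k - 1, not at the last position
    ones, tens = prefix[2], prefix[3]
    return 10 * tens + ones


print(greedy_complete(oracle_logits, 8, 7))  # 15
```

Reading the first step at the final padded position would return the tens digit instead of the ones digit, which is the mistake the note above warns about.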
Evaluate all 100 one-digit additions.
Evaluate all 100 additions with batched greedy decoding.
For one-digit operands, generation needs two result digits. We first predict the ones digit from
rows [a,b], append it, and then predict the carry/tens digit from rows [a,b,pred₀].
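A sketch of the two-pass batched evaluation, again with a hypothetical oracle in place of the trained model. The nested-list layout logits[row][position][digit] and the zero padding token are assumptions of this example.

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def oracle_batch(batch, vocab=10):
    """Hypothetical stand-in for a trained model: logits[row][position][digit]."""
    out = []
    for a, b, _ in batch:
        answers = {1: (a + b) % 10, 2: (a + b) // 10}
        out.append([[1.0 if d == answers.get(pos, 0) else 0.0 for d in range(vocab)]
                    for pos in range(3)])
    return out


def evaluate_table(model):
    """Exact-match count over all 100 one-digit additions, using two forward passes."""
    pairs = [(a, b) for a in range(10) for b in range(10)]
    # Pass 1: rows [a, b, pad]; the ones digit is read at position 1.
    logits = model([[a, b, 0] for a, b in pairs])
    ones = [argmax(rows[1]) for rows in logits]
    # Pass 2: rows [a, b, pred0]; the carry/tens digit is read at position 2.
    logits = model([[a, b, p] for (a, b), p in zip(pairs, ones)])
    tens = [argmax(rows[2]) for rows in logits]
    correct = sum(10 * t + o == a + b for (a, b), o, t in zip(pairs, ones, tens))
    return correct, len(pairs)


print(evaluate_table(oracle_batch))  # (100, 100)
```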
Batched exact-match score over all one-digit additions.
Adder-specific CLI options layered on top of the shared interactive text training flags.
- optim : TorchLean.optim.Kind
Optimizer. adamw is closest to minGPT's adder recipe. adam and sgd are useful for debugging and comparisons.
- a : ℕ
Operand a used by the highlighted addition check.
- b : ℕ
Operand b used by the highlighted addition check.
Extra comma-separated addition checks, e.g. 0+0,4+5,9+9.
- trainSplit : Bool
Train on an 80/20 train/test split instead of all 100 one-digit additions.
- overfitProbe : Bool
Train only the selected pair, useful for checking that the CUDA GPT can overfit one addition.
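As an illustration of the comma-separated check format (e.g. 0+0,4+5,9+9), a parser could look like the sketch below. The function name and the error handling are assumptions; the actual Lean parser may accept or reject inputs differently.

```python
def parse_checks(spec: str, ndigit: int = 1) -> list[tuple[int, int]]:
    """Parse "a+b,c+d,..." into a list of operand pairs."""
    checks = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        left, sep, right = item.partition("+")
        left, right = left.strip(), right.strip()
        if not sep or not left.isdigit() or not right.isdigit():
            raise ValueError(f"expected a+b, got {item!r}")
        a, b = int(left), int(right)
        if max(a, b) >= 10 ** ndigit:
            raise ValueError(f"operands in {item!r} exceed {ndigit} digit(s)")
        checks.append((a, b))
    return checks


print(parse_checks("0+0,4+5,9+9"))  # [(0, 0), (4, 5), (9, 9)]
```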
Parse adder-specific CLI options.
Standard TrainLog notes for the adder training loop.
Training/evaluation curriculum used by the adder runner.
- overfitPair : CurriculumMode
- trainSplit : CurriculumMode
- fullTable : CurriculumMode
Decide which curriculum the current adder options request.
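One plausible way to map the two boolean options onto the three curriculum modes is sketched below. The precedence (overfit probe first, then train split, else the full table) is an assumption of this example and is not confirmed by the documentation.

```python
from enum import Enum


class CurriculumMode(Enum):
    OVERFIT_PAIR = "overfitPair"
    TRAIN_SPLIT = "trainSplit"
    FULL_TABLE = "fullTable"


def select_curriculum(overfit_probe: bool, train_split: bool) -> CurriculumMode:
    # Assumed precedence: the overfit probe wins over the train/test split.
    if overfit_probe:
        return CurriculumMode.OVERFIT_PAIR
    if train_split:
        return CurriculumMode.TRAIN_SPLIT
    return CurriculumMode.FULL_TABLE


print(select_curriculum(False, True).value)  # trainSplit
```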
Startup note for the selected curriculum.
Training sample corresponding to the selected curriculum.
Per-step progress line for the selected curriculum.
Final evaluation line for the selected curriculum, if any.
Simple terminal REPL for the trained CUDA model.
Train the minGPT-style adder from scratch and report exact addition accuracy.