TorchLean API

NN.Examples.Models.Sequence.GptAdder

minGPT-Style Addition Demo

This file is a TorchLean-native version of the spirit of Karpathy's minGPT/projects/adder experiment. The original minGPT adder trains a compact GPT to complete digit strings of the form

digits(a) ++ digits(b) ++ reverseDigits(a+b).

For example, in the one-digit setting 8 + 7 = 15 is represented as the digit sequence 8 7 5 1. At inference time the model sees 8 7 and greedily generates the two result digits 5 1, which we reverse back to 15.
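As a sketch of this encoding (the helper names here are illustrative, not this module's actual definitions):

```lean
-- Sketch of the minGPT adder encoding; `digitsMSB` and `renderExample`
-- are illustrative names, not this module's actual definitions.
def digitsMSB (width n : Nat) : List Nat :=
  (List.range width).reverse.map fun i => (n / 10 ^ i) % 10

def renderExample (ndigit a b : Nat) : List Nat :=
  digitsMSB ndigit a ++ digitsMSB ndigit b ++
    (digitsMSB (ndigit + 1) (a + b)).reverse

#eval renderExample 1 8 7  -- [8, 7, 5, 1]
```

The sum gets one extra digit of width (`ndigit + 1`) to leave room for the carry, and is stored least-significant digit first.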

This is intentionally not a text chatbot. It is a controlled walkthrough of a single question: does the CUDA GPT training loop actually learn an algorithmic task?

Performance note: this uses the eager CUDA runtime, not a persistent CUDA graph. The heavy tensor operations run on the GPU, including fused attention when --fast-kernels is on, but each step still records a fresh autograd tape and synchronizes parameter refs through the scalar-trainer API. This is the correctness-facing example; full PyTorch-style throughput requires persistent device parameters plus compiled/fused graph execution.

The GPT-shaped architecture is constructed using the shared API helper in NN.API.Models.Gpt2 (nn.models.causalTransformerOneHot), so this file can stay focused on the adder task mechanics.

Reference: https://github.com/karpathy/minGPT/tree/master/projects/adder.

Runner subcommand name.

    Number of input digits per operand.

    We start with the one-digit curriculum because it is small enough to train quickly in the eager CUDA runtime while still including carry examples such as 8 + 7 = 15.

      Digit-only vocabulary, matching minGPT's adder task (0..9).

        Full one-digit table batch size.

        This is intentionally 100, not 1: very small scalar workloads underutilize GPUs. In all-pairs mode one optimizer step sees every one-digit addition problem, and evaluation completes the whole table with two batched greedy forward passes.
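The all-pairs table can be sketched as follows (`allPairs` is an illustrative name, not this module's definition):

```lean
-- All 100 one-digit addition problems in row-major order
-- (`a` varies slowest). Illustrative, not the module's definition.
def allPairs : List (Nat × Nat) :=
  (List.range 10).bind fun a => (List.range 10).map fun b => (a, b)

#eval allPairs.length  -- 100
```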

          Karpathy's adder uses a held-out split; for one digit this is 80 train / 20 test.

            Held-out one-digit examples when --train-split is enabled.

              GPT block size: a, b, and all but the final reversed result digit.

              For ndigit = 1, full rendered examples have length 1 + 1 + 2 = 4; model inputs have length 3, exactly as in minGPT's get_block_size = 3 * ndigit + 1 - 1.
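The same arithmetic, transcribed directly (the function name is illustrative):

```lean
-- minGPT's get_block_size: rendered length (3 * ndigit + 1)
-- minus the final target digit. For ndigit = 1 this is 3.
def blockSize (ndigit : Nat) : Nat := 3 * ndigit + 1 - 1

#eval blockSize 1  -- 3
```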

                Number of attention heads.

                Karpathy's minGPT default for the adder is gpt-nano (3 heads, width 48). TorchLean's eager CUDA trainer is tape-based, so we use a middle-sized model that is substantially larger than the original compact setup (1,050 params) while keeping torchlean gpt_adder practical to run.

                  Transformer embedding width.

                    Feed-forward hidden width (4 * dModel, matching the common GPT MLP ratio).

                      Number of Transformer blocks.

                        Number of positions per row that contribute to the minGPT adder loss.

                          Number of non-ignored next-token targets in the training batch.

                          The adder loss below masks ignored prefix positions to all-zero targets, then computes summed one-hot cross-entropy divided by this count. That matches minGPT's ignore_index=-1 normalization: average over active next-token labels, not over every (batch, position, vocab) entry.

                            Count scalar entries across a list of parameter shapes.

                                  Compact GPT-style causal Transformer for digit addition.

                                    Number of trainable scalar parameters in the current compile-time adder model.

                                      Number of parameter tensors in the current compile-time adder model.

                                        Cross-entropy summed over non-ignored adder targets, normalized like minGPT ignore_index.

                                          Adder-specific scalar loss.

                                          Ignored prefix positions are encoded by all-zero one-hot rows (maskAdderTargets), so they contribute exactly zero to one-hot cross entropy. We divide the summed loss by the number of active target positions, matching minGPT's ignore_index-style normalization rather than averaging over ignored prefix rows.

                                            Render n as exactly width base-10 digits, most-significant first.

                                              minGPT adder rendering.

                                              For ndigit = 1, a = 8, b = 7 becomes [8, 7, 5, 1], i.e. the sum 15 is stored reversed as 5, 1. Reversing the output digits makes carry propagation local in left-to-right generation.

                                                Karpathy/minGPT masks the loss on the operand-prefix positions.

                                                In projects/adder/adder.py, the target vector y is shifted by one token and then y[:ndigit*2-1] = -1, where -1 is PyTorch's "ignore index" for cross entropy. TorchLean's current one-hot cross entropy does not have an ignore-index target, so we represent the same idea by using an all-zero one-hot vector on ignored positions. Because the loss is -sum(y * log p), these positions contribute exactly zero gradient.
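In symbols (a restatement of the same idea, not new behavior): with shifted one-hot target rows y, masked to all-zero on the operand prefix, and softmax probabilities p, the loss is

```latex
\mathcal{L}
  = -\frac{1}{N_{\text{active}}}
    \sum_{r}\sum_{t}\sum_{v} y_{r,t,v}\,\log p_{r,t,v}
```

Rows with y_{r,t} = 0 contribute zero to both the sum and its gradient, and N_active counts only the non-ignored next-token targets, which reproduces PyTorch's ignore_index = -1 averaging.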

                                                  Apply the minGPT adder loss mask to a shifted one-hot target matrix.

                                                    Build one supervised next-digit sample from an addition problem.

                                                      Deterministic exhaustive one-digit dataset order.

                                                        Training row assignment. In split mode, rows repeat the first 80 train examples.

                                                          Parse a+b into a probe pair; returns none for non-one-digit prompts.

                                                            Comma-separated list of one-digit a+b probes.

                                                              Build a batched supervised sample with one row per one-digit addition problem.

                                                                Decode reversed generated result digits back into a natural number.
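Because the result digits are stored least-significant first, decoding is a right fold (the function name is illustrative, not this module's definition):

```lean
-- Decode least-significant-first digits back into a number.
-- `decodeReversed` is an illustrative name.
def decodeReversed (digits : List Nat) : Nat :=
  digits.foldr (fun d acc => acc * 10 + d) 0

#eval decodeReversed [5, 1]  -- 15
```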

                                                                  Argmax token id at sequence position pos.

                                                                    Argmax token id at a sequence position for a chosen batch row.

                                                                      Build a model input tensor from the current generated digit prefix.

                                                                        Build a batched model input from one digit prefix per row.

                                                                          Run a model forward through the eager runtime and return logits.

                                                                          We keep this helper local instead of importing the larger GPT-2 example module. That keeps the adder executable focused on the task-specific data and evaluation path.

                                                                            Greedily complete ndigit + 1 result digits from the operand digits.

                                                                            The key detail is that when the current prefix has length k, the next-token prediction lives at position k - 1, not always at the final padded position.
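A shape-level sketch of that loop, where `argmaxAt prefix pos` stands in for this module's real forward-and-argmax helpers:

```lean
-- Greedy completion sketch. `argmaxAt prompt (prompt.length - 1)`
-- reads the next-token prediction at the last real (unpadded)
-- position, not at the final padded position.
def greedyComplete (argmaxAt : List Nat → Nat → Nat) :
    List Nat → Nat → List Nat
  | prompt, 0 => prompt
  | prompt, steps + 1 =>
      let next := argmaxAt prompt (prompt.length - 1)
      greedyComplete argmaxAt (prompt ++ [next]) steps
```

For the one-digit adder, `steps` is `ndigit + 1 = 2`, so the loop appends the ones digit and then the carry/tens digit.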

                                                                              Predict a + b by greedy decoding and reversing the minGPT result digits.

• trainCorrect :
• testCorrect :
• allCorrect :

                                                                                  Evaluate all 100 additions with batched greedy decoding.

                                                                                  For one-digit operands, generation needs two result digits. We first predict the ones digit from rows [a,b], append it, and then predict the carry/tens digit from rows [a,b,pred₀].

                                                                                    Print one addition probe in the same digit convention used for training.

                                                                                      Optimizer choice for this addition run.

                                                                                        Local options for the adder runner.

• steps : ℕ

                                                                                          Number of optimizer steps.

• logEvery : ℕ

                                                                                          Print loss every logEvery steps.

                                                                                        • logPath : System.FilePath

                                                                                          JSON training log artifact path.

                                                                                        • optim : OptimKind

                                                                                          Optimizer.

                                                                                          adamw is closest to minGPT's adder recipe. adam and sgd are kept for debugging and comparisons.

                                                                                        • lr : Float

                                                                                          Learning rate.

• a : ℕ

                                                                                          Probe operand a.

• b : ℕ

                                                                                          Probe operand b.

• probes : List (ℕ × ℕ)

                                                                                          Extra comma-separated prompt probes, e.g. 0+0,4+5,9+9.

                                                                                        • trainSplit : Bool

                                                                                          Train on an 80/20 train/test split instead of all 100 one-digit additions.

                                                                                        • overfitProbe : Bool

                                                                                          Train only the probe pair, useful for verifying the CUDA GPT can overfit one addition.

                                                                                        • interactive : Bool

                                                                                          Keep the trained CUDA model alive and read a+b prompts from stdin.

                                                                                          Force CUDA and fused kernels, because this example is meant to exercise the GPU path.

                                                                                            Train the minGPT-style adder from scratch and report exact addition accuracy.
