TorchLean API

NN.Examples.Models.Sequence.GptAdder

minGPT-Style Addition Example

This is a TorchLean-native example in the spirit of Karpathy's minGPT/projects/adder experiment. The original minGPT adder trains a compact GPT to complete digit strings of the form

digits(a) ++ digits(b) ++ reverseDigits(a+b).

For example, in the one-digit setting 8 + 7 = 15 is represented as the digit sequence 8 7 5 1. At inference time the model sees 8 7 and greedily generates the two result digits 5 1, which we reverse back to 15.

This is not a text chatbot. It is a controlled algorithmic sequence task for the CUDA GPT training loop.

Performance note: this uses the eager CUDA runtime, not a persistent CUDA graph. The heavy tensor operations run on the GPU, including fused attention when --fast-kernels is on, but each step still records a fresh autograd tape and synchronizes parameter refs through the current scalar training bridge. This is the correctness-facing example; full PyTorch-style throughput requires persistent device parameters plus compiled/fused graph execution.

The GPT-shaped architecture is constructed through the public TorchLean model constructor nn.models.CausalTransformerOneHot, so the example can stay focused on the adder task mechanics.

Reference: https://github.com/karpathy/minGPT/tree/master/projects/adder.

CLI subcommand label used by the shared model runner.

Default JSON loss-curve path for this command.

Number of input digits per operand.

We start with the one-digit curriculum because it trains directly in the eager CUDA runtime while still including carry examples such as 8 + 7 = 15.

Digit-only vocabulary, matching minGPT's adder task (0..9).

Full one-digit table batch size.

This is 100, not 1: scalar-sized GPU workloads underutilize the device. In all-pairs mode one optimizer step sees every one-digit addition problem, and evaluation completes the whole table with two batched greedy forward passes.

Karpathy's adder uses a held-out split; for one digit this is 80 train / 20 test.
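The 80 / 20 figure follows from the split rule in minGPT's adder, which holds out 20% of all problems, capped at 500. A minimal Python sketch of that arithmetic; the function name `split_sizes` and its parameters are illustrative and are not part of TorchLean or minGPT:

```python
def split_sizes(ndigit: int, test_frac: float = 0.2, test_cap: int = 500):
    """Return (num_train, num_test) over all ndigit-by-ndigit addition problems."""
    num_problems = (10 ** ndigit) ** 2          # every (a, b) pair
    num_test = min(int(num_problems * test_frac), test_cap)
    return num_problems - num_test, num_test


print(split_sizes(1))  # (80, 20)
print(split_sizes(2))  # (9500, 500)
```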

Held-out one-digit examples when --train-split is enabled.

GPT block size: a, b, and all but the final reversed result digit.

For ndigit = 1, full rendered examples have length 1 + 1 + 2 = 4; model inputs have length 3, exactly as in minGPT's get_block_size = 3 * ndigit + 1 - 1.
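The length bookkeeping can be checked with a few lines of Python. This is only a sketch of the arithmetic; `rendered_length` and `block_size` are hypothetical helper names, not TorchLean declarations:

```python
def rendered_length(ndigit: int) -> int:
    """Digits of a, digits of b, and ndigit + 1 digits of the reversed sum."""
    return ndigit + ndigit + (ndigit + 1)


def block_size(ndigit: int) -> int:
    """Model input length: the rendered example minus its final token."""
    return 3 * ndigit + 1 - 1


for n in (1, 2, 3):
    # The model never needs the last result digit as an input.
    assert block_size(n) == rendered_length(n) - 1
    print(n, rendered_length(n), block_size(n))
```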

Number of attention heads.

Karpathy's minGPT default for the adder is gpt-nano (3 heads, width 48). TorchLean's eager CUDA trainer is tape-based, so we use a middle-sized model that is substantially larger than the original compact setup (1,050 params) while keeping torchlean gpt_adder practical to run.

Transformer embedding width.

Feed-forward hidden width (4 * dModel, matching the common GPT MLP ratio).

Number of Transformer blocks.

Number of positions per row that contribute to the minGPT adder loss.

Number of non-ignored next-token targets in the training batch.

The adder loss below masks ignored prefix positions to all-zero targets, then computes summed one-hot cross-entropy divided by this count. That matches minGPT's ignore_index=-1 normalization: average over active next-token labels, not over every (batch, position, vocab) entry.

Count scalar entries across a list of parameter shapes.
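As an illustration of the counting rule, each shape contributes the product of its dimensions, and the total is the sum over shapes. The Python below is a sketch; `count_scalars` and the example shapes are made up for illustration and do not reflect the actual model's parameter list:

```python
from math import prod


def count_scalars(shapes) -> int:
    """Sum of per-shape element counts; an empty shape () counts as one scalar."""
    return sum(prod(shape) for shape in shapes)


# Hypothetical shapes: a 10 x 48 matrix, a length-48 bias, and one scalar.
example_shapes = [(10, 48), (48,), ()]
print(count_scalars(example_shapes))  # 529
```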

GPT configuration shared by the typed shapes and model constructor.

@[reducible, inline]

Input shape: batched one-hot digit sequences.

@[reducible, inline]

Output shape: one digit-logit row per input position.

Compact GPT-style causal Transformer for digit addition.

Cross-entropy summed over non-ignored adder targets, normalized like minGPT ignore_index.

Adder-specific scalar loss.

Ignored prefix positions are encoded by all-zero one-hot rows (maskAdderTargets), so they contribute exactly zero to one-hot cross entropy. We divide the summed loss by the number of active target positions, matching minGPT's ignore_index-style normalization rather than averaging over ignored prefix rows.
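The normalization can be sketched in plain Python. This is not the TorchLean implementation: `log_softmax` and `adder_loss` are illustrative names, and the sketch counts active rows at run time, whereas the library documentation describes that count as a separate constant.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def adder_loss(logit_rows, target_rows):
    """Summed one-hot cross entropy divided by the number of non-zero target rows."""
    total, active = 0.0, 0
    for logits, target in zip(logit_rows, target_rows):
        logp = log_softmax(logits)
        total += -sum(t * lp for t, lp in zip(target, logp))  # zero for all-zero rows
        active += 1 if any(target) else 0
    if active == 0:
        raise ValueError("no active target positions")
    return total / active


def one_hot(token, vocab=10):
    return [1.0 if i == token else 0.0 for i in range(vocab)]


# One row of the 8 + 7 example: targets 7, 5, 1 with the first position ignored.
logits = [[0.0] * 10 for _ in range(3)]          # uniform predictions
targets = [[0.0] * 10, one_hot(5), one_hot(1)]   # masked prefix, then 5 and 1
loss = adder_loss(logits, targets)
print(loss)  # log(10): averaged over 2 active rows, not over all 3
```

With uniform predictions every active position costs log(10), so the average over active positions is also log(10); dividing by all three rows would instead give two thirds of that.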

Render n as exactly width base-10 digits, most-significant first.

minGPT adder rendering.

For ndigit = 1, a = 8, b = 7 becomes [8, 7, 5, 1], i.e. the sum 15 is stored reversed as 5, 1. Reversing the output digits makes carry propagation local in left-to-right generation.
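A small Python sketch of this rendering, under the assumption that operands are zero-padded to ndigit digits and the sum to ndigit + 1 digits. The names `digits` and `render_addition` are illustrative, and rejecting values that do not fit the width is a choice made for this sketch only:

```python
def digits(n: int, width: int):
    """Render n as exactly `width` base-10 digits, most-significant first."""
    if n < 0 or n >= 10 ** width:
        raise ValueError(f"{n} does not fit in {width} digits")
    return [int(c) for c in str(n).zfill(width)]


def render_addition(a: int, b: int, ndigit: int):
    """digits(a) ++ digits(b) ++ reversed digits of a + b."""
    result = digits(a + b, ndigit + 1)
    return digits(a, ndigit) + digits(b, ndigit) + result[::-1]


print(render_addition(8, 7, 1))    # [8, 7, 5, 1]
print(render_addition(3, 4, 1))    # [3, 4, 7, 0]
print(render_addition(12, 93, 2))  # [1, 2, 9, 3, 5, 0, 1]
```

Because the least-significant result digit comes first, each generated digit depends only on operand digits and on the carry from digits already emitted.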

Karpathy/minGPT masks the loss on the operand-prefix positions.

In projects/adder/adder.py, the target vector y is shifted by one token and then y[:ndigit*2-1] = -1, where -1 is PyTorch's "ignore index" for cross entropy. TorchLean's current one-hot cross entropy does not have an ignore-index target, so we represent the same idea by using an all-zero one-hot vector on ignored positions. Because the loss is -sum(y * log p), these positions contribute exactly zero loss and therefore zero gradient.
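The shift-and-mask step can be sketched as follows. This is plain Python over nested lists, not the TorchLean tensor API; `one_hot` and `masked_sample` are hypothetical names used only here:

```python
def one_hot(token: int, vocab: int = 10):
    return [1 if i == token else 0 for i in range(vocab)]


def masked_sample(tokens, ndigit: int, vocab: int = 10):
    """Build (inputs, targets) for next-token training with the prefix mask applied."""
    inputs = tokens[:-1]            # everything except the last token
    shifted = tokens[1:]            # next-token labels
    ignore = 2 * ndigit - 1         # same count as y[:ndigit*2-1] = -1
    x = [one_hot(t, vocab) for t in inputs]
    y = [[0] * vocab if i < ignore else one_hot(t, vocab)
         for i, t in enumerate(shifted)]
    return x, y


x, y = masked_sample([8, 7, 5, 1], ndigit=1)
print([row.index(1) for row in x])                     # [8, 7, 5]
print([row.index(1) if any(row) else None for row in y])  # [None, 5, 1]
```

For the 8 + 7 example only the positions that predict 5 and 1 carry a label; the position that would predict the second operand digit is all zeros.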

Apply the minGPT adder loss mask to a shifted one-hot target matrix.

Build one unbatched one-hot causal-LM sample for an addition row, then apply the minGPT-style ignored-prefix mask to its target matrix.

Build one supervised next-digit sample from an addition problem.

Deterministic exhaustive one-digit dataset order.

Training row assignment. In split mode, rows repeat the first 80 train examples.

Parse a+b into a one-digit operand pair; returns none for malformed prompts.
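A hedged Python sketch of this kind of parser. The name `parse_one_digit_sum` is hypothetical, and tolerating surrounding whitespace is an assumption of this sketch rather than documented behavior:

```python
DIGITS = "0123456789"


def parse_one_digit_sum(prompt: str):
    """Return (a, b) for prompts like '8+7', or None if the prompt is malformed."""
    parts = prompt.strip().split("+")
    if len(parts) != 2:
        return None
    left, right = parts[0].strip(), parts[1].strip()
    # Each operand must be exactly one ASCII digit.
    if len(left) != 1 or len(right) != 1:
        return None
    if left not in DIGITS or right not in DIGITS:
        return None
    return int(left), int(right)


print(parse_one_digit_sum("8+7"))   # (8, 7)
print(parse_one_digit_sum("12+3"))  # None
```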

Comma-separated list of one-digit a+b checks.

Build a batched supervised sample with one row per one-digit addition problem.

Decode reversed generated result digits back into a natural number.
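Decoding is the inverse of the reversed rendering: the digit at index i has place value 10^i. A minimal Python sketch, with `decode_reversed` as an illustrative name:

```python
def decode_reversed(result_digits) -> int:
    """Interpret digits given least-significant first as a natural number."""
    value = 0
    for place, digit in enumerate(result_digits):
        value += digit * 10 ** place
    return value


print(decode_reversed([5, 1]))     # 15
print(decode_reversed([7, 0]))     # 7
print(decode_reversed([5, 0, 1]))  # 105
```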

Argmax token id at sequence position pos.

Argmax token id at a sequence position for a chosen batch row.

Build a model input tensor from the current generated digit prefix.

Build a batched model input from one digit prefix per row.

@[reducible, inline]

Fitted adder predictor returned by the public trainer handle.

Greedily complete ndigit + 1 result digits from the operand digits.

The key detail is that when the current prefix has length k, the next-token prediction lives at (zero-based) position k - 1, not always at the final padded position.
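The position bookkeeping can be illustrated with a stand-in model. Everything below is a sketch: `make_oracle` builds a fake causal "model" that already knows the answer, `greedy_complete` is a hypothetical name, and padding with digit 0 is an assumption of this sketch, not a statement about how TorchLean pads its inputs.

```python
def make_oracle(ndigit: int, vocab: int = 10):
    """Stand-in for a trained model: returns one logit row per input position."""
    def model(padded):
        a = int("".join(map(str, padded[:ndigit])))
        b = int("".join(map(str, padded[ndigit:2 * ndigit])))
        total = a + b
        rows = []
        for pos in range(len(padded)):
            row = [0.0] * vocab
            j = pos - (2 * ndigit - 1)       # index of the result digit predicted here
            if j >= 0:
                row[(total // 10 ** j) % 10] = 1.0
            rows.append(row)
        return rows
    return model


def greedy_complete(model, operand_digits, ndigit: int, pad: int = 0):
    """Generate ndigit + 1 reversed result digits, reading logits at position k - 1."""
    block = 3 * ndigit
    prefix = list(operand_digits)
    for _ in range(ndigit + 1):
        k = len(prefix)
        padded = prefix + [pad] * (block - k)
        row = model(padded)[k - 1]           # not necessarily the last position
        prefix.append(max(range(len(row)), key=row.__getitem__))
    return prefix[2 * ndigit:]


print(greedy_complete(make_oracle(1), [8, 7], ndigit=1))        # [5, 1]
print(greedy_complete(make_oracle(2), [1, 2, 9, 3], ndigit=2))  # [5, 0, 1]
```

Reading the final padded position on the first step would return the last result digit instead of the first, which is why the index follows the prefix length.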

Predict a + b by greedy decoding and reversing the minGPT result digits.

Evaluate all 100 one-digit additions.

Exact-match counts for train/test/all one-digit addition rows.

• trainCorrect: Correct rows in the training split.
• testCorrect: Correct rows in the held-out split.
• allCorrect: Correct rows across all one-digit additions.

Evaluate all 100 additions with batched greedy decoding.

For one-digit operands, generation needs two result digits. We first predict the ones digit from rows [a,b], append it, and then predict the carry/tens digit from rows [a,b,pred₀].
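The two-pass scheme can be sketched in Python with a fake batched model. The names `oracle_batch`, `argmax`, and `evaluate_all` are illustrative, the a-major row order is an assumption, and the oracle simply returns the right answer so that the bookkeeping is easy to check:

```python
def oracle_batch(rows, vocab: int = 10):
    """Fake batched model: logits[row][position][token] for padded rows [a, b, c]."""
    out = []
    for a, b, _ in rows:
        total = a + b
        ones = [0.0] * vocab
        ones[total % 10] = 1.0
        tens = [0.0] * vocab
        tens[total // 10] = 1.0
        out.append([[0.0] * vocab, ones, tens])
    return out


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def evaluate_all(model):
    """Count exact matches over all 100 one-digit additions using two forward passes."""
    pairs = [(a, b) for a in range(10) for b in range(10)]
    # Pass 1: prefix [a, b] has length 2, so read position 1 for the ones digit.
    logits = model([[a, b, 0] for a, b in pairs])
    ones = [argmax(row[1]) for row in logits]
    # Pass 2: prefix [a, b, pred0] has length 3, so read position 2 for the tens digit.
    logits = model([[a, b, d0] for (a, b), d0 in zip(pairs, ones)])
    tens = [argmax(row[2]) for row in logits]
    return sum(1 for (a, b), d0, d1 in zip(pairs, ones, tens)
               if d0 + 10 * d1 == a + b)


print(evaluate_all(oracle_batch))  # 100
```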

Batched exact-match score over all one-digit additions.

Print one addition check in the same digit convention used for training.

Adder-specific CLI options layered on top of the shared interactive text training flags.

Default extra addition checks shown after training when --probes is omitted.

Standard TrainLog notes for the adder training loop.

Training/evaluation curriculum used by the adder runner.

Decide which curriculum the current adder options request.

Startup note for the selected curriculum.

Training sample corresponding to the selected curriculum.

Per-step progress line for the selected curriculum.

Final evaluation line for the selected curriculum, if any.

Simple terminal REPL for the trained CUDA model.

Train the minGPT-style adder from scratch and report exact addition accuracy.

CLI entrypoint for the CUDA GPT adder command.
