TorchLean API

NN.Examples.Models.Sequence.Gpt2

GPT-2-Style Causal Language Model Example #

A runnable torchlean gpt2 example. It builds a small GPT-2-style causal Transformer over byte-level tokens, with optional real-text input from --tiny-shakespeare or --data-file PATH.

If you are looking for the simplest "Karpathy-style single text file" path, start with torchlean chargpt (character-level tokenizer). This gpt2 example is byte-level and is meant to show the Transformer block wiring and save/reload loop.

python3 scripts/datasets/download_example_data.py --tiny-shakespeare
lake build -R -K cuda=true && lake exe torchlean gpt2 --cuda --tiny-shakespeare --steps 100

Small batch size.

The executable intentionally overfits a small real-text slice rather than attempting a full pretraining run: it demonstrates that the full TorchLean stack can run a causal Transformer, update parameters, and decode logits back to text.

Instances For

    Prompt/target window length.

    A window of sixty-four byte tokens is still small enough for local eager-CUDA runs, yet it gives the miniature Transformer enough local context to learn short names, line breaks, speaker prefixes, and a little phrase structure in Tiny Shakespeare. Shorter windows are useful for parser/kernel checks but underrepresent the model stack during text generation.

    Instances For
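As a plain-Python sketch (illustrative only, not the TorchLean implementation), slicing a byte corpus into (seqLen + 1)-token windows so that every position has a next-token target might look like:

```python
def windows(data: bytes, seq_len: int, stride: int):
    """Yield (seq_len + 1)-byte windows: seq_len prompt tokens plus one
    extra byte so each position has a shifted next-token target."""
    n = seq_len + 1
    for start in range(0, len(data) - n + 1, stride):
        yield list(data[start:start + n])

text = b"First Citizen:\nBefore we proceed any further, hear me speak.\n"
ws = list(windows(text, seq_len=8, stride=8))
# every window carries 9 byte tokens: 8 inputs and 8 shifted targets
```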

      Byte-level vocabulary size. Each UTF-8 byte is one token.

      Instances For
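Byte-level tokenization needs no learned vocabulary: the vocabulary is simply the 256 possible byte values. A minimal Python sketch of this scheme (not TorchLean's code) is:

```python
VOCAB_SIZE = 256  # one token id per possible UTF-8 byte

def encode(text: str) -> list[int]:
    """Byte-level tokenize: each UTF-8 byte becomes one token id in [0, 255]."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map token ids back to bytes; invalid sequences are replaced, since a
    model can emit byte combinations that are not valid UTF-8."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("Héllo")
# "é" occupies two UTF-8 bytes, so 5 characters encode to 6 tokens
```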

        Number of attention heads in the miniature Transformer block.

        Instances For

          Per-head embedding width. The model dimension is numHeads * headDim.

          We keep the default small so the tutorial finishes locally. A wider dModel = 64 variant runs, but in the current eager-CUDA training loop it is slower and did not improve the 2k-step Shakespeare sample enough to justify making it the default. Use this file to inspect Transformer/autograd behavior; use the Mamba example when the goal is the cleanest compact text sample.

          Instances For

            Transformer embedding width.

            Instances For

              Hidden width of the feed-forward sublayer.

              Instances For

                Number of Transformer encoder blocks.

                Instances For

                  Conventional local path for the Tiny Shakespeare text corpus.

                  Instances For

                    Conventional local path for the TinyStories validation slice.

                    Instances For

                      Shared data-preparation hint for the GPT text examples.

                      Instances For
                        @[reducible, inline]
                        Instances For
                          @[reducible, inline]
                          Instances For

                            Build a batch sample from per-row token windows.

                             idsByBatch[i] is the (seqLen + 1)-token window for batch row i. If fewer than batch windows are provided, the last window is repeated; callers should normally pass exactly batch windows.

                            Instances For
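The repeat-last behavior and the input/target split described above can be sketched in plain Python (illustrative, not the TorchLean code):

```python
def make_batch(ids_by_batch: list[list[int]], batch: int, seq_len: int):
    """Build (inputs, targets) from per-row (seq_len + 1)-token windows.
    If fewer than `batch` windows are given, the last one is repeated."""
    rows = list(ids_by_batch)
    while len(rows) < batch:
        rows.append(rows[-1])
    rows = rows[:batch]
    inputs = [r[:seq_len] for r in rows]         # tokens 0 .. seq_len-1
    targets = [r[1:seq_len + 1] for r in rows]   # shifted one position
    return inputs, targets

x, y = make_batch([[1, 2, 3, 4, 5]], batch=2, seq_len=4)
# both rows are identical because the single provided window was repeated
```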

                              Build one next-token-prediction sample from text.

                              Instances For

                                Parse GPT-2-specific data flags and return the training corpus plus remaining runtime flags.

                                Instances For
                                  Instances For

                                    Print a compact before/after language-model probe for the first batch row.

                                    Instances For

                                      Apply a lightweight repetition penalty during decoding.

                                      This is intentionally a generation-side control, not a training shortcut. This compact GPT-2-style example can learn the local next-token objective but still fall into byte-level loops such as oooooo; reducing the logits of recently emitted bytes makes the example's sampled text reflect more of the learned distribution instead of the first local attractor it finds.

                                      Instances For
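A lightweight repetition penalty of this kind can be sketched in plain Python; the subtractive rule below is an assumption for illustration, not necessarily TorchLean's exact formula:

```python
def penalize_repeats(logits: list[float], recent: list[int],
                     penalty: float) -> list[float]:
    """Subtract `penalty` from the logits of recently emitted token ids,
    making byte loops like 'oooooo' less likely under greedy decoding."""
    out = list(logits)
    for tok in set(recent):
        out[tok] -= penalty
    return out

def greedy(logits: list[float]) -> int:
    """Argmax token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.0, 2.0, 1.9, 0.5]
# token 1 wins unpenalized; after penalizing recent token 1, token 2 wins
```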
                                        def NN.Examples.Models.Sequence.Gpt2.greedyTokenAt (logits : Spec.Tensor Float σ) (pos : ℕ) (recent : List ℕ := []) (repeatPenalty : Float := 0.0) (asciiOnly : Bool := false) :
                                        Instances For
                                          def NN.Examples.Models.Sequence.Gpt2.sampleFromLogitsAt (logits : Spec.Tensor Float σ) (pos : ℕ) (temperature : Float) (topK seed counter : ℕ) (recent : List ℕ := []) (repeatPenalty : Float := 0.0) (asciiOnly : Bool := false) :
                                          Instances For
                                            def NN.Examples.Models.Sequence.Gpt2.generateSampledFromIds (opts : Runtime.Autograd.Torch.Options) (model : API.nn.Sequential σ τ) (params : Runtime.Autograd.Torch.ParamList Float (Runtime.Autograd.TorchLean.NN.Seq.paramShapes model)) (promptIds : List ℕ) (steps : ℕ) (temperature : Float) (topK seed repeatWindow : ℕ) (repeatPenalty : Float) (asciiOnly : Bool) :
                                            Instances For
                                              partial def NN.Examples.Models.Sequence.Gpt2.generateSampledFromIds.loop (opts : Runtime.Autograd.Torch.Options) (model : API.nn.Sequential σ τ) (params : Runtime.Autograd.Torch.ParamList Float (Runtime.Autograd.TorchLean.NN.Seq.paramShapes model)) (steps : ℕ) (temperature : Float) (topK seed repeatWindow : ℕ) (repeatPenalty : Float) (asciiOnly : Bool) (ids : List ℕ) :
                                              IO (List ℕ)
                                              def NN.Examples.Models.Sequence.Gpt2.generateSampled (opts : Runtime.Autograd.Torch.Options) (model : API.nn.Sequential σ τ) (params : Runtime.Autograd.Torch.ParamList Float (Runtime.Autograd.TorchLean.NN.Seq.paramShapes model)) (prompt : String) (steps : ℕ) (temperature : Float) (topK seed repeatWindow : ℕ) (repeatPenalty : Float) (asciiOnly : Bool) :
                                              Instances For
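The temperature/top-k sampling used by the generation helpers above can be sketched in plain Python with a seeded RNG for reproducible decoding (an illustrative sketch, not the TorchLean functions):

```python
import math
import random

def sample_top_k(logits: list[float], temperature: float,
                 top_k: int, seed: int) -> int:
    """Sample one token id: keep the top_k highest logits, apply temperature,
    softmax over the survivors, then draw with a seeded RNG."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:max(1, top_k)]
    scaled = [logits[i] / max(temperature, 1e-8) for i in kept]
    m = max(scaled)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.Random(seed).choices(kept, weights=probs, k=1)[0]

tok = sample_top_k([0.1, 3.0, 2.5, -1.0], temperature=0.8, top_k=2, seed=0)
# only token ids 1 and 2 are ever eligible with top_k=2
```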

                                                Compact interactive prompt loop for the in-memory Float model.

                                                This is a diagnostic REPL, not pretrained text generation. Each line is interpreted as one causal LM window, and the model prints the per-position argmax prediction for that window.

                                                Instances For
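The per-position argmax the diagnostic REPL prints can be sketched in plain Python over a [seq_len, vocab] logits window (illustrative only, not the TorchLean code):

```python
def argmax_per_position(logits: list[list[float]]) -> list[int]:
    """For a [seq_len, vocab] logits window, return the predicted
    next-token id at every position."""
    return [max(range(len(row)), key=lambda v: row[v]) for row in logits]

window = [
    [0.1, 0.2, 0.9],  # position 0 -> token 2
    [0.8, 0.1, 0.0],  # position 1 -> token 0
]
preds = argmax_per_position(window)
```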

                                                  Float-specialized training path with decoded prediction probes.

                                                  The CUDA executable uses Lean Float tensors, so this branch can show actual prompt, target, and predicted text before and after training. The polymorphic path above is still used for non-Float dtype smoke runs.

                                                  Instances For