TorchLean API

NN.Examples.Models.Sequence.TextGpt2

GPU GPT-2 Corpus Trainer

This file trains GPT-2-style models from text in TorchLean.

The model is initialized inside TorchLean and trained by the TorchLean runtime; it does not load a pretrained PyTorch/Hugging Face checkpoint.

The default path is byte-level because it is compact and fast. Passing --bpe-vocab and --bpe-merges switches to the Lean-native GPT-2 BPE tokenizer, using the standard 50,257-way GPT-2 token vocabulary. That BPE path is still training from scratch in TorchLean; it does not load a pretrained checkpoint.
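To make the byte-level path concrete: "one token per byte" means tokenization is nothing more than UTF-8 byte extraction, giving a fixed 256-entry vocabulary. A minimal sketch in plain Lean (illustrative only, not the file's actual tokenizer):

```lean
-- Illustrative only: byte-level tokenization maps each UTF-8 byte of the
-- input to a token id in [0, 256), so the vocabulary has 256 entries.
def byteTokens (s : String) : Array Nat :=
  s.toUTF8.data.map (·.toNat)

#eval byteTokens "Hi"  -- #[72, 105]
```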

Runner subcommand name. This subcommand trains a GPT-2-style model from scratch.

Minimum corpus size for the default public training path: 100 MiB.

Default byte-level context window for the CUDA corpus trainer.

Keeping this near the top of the file lets corpus validation and the model architecture agree without depending on declaration order.

Parsed local options for the corpus trainer.

• dataFile : System.FilePath

  UTF-8 or raw-byte text corpus.

• Shared step count and TrainLog destination.

• finetuneFile? : Option System.FilePath

  Optional second corpus for fine-tuning after the main corpus pass.

• finetuneSteps :

  Number of optimizer steps on the fine-tuning corpus.

• logEvery :

  Print loss every logEvery steps. 0 disables progress logging.

• allowSmallData : Bool

  Allow small files for bounded local checks.

• bpeVocab? : Option System.FilePath

  Optional GPT-2 vocab.json path. Supplying this plus bpeMerges? enables BPE mode.

• bpeMerges? : Option System.FilePath

  Optional GPT-2 merges.txt path. Supplying this plus bpeVocab? enables BPE mode.

• prompt : String

  Prompt used for the post-training generation probe.

• generate :

  Number of autoregressive BPE tokens to generate in the post-training probe.

• interactive : Bool

  Keep the trained CUDA model alive and read prompts from stdin.

• maxChars? : Option

  Optional text-character cap for bounded BPE runs.
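Pieced together from the field list above, the options record has roughly the following shape. This is a hypothetical reconstruction: the structure name, the defaults, and the types left unannotated in the docs (marked "assumed") are guesses, not the file's actual declaration.

```lean
-- Hypothetical sketch of the corpus-trainer options record.
structure Opts where
  dataFile       : System.FilePath               -- UTF-8 or raw-byte corpus
  finetuneFile?  : Option System.FilePath := none
  finetuneSteps  : Nat := 0                      -- type assumed
  logEvery       : Nat := 0                      -- 0 disables progress logging
  allowSmallData : Bool := false
  bpeVocab?      : Option System.FilePath := none  -- vocab.json
  bpeMerges?     : Option System.FilePath := none  -- merges.txt
  prompt         : String := ""
  generate       : Nat := 0                      -- type assumed
  interactive    : Bool := false
  maxChars?      : Option Nat := none            -- element type assumed
```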

Parse options owned by this example; runtime flags are parsed by TorchLean.Module.run.

Force the runner into the intended CUDA configuration.

Users should not have to remember --cuda --fast-kernels for this example. We still reject --cpu explicitly, because silently switching to CPU would make a large text-training run look hung instead of failing fast with a clear configuration error.

Read the primary raw text corpus.

Build a supervised next-token sample from already-tokenized ids.

def NN.Examples.Models.Sequence.TextGpt2.mkSampleFromTokenRowsWith {α : Type} [API.Semantics.Scalar α] [API.Runtime.Scalar α] {batch seqLen vocab : ℕ} (tokensAt : Fin batch → List ℕ) :
    API.sample.Supervised α (Tensor.shapeOfDims [batch, seqLen, vocab]) (Tensor.shapeOfDims [batch, seqLen, vocab])
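The core of that construction can be sketched without the tensor machinery: a supervised next-token sample pairs each position's token id with the id that follows it (the real helper then one-hot encodes both sides into the shapes above). Illustrative only, hypothetical name:

```lean
-- Illustrative only: shifted-by-one (input, target) id pairs for one row.
def nextTokenPairs (window : Array Nat) : Array (Nat × Nat) :=
  (Array.range (window.size - 1)).map fun i => (window[i]!, window[i + 1]!)

#eval nextTokenPairs #[10, 11, 12, 13]  -- #[(10, 11), (11, 12), (12, 13)]
```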

Byte-level vocabulary: one token per byte.

Single-sequence batches keep the example small and fully interactive.

Interactive context window.

This shares the folder-level byte-context constant so that corpus validation, byte training, and BPE training use the same tensor layout. Larger windows require more allocator headroom, which is not something we should quietly make the default before allocator pressure is solved.

Small two-head Transformer width.

Transformer embedding width.

Feed-forward hidden width.

Number of Transformer blocks.

Runnable byte-level GPT-style model for corpus pretraining and fine-tuning.

This is deliberately compact, but it has enough context to make the interactive prompt loop useful for quick local experiments.

Compact vocabulary used by the runnable BPE training path.

The tokenizer still uses GPT-2's real 50,257-token BPE files. For the small Lean/CUDA smoke model, we project the corpus tokens into a local vocabulary built from the first observed BPE ids. This keeps the example interactive while preserving the tokenizer/data path; a full 50k-way output head would be a much larger training run.

Keep the BPE smoke model aligned with the small byte-level GPT-2 path.

Short context window used by the trainer.

Number of attention heads in the miniature BPE Transformer.

Per-head width. The model is intentionally compact even though the vocabulary is real GPT-2.

Transformer embedding width.

Feed-forward hidden width.

Number of Transformer blocks.

Compact GPT-2-style model with the real GPT-2 BPE vocabulary.

This is not OpenAI GPT-2-small. It is a TorchLean-native miniature Transformer whose input/output vocabulary matches GPT-2 BPE, so tokenizer and probing behavior are realistic while the model remains small enough for a local smoke run.

Local projection from original GPT-2 BPE ids to the compact working vocabulary.

• originals : Array

  Original GPT-2 id for each local id.

• toLocalMap : Std.HashMap

  Reverse lookup from original GPT-2 id to local id.

Number of live entries in a local BPE projection.

Map an original GPT-2 BPE id into the compact local vocabulary, using local id 0 as OOV.

Map a compact local id back to its original GPT-2 BPE id.

Build the compact working vocabulary from corpus ids and prompt ids.
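The projection described above (forward map, reverse array, OOV bucket, first-observed construction) can be sketched in plain Lean. All names here are hypothetical; the file's actual declarations differ in detail:

```lean
import Std.Data.HashMap

-- Illustrative only: a local projection from original GPT-2 BPE ids
-- to a compact working vocabulary. Local id 0 is the OOV bucket.
structure LocalBpe where
  originals  : Array Nat             -- original GPT-2 id for each local id
  toLocalMap : Std.HashMap Nat Nat   -- original id → local id

def LocalBpe.toLocal (p : LocalBpe) (orig : Nat) : Nat :=
  p.toLocalMap.getD orig 0           -- unknown ids fall back to OOV

def LocalBpe.toOriginal (p : LocalBpe) (localId : Nat) : Nat :=
  p.originals.getD localId 0

-- Assign local ids to the first distinct original ids observed,
-- capped at maxVocab entries (slot 0 stays reserved for OOV).
def LocalBpe.build (ids : Array Nat) (maxVocab : Nat) : LocalBpe := Id.run do
  let mut p : LocalBpe := ⟨#[0], {}⟩
  for i in ids do
    if p.originals.size < maxVocab && !p.toLocalMap.contains i then
      p := ⟨p.originals.push i, p.toLocalMap.insert i p.originals.size⟩
  return p
```

In this sketch `build` would be fed the concatenated corpus and prompt ids, so every id the probe can encounter has a local slot before training starts.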

Apply a local BPE projection to an array of original GPT-2 ids.

Build one BPE training sample from a tokenized corpus.

Argmax token id at the final context position.

Greedy BPE generation: repeatedly feed the last seqLen tokens and append the final-position argmax. This is a compact diagnostic loop, not a high-quality sampler.
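The greedy loop itself is model-agnostic. Assuming a hypothetical step function from the current context to next-token logits (a stand-in for the real CUDA forward pass), it can be sketched as:

```lean
-- Illustrative only: greedy decoding that feeds the last `seqLen` ids
-- and appends the argmax of the final-position logits each step.
def greedyGenerate (step : Array Nat → Array Float)
    (seqLen steps : Nat) (prompt : Array Nat) : Array Nat := Id.run do
  let mut ids := prompt
  for _ in [0:steps] do
    let ctx := ids.extract (ids.size - min seqLen ids.size) ids.size
    let logits := step ctx
    let mut best := 0
    for j in [1:logits.size] do
      if logits[j]! > logits[best]! then best := j
    ids := ids.push best
  return ids
```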

Train the GPT-2-style model over a text corpus using CUDA.

This intentionally performs one optimizer step per corpus window rather than materializing the entire dataset in memory. The example is still compact by GPT-2 standards, but the data path is real: file bytes → token windows → one-hot tensors → TorchLean CUDA training.
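The "one optimizer step per corpus window" indexing can be sketched as follows (hypothetical helpers; each window holds seqLen input ids plus one shifted target, and windows are taken on demand rather than materialized as a dataset):

```lean
-- Illustrative only: the k-th training window of length seqLen + 1.
def windowAt (tokens : Array Nat) (seqLen k : Nat) : Array Nat :=
  tokens.extract k (k + seqLen + 1)

-- Number of full windows available in the token stream.
def numWindows (tokens : Array Nat) (seqLen : Nat) : Nat :=
  if tokens.size ≤ seqLen then 0 else tokens.size - seqLen

#eval windowAt #[1, 2, 3, 4, 5] 2 1  -- #[2, 3, 4]
```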

Load and tokenize the text corpus with GPT-2 BPE.

Verbose BPE loader used by this example so that long startup work is visible.

Print the first BPE training window as a sanity check.

Train the compact GPT-2-style model with the real GPT-2 BPE tokenizer.

This is deliberately a smoke-scale model: it exercises the 50,257-way tokenizer/vocabulary path and can overfit local windows, but it is far too small and too briefly trained to behave like pretrained GPT-2.