# GPU GPT-2 Corpus Trainer
This file trains GPT-2-style models from text in TorchLean.
The model is initialized inside TorchLean and trained by the TorchLean runtime. It does not load a pretrained PyTorch/Hugging Face checkpoint:
- reusable tokenization lives in `NN.API.Text` / `NN.API.Text.Bpe`,
- the compact GPT-2-style architecture lives in `NN.API.nn.models` (see `NN.API.Models.Gpt2`),
- this file is the runnable corpus trainer and enforces CUDA by default.
The default path is byte-level because it is compact and fast. Passing `--bpe-vocab` and
`--bpe-merges` switches to the Lean-native GPT-2 BPE tokenizer, which uses the standard 50,257-way
GPT-2 token vocabulary. That BPE path still trains from scratch in TorchLean; it does not load a
pretrained checkpoint.
Runner subcommand name. This subcommand trains a GPT-2-style model from scratch.
Minimum corpus size for the default public training path: 100 MiB.
Default byte-level context window for the CUDA corpus trainer.
Keeping this near the file top lets corpus validation and the model architecture agree without depending on declaration order.
Parsed local options for the corpus trainer.
- dataFile : System.FilePath
UTF-8 or raw-byte text corpus.
- train : API.Common.LoggedTrainFlags
Shared step count and TrainLog destination.
- finetuneFile? : Option System.FilePath
Optional second corpus for fine-tuning after the main corpus pass.
- finetuneSteps : ℕ
Number of optimizer steps on the fine-tuning corpus.
- logEvery : ℕ
Print loss every `logEvery` steps. `0` disables progress logging.
- allowSmallData : Bool
Allow small files for bounded local checks.
- bpeVocab? : Option System.FilePath
Optional GPT-2 `vocab.json` path. Supplying this plus `bpeMerges?` enables BPE mode.
- bpeMerges? : Option System.FilePath
Optional GPT-2 `merges.txt` path. Supplying this plus `bpeVocab?` enables BPE mode.
- prompt : String
Prompt used for the post-training generation probe.
- generate : ℕ
Number of autoregressive BPE tokens to generate in the post-training probe.
- interactive : Bool
Keep the trained CUDA model alive and read prompts from stdin.
Optional text-character cap for bounded BPE runs.
Parse options owned by this example; runtime flags are parsed by TorchLean.Module.run.
Force the runner into the intended CUDA configuration.
Users should not have to remember `--cuda --fast-kernels` for this example. We still reject
`--cpu` explicitly because silently switching to CPU would leave a large text-training run
looking hung rather than correctly configured.
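A minimal sketch of that policy in plain Lean, using a hypothetical `RuntimeFlags` record and function name in place of the module's real flag type and API:

```lean
/-- Hypothetical stand-in for the runner's flag record (sketch only). -/
structure RuntimeFlags where
  useCpu      : Bool := false
  useCuda     : Bool := false
  fastKernels : Bool := false

/-- Reject an explicit `--cpu`, otherwise force the CUDA/fast-kernel settings
so users never have to pass `--cuda --fast-kernels` themselves. -/
def forceCudaConfig (flags : RuntimeFlags) : Except String RuntimeFlags :=
  if flags.useCpu then
    .error "this trainer requires CUDA; remove --cpu"
  else
    .ok { flags with useCuda := true, fastKernels := true }
```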
Read the primary raw text corpus.
Build a supervised next-token sample from already-tokenized ids.
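As a rough sketch in plain Lean (arrays of ids rather than tensors, and an illustrative function name): the sample pairs a `ctx`-long window with the same window shifted one position forward.

```lean
/-- Sketch: input is `ids[start .. start+ctx)`, target is the same window
shifted by one, so each input position predicts the next token. Returns
`none` when the id stream is too short for a full window. -/
def nextTokenSample (ids : Array Nat) (start ctx : Nat) :
    Option (Array Nat × Array Nat) :=
  if start + ctx + 1 ≤ ids.size then
    some (ids.extract start (start + ctx),
          ids.extract (start + 1) (start + ctx + 1))
  else
    none
```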
Byte-level vocabulary: one token per byte.
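Concretely, byte-level tokenization is just the UTF-8 bytes of the text, so the output vocabulary has 256 entries. A plain-Lean sketch with illustrative names:

```lean
/-- One token per byte, so ids are always in `[0, 255]`. -/
def byteVocabSize : Nat := 256

/-- Byte-level tokenization: the UTF-8 bytes of the text as token ids. -/
def byteTokenize (text : String) : Array Nat :=
  text.toUTF8.data.map (fun b => b.toNat)

#eval byteTokenize "hi"  -- #[104, 105]
```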
Single-sequence batches keep the example small and fully interactive.
Interactive context window.
This shares the folder-level byte context constant so corpus validation, byte training, and BPE training use the same tensor layout. Larger windows require more allocator headroom, which is not something we should quietly make the default before allocator pressure is solved.
Per-head width of the small two-head Transformer.
Transformer embedding width.
Feed-forward hidden width.
Number of Transformer blocks.
Runnable byte-level GPT-style model for corpus pretraining/fine-tuning.
This is deliberately compact, but it has enough context to make the interactive prompt loop useful for quick local experiments.
Compact vocabulary used by the runnable BPE training path.
The tokenizer still uses GPT-2's real 50,257-token BPE files. For the small Lean/CUDA smoke model we project the corpus tokens into a local vocabulary of the first observed BPE ids. This keeps the example interactive while preserving the tokenizer/data path; a full 50k-way output head is a much larger training run.
Keep the BPE smoke model aligned with the small byte-level GPT-2 path.
Short context window used by the trainer.
Number of attention heads in the miniature BPE Transformer.
Per-head width. The model is intentionally compact even though the vocabulary is real GPT-2.
Transformer embedding width.
Feed-forward hidden width.
Number of Transformer blocks.
Compact GPT-2-style model with the real GPT-2 BPE vocabulary.
This is not OpenAI GPT-2-small. It is a TorchLean-native miniature Transformer whose input/output vocabulary matches GPT-2 BPE, so tokenizer/probing behavior is realistic while the model remains small enough for a local smoke run.
Local projection from original GPT-2 BPE ids to the compact working vocabulary.
Original GPT-2 id for each local id.
- toLocalMap : Std.HashMap ℕ ℕ
Reverse lookup from original GPT-2 id to local id.
Number of live entries in a local BPE projection.
Map an original GPT-2 BPE id into the compact local vocabulary, using local id 0 as OOV.
Map a compact local id back to its original GPT-2 BPE id.
Build the compact working vocabulary from corpus ids and prompt ids.
Apply a local BPE projection to an array of original GPT-2 ids.
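The whole projection can be sketched with core Lean and `Std.HashMap`. The structure mirrors the fields described above, but the type name, function names, and the way local id 0 is seeded are assumptions of this sketch, not the module's real definitions.

```lean
import Std.Data.HashMap

/-- Sketch of the local projection: `toOriginal` records the original GPT-2 id
for each local id, `toLocalMap` is the reverse lookup. -/
structure LocalBpeProj where
  toOriginal : Array Nat
  toLocalMap : Std.HashMap Nat Nat

/-- Build the working vocabulary from corpus and prompt ids in first-seen order.
Assumption in this sketch: local id 0 is seeded up front so it can act as the OOV slot. -/
def buildProj (corpusIds promptIds : Array Nat) : LocalBpeProj := Id.run do
  let mut toOriginal : Array Nat := #[0]
  let mut toLocal := (∅ : Std.HashMap Nat Nat).insert 0 0
  for id in corpusIds ++ promptIds do
    unless toLocal.contains id do
      toLocal := toLocal.insert id toOriginal.size
      toOriginal := toOriginal.push id
  return { toOriginal, toLocalMap := toLocal }

/-- Original GPT-2 id → local id, falling back to local id 0 for OOV ids. -/
def toLocalId (p : LocalBpeProj) (origId : Nat) : Nat :=
  (p.toLocalMap.get? origId).getD 0

/-- Local id → original GPT-2 id (0 when out of range). -/
def toOriginalId (p : LocalBpeProj) (localId : Nat) : Nat :=
  p.toOriginal.getD localId 0

/-- Apply the projection to a whole array of original GPT-2 ids. -/
def projectIds (p : LocalBpeProj) (ids : Array Nat) : Array Nat :=
  ids.map (toLocalId p)
```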
Build one BPE training sample from a tokenized corpus.
Turn a BPE prompt into one model input window.
Argmax token id at the final context position.
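Over a plain `Array Float` of final-position logits, that argmax is just the following (sketch with an illustrative name):

```lean
/-- Index of the largest logit; ties go to the earliest index, and an empty
array maps to 0. -/
def argmaxId (logits : Array Float) : Nat := Id.run do
  let mut best := 0
  for i in [1:logits.size] do
    if logits[i]! > logits[best]! then
      best := i
  return best

#eval argmaxId #[0.1, 2.5, -1.0, 2.5]  -- 1
```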
Print an argmax probe for a prompt under the BPE model.
Greedy BPE generation by repeatedly feeding the last `seqLen` tokens and appending the final-position
argmax. This is a compact diagnostic loop, not a high-quality sampler.
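A sketch of that loop in plain Lean, with a hypothetical `step` function standing in for the CUDA forward pass plus final-position argmax; only the window/append bookkeeping below is meant literally.

```lean
/-- Greedy generation sketch: keep feeding the last `seqLen` tokens to `step`
and append its argmax prediction, `steps` times. -/
def greedyGenerate (step : Array Nat → Nat) (seqLen steps : Nat)
    (prompt : Array Nat) : Array Nat := Id.run do
  let mut toks := prompt
  for _ in [0:steps] do
    -- the model only ever sees the trailing `seqLen` tokens
    let window := toks.extract (toks.size - min seqLen toks.size) toks.size
    toks := toks.push (step window)
  return toks
```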
Train the GPT-2-style model over a text corpus using CUDA.
This intentionally performs one optimizer step per corpus window, rather than materializing the entire dataset in memory. The example is still compact by GPT-2 standards, but the data path is real: file bytes → token windows → one-hot tensors → TorchLean CUDA training.
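The outer loop can be sketched as follows, with a hypothetical `trainStep` standing in for TorchLean's one-hot/CUDA optimizer step on a single (input, target) window; the window bookkeeping is the only part meant literally.

```lean
/-- Sketch: one optimizer step per corpus window, wrapping around instead of
materializing every sample up front. `trainStep` is a stand-in for the real
CUDA training step on one (input, target) pair. -/
def trainOverCorpus (trainStep : Array Nat → Array Nat → IO Unit)
    (ids : Array Nat) (ctx steps : Nat) : IO Unit := do
  let mut start := 0
  for _ in [0:steps] do
    if start + ctx + 1 > ids.size then
      start := 0  -- wrap around at the end of the corpus
    let input  := ids.extract start (start + ctx)
    let target := ids.extract (start + 1) (start + ctx + 1)
    trainStep input target
    start := start + ctx
```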
Load and tokenize the text corpus with GPT-2 BPE.
Verbose BPE loader used by this example so long startup work is visible.
Print the first BPE training window for sanity.
Train the compact GPT-2-style model with the real GPT-2 BPE tokenizer.
This is deliberately a smoke-scale model: it exercises the 50,257-way tokenizer/vocabulary path and can overfit local windows, but it is far too small and too briefly trained to behave like pretrained GPT-2.