GPU GPT-2 Corpus Trainer
This command trains GPT-2-style models from text in TorchLean.
The model is initialized inside TorchLean and trained by the TorchLean runtime. It does not load a pretrained PyTorch/Hugging Face checkpoint. The code is split as follows:
- reusable tokenization lives under TorchLean.text,
- the compact GPT-2-style architecture lives under TorchLean.nn.models,
- the runnable corpus trainer enforces CUDA by default.
The default path keeps the byte-level model compact so the corpus trainer is quick to run. Passing
--bpe-vocab and --bpe-merges switches to the Lean-native GPT-2 BPE tokenizer, using the standard
50,257-way GPT-2 token vocabulary. That BPE path still trains a randomly initialized model in
TorchLean; it does not load a pretrained checkpoint.
Runner subcommand name. This subcommand trains a randomly initialized GPT-2-style model.
Default JSON loss-curve path for this command.
Minimum corpus size for the default public training path: 100 MiB.
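As a rough illustration of the size gate, the sketch below checks a file against 100 MiB (100 × 1024 × 1024 = 104,857,600 bytes). It is plain Python, not TorchLean code; the names `MIN_CORPUS_BYTES` and `corpus_is_large_enough` are hypothetical, and comparing with "at least" rather than "strictly more than" is an assumption.

```python
import os

# 100 MiB expressed in bytes (binary mebibytes, not decimal megabytes).
MIN_CORPUS_BYTES = 100 * 1024 * 1024


def corpus_is_large_enough(path: str, min_bytes: int = MIN_CORPUS_BYTES) -> bool:
    """Return True when the file at `path` holds at least `min_bytes` bytes.

    Hypothetical helper: the real trainer's validation logic is not shown
    in this page, so the inclusive comparison is an assumption.
    """
    return os.path.getsize(path) >= min_bytes
```

Checking the size on disk before reading avoids loading an undersized corpus only to reject it afterwards.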
Default byte-level context window for the CUDA corpus trainer.
Keeping this near the file top lets corpus validation and the model architecture agree without depending on declaration order.
Read the primary raw text corpus.
Compact byte-level vocabulary for the default corpus path.
Single-sequence batch for the byte-level corpus path.
Interactive context window.
This shares the folder-level byte context constant so corpus validation, byte training, and BPE training use the same tensor layout. Larger windows require more allocator headroom, so they should not quietly become the default before allocator pressure is solved.
Number of attention heads in the compact byte-level Transformer.
Transformer embedding width.
Feed-forward hidden width.
Number of Transformer blocks.
Byte-level GPT configuration shared by shapes and the model constructor.
Input shape: byte-level one-hot token sequence.
Output shape: one byte-logit row per input position.
Runnable byte-level GPT-style model for corpus pretraining/fine-tuning.
The model is compact enough for the eager CUDA path while still exercising attention, feed-forward layers, byte tokenization, and the interactive prompt loop.
Build one byte-level training sample from a corpus byte offset.
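A minimal sketch of what building a sample from a byte offset can look like, written in plain Python rather than TorchLean. It assumes the usual next-token setup: the input is a window of `seq_len` bytes, one-hot encoded, and the target at each position is the following byte. The names `VOCAB`, `one_hot`, and `byte_sample` are hypothetical, and the vocabulary size of 256 is an assumption, since this page describes the byte vocabulary as compact without stating its size.

```python
from typing import List, Tuple

VOCAB = 256  # assumption: one class per possible byte value


def one_hot(index: int, size: int) -> List[float]:
    """Return a row of `size` zeros with a single 1.0 at `index`."""
    row = [0.0] * size
    row[index] = 1.0
    return row


def byte_sample(
    corpus: bytes, offset: int, seq_len: int
) -> Tuple[List[List[float]], List[int]]:
    """Build (one-hot inputs, next-byte targets) for one window.

    The window covers seq_len + 1 bytes so that every input position
    has a following byte to predict.
    """
    if offset < 0 or offset + seq_len + 1 > len(corpus):
        raise ValueError("window does not fit inside the corpus")
    window = corpus[offset : offset + seq_len + 1]
    inputs = [one_hot(b, VOCAB) for b in window[:-1]]
    targets = list(window[1:])
    return inputs, targets
```

The inputs have shape `seq_len × VOCAB`, matching the "one-hot token sequence" input shape, and the targets line up with the one-logit-row-per-position output.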
Build one byte-level prompt sample for before/after generation reports.
Greedy byte-level generation from the trained model.
Terminal prompt loop for the trained byte-level model.
Compact vocabulary used by the runnable BPE training path.
The tokenizer still uses GPT-2's real 50,257-token BPE files. For this Lean/CUDA model we project the corpus tokens into a local vocabulary of the first observed BPE ids. A full 50k-way output head is a much larger training run; this example focuses on the tokenizer/data path.
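The projection into a local vocabulary can be sketched as follows. This is plain Python under stated assumptions, not the TorchLean implementation: local ids are assigned in order of first appearance until a cap is reached, and tokens that fall outside the local vocabulary are simply skipped. The page does not say how the trainer treats such tokens, so the skipping behavior and the names `build_local_vocab`, `project`, and `restore` are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_local_vocab(
    token_ids: List[int], max_size: int
) -> Tuple[Dict[int, int], List[int]]:
    """Assign local ids 0..k-1 to the first `max_size` distinct ids seen.

    Returns (original -> local mapping, local -> original table).
    """
    to_local: Dict[int, int] = {}
    to_original: List[int] = []
    for tok in token_ids:
        if tok not in to_local and len(to_original) < max_size:
            to_local[tok] = len(to_original)
            to_original.append(tok)
    return to_local, to_original


def project(token_ids: List[int], to_local: Dict[int, int]) -> List[int]:
    """Map original ids to local ids, skipping ids outside the local vocab.

    Skipping is an assumption made for this sketch only.
    """
    return [to_local[t] for t in token_ids if t in to_local]


def restore(local_ids: List[int], to_original: List[int]) -> List[int]:
    """Map local ids back to original GPT-2 ids, e.g. before decoding."""
    return [to_original[i] for i in local_ids]
```

Keeping the inverse table alongside the forward mapping is what lets generated local ids be turned back into original GPT-2 ids and decoded by the unmodified tokenizer.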
Batch size for the BPE corpus path.
Short context window used by the trainer.
Number of attention heads in the miniature BPE Transformer.
Per-head width for the BPE Transformer.
Transformer embedding width.
Feed-forward hidden width.
Number of Transformer blocks.
BPE GPT configuration shared by shapes and the model constructor.
Input shape: local-BPE one-hot token batch.
Output shape: one local-BPE logit row per input position.
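One way to picture a configuration "shared by shapes and the model constructor" is a single immutable record from which both shapes are derived. The Python sketch below is illustrative only: the class `GPTConfig`, its field names, and every number in the usage line are hypothetical, and the relation embedding width = heads × per-head width is the usual multi-head attention convention rather than something this page states.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GPTConfig:
    """Hypothetical configuration record; field names are illustrative."""

    batch: int
    seq_len: int
    vocab: int
    heads: int
    head_dim: int
    ff_dim: int
    layers: int

    @property
    def d_model(self) -> int:
        # Assumption: embedding width is heads * per-head width.
        return self.heads * self.head_dim

    @property
    def input_shape(self) -> Tuple[int, int, int]:
        # One-hot token batch: one vocab-sized row per position.
        return (self.batch, self.seq_len, self.vocab)

    @property
    def output_shape(self) -> Tuple[int, int, int]:
        # One logit row per input position, over the same vocabulary.
        return (self.batch, self.seq_len, self.vocab)


# Usage with made-up sizes (not the trainer's real values).
cfg = GPTConfig(batch=2, seq_len=16, vocab=512, heads=4,
                head_dim=8, ff_dim=128, layers=2)
```

Deriving both shapes from one record means a change to the context window or vocabulary size cannot leave the input and output shapes out of step.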
Compact GPT-2-style model with the real GPT-2 BPE tokenizer path.
This is not OpenAI GPT-2-small. It is a TorchLean-native Transformer whose tokenizer comes from GPT-2 BPE files and whose output head uses a local projection of the observed corpus ids.
Build one BPE training sample from a tokenized corpus.
Turn a BPE prompt into one model input window.
Decode original GPT-2 BPE ids with the loaded tokenizer.
Decode local BPE ids by mapping them back to original GPT-2 ids first.
Print an argmax prediction report for a prompt under the BPE model.
Greedy BPE generation: repeatedly feed the last seqLen tokens to the model and append the final-position
argmax. Because no random sampling is involved, this is a deterministic decoding path for inspecting the trained next-token model.
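The greedy loop described here can be sketched in a few lines of plain Python. The model is represented by any callable that returns one row of scores per input position; `toy_model` is a hypothetical stand-in, not the TorchLean network. The sketch also lets the model accept windows shorter than `seq_len`, whereas a fixed-shape model would need padding, which is not covered here.

```python
from typing import Callable, List, Sequence

Logits = List[List[float]]  # one row of vocabulary scores per position


def argmax(row: Sequence[float]) -> int:
    """Index of the largest score; the first index wins on ties."""
    return max(range(len(row)), key=lambda i: row[i])


def greedy_generate(
    model: Callable[[List[int]], Logits],
    prompt: List[int],
    seq_len: int,
    steps: int,
) -> List[int]:
    """Append `steps` tokens, each the argmax at the final position."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    tokens = list(prompt)
    for _ in range(steps):
        window = tokens[-seq_len:]         # last seq_len tokens only
        logits = model(window)             # one score row per position
        tokens.append(argmax(logits[-1]))  # final-position prediction
    return tokens


def toy_model(window: List[int], vocab: int = 5) -> Logits:
    """Hypothetical stand-in: always predicts (token + 1) mod vocab."""
    return [[1.0 if j == (t + 1) % vocab else 0.0 for j in range(vocab)]
            for t in window]
```

With the toy model, `greedy_generate(toy_model, [3], seq_len=2, steps=4)` returns `[3, 4, 0, 1, 2]`, and repeated calls give the same result because nothing is sampled.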
Train the GPT-2-style model over a text corpus using CUDA.
This performs one optimizer step per corpus window, rather than materializing the entire dataset in memory. The example is compact by GPT-2 standards, but the data path is real: file bytes → token windows → one-hot tensors → TorchLean CUDA training.
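The "one optimizer step per window, without materializing the dataset" pattern can be sketched with a generator that reads the corpus incrementally. This is plain Python under stated assumptions: `iter_windows`, `train`, and the `step` callback are hypothetical names, the stride between windows is an assumption (the page does not say how windows are spaced), and the stream is assumed to return the requested number of bytes unless it has reached end of file, as buffered file objects and `io.BytesIO` do.

```python
import io
from typing import BinaryIO, Callable, Iterator


def iter_windows(stream: BinaryIO, seq_len: int, stride: int) -> Iterator[bytes]:
    """Yield windows of seq_len + 1 bytes, reading only what is needed.

    At most one window is held in memory at a time.
    """
    need = seq_len + 1
    if not 1 <= stride <= need:
        raise ValueError("stride must be between 1 and seq_len + 1")
    buf = b""
    while True:
        buf += stream.read(need - len(buf))  # top the buffer up to one window
        if len(buf) < need:                  # end of stream: drop the partial tail
            return
        yield buf
        buf = buf[stride:]                   # slide forward by `stride` bytes


def train(stream: BinaryIO, seq_len: int, stride: int,
          step: Callable[[bytes, bytes], None]) -> int:
    """Call `step(inputs, targets)` once per window; return the step count.

    `step` stands in for one optimizer update in the real trainer.
    """
    count = 0
    for window in iter_windows(stream, seq_len, stride):
        step(window[:-1], window[1:])  # inputs and next-byte targets
        count += 1
    return count
```

For example, `list(iter_windows(io.BytesIO(b"abcdefg"), seq_len=2, stride=2))` yields `[b"abc", b"cde", b"efg"]`; memory use stays bounded by the window size regardless of corpus length.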
Load and tokenize the text corpus with GPT-2 BPE.
Print the first BPE training window for inspecting tokenization and windowing.
Train the compact GPT-2-style model with the real GPT-2 BPE tokenizer.
This exercises the GPT-2 tokenizer/vocabulary path and can overfit local windows. It is not a pretrained GPT-2 checkpoint; it is a randomly initialized TorchLean model trained by this command.