GPT-2-Style Causal Language Model Example
Runnable torchlean gpt2 example. It builds a small GPT-2-style causal Transformer over
byte-level tokens, with optional real text input from Tiny Shakespeare (--tiny-shakespeare) or --data-file PATH.
If you are looking for the simplest "Karpathy-style single text file" path, start with
torchlean chargpt (character-level tokenizer). This gpt2 example is byte-level and is meant to
show the Transformer block wiring and the save/reload loop.
To download the corpus and run a short CUDA training session:

```bash
python3 scripts/datasets/download_example_data.py --tiny-shakespeare
lake build -R -K cuda=true && lake exe torchlean gpt2 --cuda --tiny-shakespeare --steps 100
```
Small batch size. The executable intentionally overfits a small real-text slice rather than attempting a full pretraining run: the point is to show that the full TorchLean stack can run a causal Transformer, update parameters, and decode logits back to text.
Prompt/target window length.
Sixty-four byte tokens is still small enough for local eager-CUDA runs, but it gives the miniature Transformer enough local context to learn short names, line breaks, speaker prefixes, and a little phrase structure in Tiny Shakespeare. Shorter windows are useful for parser/kernel checks but underrepresent the model stack during text generation.
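To make the window shape concrete, here is a hand-rolled sketch that cuts a token stream into the (seqLen + 1)-token windows the batch builder below expects. One simple strategy (the example may stride differently) is non-overlapping windows; the function name is illustrative, not part of the example's API.

```lean
-- Illustrative only: slice a byte-token stream into (seqLen + 1)-token
-- windows, so each window yields seqLen next-token training positions.
def cutWindows (toks : Array Nat) (seqLen : Nat) : List (Array Nat) :=
  let win := seqLen + 1
  (List.range (toks.size / win)).map fun i =>
    toks.extract (i * win) ((i + 1) * win)
```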
Byte-level vocabulary size. Each UTF-8 byte is one token.
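Encoding is plain UTF-8 byte extraction, which Lean's standard library already provides; a minimal round-trip sketch (helper names are illustrative, not the example's actual tokenizer):

```lean
-- Byte-level "tokenizer": every UTF-8 byte becomes one token id in [0, 255].
def encodeBytes (s : String) : Array Nat :=
  (s.toUTF8.toList.map (·.toNat)).toArray

def decodeTokens (ids : Array Nat) : Option String :=
  -- fromUTF8? rejects byte sequences that are not valid UTF-8.
  String.fromUTF8? (ByteArray.mk (ids.map (UInt8.ofNat ·)))

#eval encodeBytes "Lear"  -- #[76, 101, 97, 114]
```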
Number of attention heads in the miniature Transformer block.
Per-head embedding width. The model dimension is numHeads * headDim.
We keep the default small so the tutorial finishes locally. A wider dModel = 64 variant runs, but
in the current eager-CUDA training loop it is slower and did not improve the 2k-step Shakespeare
sample enough to justify making it the default. Use this file to inspect Transformer/autograd
behavior; use the Mamba example when the goal is the cleanest compact text sample.
Hidden width of the feed-forward sublayer.
Number of Transformer encoder blocks.
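Gathering the constants above into one place, a hedged shape summary: only the byte-level vocabulary (256) and the 64-token window are pinned down by this documentation, and the record and field names are illustrative rather than the example's actual identifiers.

```lean
-- Hypothetical shape record for the miniature model; the defaults for
-- heads, widths, and depth live in the constants documented above.
structure MiniGpt2Shape where
  vocabSize : Nat := 256   -- one token id per UTF-8 byte
  seqLen    : Nat := 64    -- prompt/target window length
  numHeads  : Nat          -- attention heads per block
  headDim   : Nat          -- per-head embedding width
  ffnHidden : Nat          -- feed-forward hidden width
  numLayers : Nat          -- Transformer blocks

-- The model dimension is derived, never set directly.
def MiniGpt2Shape.dModel (s : MiniGpt2Shape) : Nat := s.numHeads * s.headDim
```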
Conventional local path for the Tiny Shakespeare text corpus.
Conventional local path for the TinyStories validation slice.
Shared data-preparation hint for the GPT text examples.
Build a batch sample from per-row token windows.
idsByBatch[i] is the (seqLen + 1)-token window for batch row i. If fewer than batch windows
are provided we repeat the last one; callers should normally pass exactly batch windows.
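A minimal sketch of that padding rule over plain Lean lists (the real helper assembles a torchlean batch tensor):

```lean
-- Illustrative: truncate to `batch` rows, or repeat the last window when
-- fewer than `batch` rows were supplied; empty input yields an empty batch.
def padRows (rows : List (Array Nat)) (batch : Nat) : List (Array Nat) :=
  match rows.getLast? with
  | none => []
  | some last => rows.take batch ++ List.replicate (batch - rows.length) last
```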
Build one next-token-prediction sample from text.
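Given a (seqLen + 1)-token window, the inputs are its first seqLen tokens and the targets are the same window shifted left by one; a sketch with illustrative names:

```lean
-- Illustrative next-token split: targets[i] is the byte that should
-- follow inputs[i] in the original text.
def nextTokenPair (window : Array Nat) : Array Nat × Array Nat :=
  (window.extract 0 (window.size - 1), window.extract 1 window.size)
```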
- base : API.Common.ModelTrainFlags (shared model-training flags)
- windows : ℕ
- prompt : String (seed text for post-training generation)
- generate : ℕ (number of tokens to sample after training)
- temperature : Float (sampling temperature; see the sketch after this list)
- topK : ℕ (keep only the k largest logits when sampling)
- repeatPenalty : Float (strength of the decoding-side repetition penalty)
- repeatWindow : ℕ (how many recently emitted tokens the penalty considers)
- seed : ℕ (RNG seed for sampling)
- asciiOnly : Bool (restrict sampled output to ASCII bytes)
- interactive : Bool (run the diagnostic prompt loop after training)
- loadParams? : Option System.FilePath (load model parameters from this file before training)
- saveParams? : Option System.FilePath (save trained parameters to this file)
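For orientation, temperature and topK correspond to the standard sampling transform: scale the logits by 1 / temperature, then mask everything below the k-th largest value. A hand-rolled sketch over plain Float arrays, assuming temperature > 0 and k ≥ 1; this is illustrative, not the torchlean sampler.

```lean
-- Illustrative temperature + top-k filtering; masked entries get a very
-- negative logit so a subsequent softmax effectively ignores them.
def applyTempTopK (logits : Array Float) (temperature : Float) (k : Nat) :
    Array Float :=
  let scaled := logits.map (· / temperature)
  let sorted := scaled.qsort (· > ·)          -- descending order
  let thresh := if h : k - 1 < sorted.size then sorted[k - 1] else -1.0e308
  scaled.map fun l => if l < thresh then -1.0e308 else l
```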
Print a compact before/after language-model probe for the first batch row.
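The probe boils down to an argmax over the vocabulary at each position, followed by byte decoding; a self-contained sketch with illustrative names (the real probe reads torchlean logits):

```lean
-- Illustrative per-position argmax over a seqLen × vocab logit table,
-- then decoding the winning byte ids back to text.
def argmaxRow (row : Array Float) : Nat :=
  let pick := fun (acc : Nat × Float × Nat) (v : Float) =>
    let (best, bestVal, i) := acc
    if v > bestVal then (i, v, i + 1) else (best, bestVal, i + 1)
  (row.foldl pick (0, (-1.0e308 : Float), 0)).1

def probeText (logits : Array (Array Float)) : String :=
  -- fromUTF8! panics on invalid UTF-8; acceptable for a diagnostic probe.
  String.fromUTF8! (ByteArray.mk ((logits.map argmaxRow).map (UInt8.ofNat ·)))
```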
Apply a lightweight repetition penalty during decoding.
This is intentionally a generation-side control, not a training shortcut. The compact GPT-2-style model can
learn the local next-token objective yet still fall into byte-level loops such as "oooooo"; reducing
the logits of recently emitted bytes makes the sampled text reflect more of the learned
distribution instead of the first local attractor it finds.
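A sketch of the penalty itself, in the style used by common implementations (shrink positive logits, push negative ones further down); the name and plain-array representation are illustrative, and a caller would pass roughly the last repeatWindow emitted ids as recent:

```lean
-- Illustrative repetition penalty: every token id emitted within the
-- repeat window gets its logit pushed down before sampling.
def penalizeRepeats (logits : Array Float) (recent : List Nat)
    (penalty : Float) : Array Float :=
  recent.foldl (init := logits) fun acc tok =>
    if h : tok < acc.size then
      let l := acc[tok]
      acc.set! tok (if l > 0 then l / penalty else l * penalty)
    else acc
```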
Compact interactive prompt loop for the in-memory Float model.
This is a diagnostic REPL, not pretrained text generation. Each line is interpreted as one causal LM window, and the model prints the per-position argmax prediction for that window.
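A self-contained sketch of the loop's shape, assuming a hypothetical predictWindow callback that returns per-position argmax token ids; the real loop drives the in-memory Float model.

```lean
-- Illustrative REPL skeleton: read a line, treat its UTF-8 bytes as one
-- causal LM window, print the decoded per-position predictions.
partial def replLoop (predictWindow : Array Nat → IO (Array Nat)) : IO Unit := do
  IO.print "> "
  (← IO.getStdout).flush
  let line := (← (← IO.getStdin).getLine).trim
  if line.isEmpty then return   -- empty line exits the loop
  let window := (line.toUTF8.toList.map (·.toNat)).toArray
  let pred ← predictWindow window
  -- fromUTF8! panics on invalid UTF-8; acceptable for a diagnostic loop.
  IO.println (String.fromUTF8! (ByteArray.mk (pred.map (UInt8.ofNat ·))))
  replLoop predictWindow
```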
Float-specialized training path with decoded prediction probes.
The CUDA executable uses Lean Float tensors, so this branch can show actual prompt,
target, and predicted text before and after training. The polymorphic path above is still used for
non-Float dtype smoke runs.