Skip to the content.

This page follows the text-model examples from corpus to continuation. The models are small, but the workflow is complete: read a text file, build next-token training examples, run a training step, save parameters, reload them, and sample from the saved model.

The useful part is visibility. Tokenization, sequence length, causal windows, parameter shapes, generation settings, and saved logs all appear as artifacts the reader can inspect.

Data: One Explicit Text File

The text examples use a single UTF-8 text file. In the default setup, the download script places Tiny Shakespeare under data/real/text/, and the Lean example fails loudly if the file is missing.

Run the data step once:

python3 scripts/datasets/download_example_data.py --tiny-shakespeare

The function that turns “flags and paths” into an actual corpus is shared across examples. In the GPT-2 example, takeInputText uses text.Corpus.takeUtf8Input to support a “use this known corpus” flag or an explicit --data-file path.

Tokenization: Bytes on Purpose

These examples tokenize bytes directly. Every UTF-8 byte is a token, so the vocabulary is fixed and tiny (256). That choice is practical:

In code, this shows up as:

def vocab : Nat := text.Tokenizer.byte.vocabSize

That line is from NN.Examples.Models.Sequence.Gpt2. It means the tutorial model is not silently depending on a tokenizer JSON, BPE merge table, or Hugging Face cache entry. The byte tokenizer is small enough that the input and target tensors can be written down directly.

Supervised Examples: Next-Token Prediction As Tensors

The training data is represented directly as typed supervised samples. The examples build explicit SupervisedSample values whose shapes say what they are.

For GPT-2, the sample is a one-hot matrix for causal language modeling:

abbrev σ : Shape := shape![batch, seqLen, vocab]
abbrev τ : Shape := σ

The function Data.causalLmOneHotSample converts a (seqLen + 1) token window into (x, y): input tokens and the same window shifted by one position as the target. The trainer consumes those typed tensors directly.

The Lean sample constructor is explicit:

def mkSampleFromTokenIds (toks : List Nat) : SupervisedSample Float σ τ :=
  Data.causalLmOneHotSample (α := Float) batch seqLen vocab toks (padId := 32)

This is why the example is useful as a tutorial: there is no hidden dataloader convention. A supervised example is literally a pair of typed tensors.

GPT-2: A Small Causal Transformer

NN.Examples.Models.Sequence.Gpt2 wires up a miniature causal transformer from reusable layers (nn.models.CausalTransformerOneHot). The default configuration is compact enough for a local run, but it is still large enough to learn short structure from Tiny Shakespeare.

The model declaration is a normal Lean value:

def model : nn.M (nn.Sequential σ τ) :=
  nn.models.CausalTransformerOneHot
    { batch := batch
      seqLen := seqLen
      vocab := vocab
      numHeads := numHeads
      headDim := headDim
      ffnHidden := ffnHidden
      layers := layers }

The training loop stays on the same public surface used by the simpler quickstarts:

let trainer := Trainer.new model <|
  Trainer.Config.fromRunConfig run (.crossEntropy)
let trained ← trainer.train data train.options probes
trained.printSummary

The surrounding code builds a bank of token windows from the corpus, reports before/after predictions, saves checkpoints when requested, and samples text from the trained prediction closure.

Sampling is also explicit: the example computes logits, applies temperature and top-k filtering, and chooses the next token. If you’ve ever written a small “Karpathy-style” sampler, the code will look familiar.

Try the short CUDA run:

lake exe -K cuda=true torchlean gpt2 --cuda --fast-kernels --tiny-shakespeare \
  --steps 300 --windows 32 --lr 0.001 --prompt "ROMEO:" --generate 220 \
  --temperature 0.85 --top-k 24 --repeat-penalty 1.25 --repeat-window 24 \
  --sample-seed 11 --log data/model_zoo/gpt2_trainlog.json

For a tiny runtime check, keep the same CUDA path and shrink the workload:

lake exe -K cuda=true torchlean gpt2 --cuda --tiny-shakespeare --steps 1 --windows 1 --generate 0

Saving and Reloading Parameters

TorchLean’s checkpoint format is kept simple: a model’s parameters are a shape-indexed pack, and the save/load operations round-trip exact IEEE-754 bit patterns through JSON.

That is why gpt2_saved is a separate example: it loads a parameter pack, checks that the shapes match the model architecture, and runs sampling without touching an optimizer.

The GPT-2 command exposes this through --save-params; the saved-parameter example reloads the same shape-indexed parameter pack before sampling. If the shape list no longer matches the model, loading fails before the weights are used.

Mamba: State-Space Text In The Same Runtime

NN.Examples.Models.Sequence.Mamba also runs on byte tokens, but swaps attention for a compact state-space block. It is useful as a contrast: sequence modeling without the quadratic attention path, using the same autograd and the same logging style.

That contrast is useful because it shows that the text pipeline is not tied to attention.

Algorithmically, the example replaces “attend over all previous tokens” with a learned recurrent state update. Each token updates a compact state, and the model projects the resulting sequence states to next-token logits. The surrounding training interface stays the same: corpus windows in, logits out, cross-entropy loss, optimizer step, JSON log.

The Mamba example has the same tutorial shape:

abbrev σ : Shape := nn.models.mambaTokenMat cfg seqLen
abbrev τ : Shape := nn.models.mambaLogitMat cfg seqLen

def model : nn.M (nn.Sequential σ τ) :=
  nn.models.MambaTextLM cfg seqLen

and the training body is still an ordinary trainer call:

let trainer := Trainer.new model <|
  Trainer.Config.fromRunConfig run (.crossEntropy)
let trained ← trainer.train data train.options probes
trained.printSummary

Run it with:

lake exe -K cuda=true torchlean mamba --cuda --fast-kernels --tiny-shakespeare \
  --steps 1 --windows 1 --generate 0

Inspecting Runs In The Lean Editor

For interactive inspection, open NN.Examples.Models.Sequence.Gpt2 or NN.Examples.Models.Sequence.Mamba in VS Code with the Lean Infoview enabled. The relevant widgets can display saved training logs, tensor summaries, inferred shapes, and debug traces next to the Lean source.

A concrete starting point is the training-log widget documented near the top of NN.Examples.Models.Sequence.Gpt2; it renders a saved JSON loss log in the Infoview.

Source entry points: