GPT-2 Byte-Pair Encoding #
Lean-native support for GPT-2-style byte-level BPE tokenizers.
This module is intentionally in NN.API.Text rather than a model file: any Transformer, diffusion
LM, or verifier that wants GPT-2-compatible tokenization should share the same implementation.
The implementation parses the standard vocab.json and merges.txt files directly in Lean.
The pre-tokenizer implements the GPT-2 regex shape:
's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
The Unicode \p{L}, \p{N}, and \s predicates are supplied by NN.API.Text.Unicode, rather
than Lean's ASCII-oriented Char.isAlpha / Char.isDigit helpers.
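For orientation, the sketch below lists one sample fragment per alternative of that regex, in the order the branches are tried; the values are illustrative and not produced by the module.

/-- One sample fragment per alternative of the GPT-2 pre-tokenizer regex,
in the order the branches are tried (illustrative only). -/
def sampleFragments : List String :=
  [ "'re"      -- contractions: 's|'t|'re|'ve|'m|'ll|'d
  , " Hello"   -- ` ?\p{L}+`: optional leading space, then a letter run
  , " 2024"    -- ` ?\p{N}+`: optional leading space, then a number run
  , " !?"      -- ` ?[^\s\p{L}\p{N}]+`: optional space, then an "other" run
  , "  "       -- `\s+(?!\S)`: whitespace not followed by non-space
  , "\n\n" ]   -- `\s+`: any remaining whitespace run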
Data #
One token-to-id entry from GPT-2's vocab.json.
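Roughly, an entry pairs the escaped token text with its numeric id; a minimal sketch with assumed field names (the module's actual fields may differ):

structure VocabEntrySketch where
  token : String  -- token text in byte-to-unicode escaped form, e.g. "Ġthe"
  id    : Nat     -- numeric id from vocab.json
  deriving Repr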
Loaded GPT-2 BPE tokenizer.
- vocab : Array VocabEntry
  Token vocabulary as loaded from vocab.json.
- merges : Array (String × String)
  Ranked merge table from merges.txt.
- vocabMap : Std.HashMap String ℕ
  Fast token-to-id lookup derived from vocab.
- idMap : Std.HashMap ℕ String
  Fast id-to-token lookup derived from vocab.
- mergeMap : Std.HashMap (String × String) ℕ
  Fast pair-to-rank lookup derived from merges.
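A sketch of how the derived lookup tables can be built from the raw data; the helper below takes plain (token, id) pairs to stay self-contained, and its name is hypothetical rather than the module's actual constructor.

import Std.Data.HashMap

/-- Sketch: build the derived lookup tables from the raw vocab and merge arrays.
A pair's merge rank is its line position in `merges.txt`; lower ranks merge first. -/
def buildLookups (vocab : Array (String × Nat)) (merges : Array (String × String)) :
    Std.HashMap String Nat × Std.HashMap Nat String × Std.HashMap (String × String) Nat :=
  let vocabMap : Std.HashMap String Nat :=
    vocab.foldl (fun m e => m.insert e.1 e.2) ∅
  let idMap : Std.HashMap Nat String :=
    vocab.foldl (fun m e => m.insert e.2 e.1) ∅
  let mergeMap : Std.HashMap (String × String) Nat :=
    (merges.foldl (fun acc p => (acc.1.insert p acc.2, acc.2 + 1))
      ((∅ : Std.HashMap (String × String) Nat), 0)).1
  (vocabMap, idMap, mergeMap)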
Byte Escaping #
GPT-2 byte-to-unicode escape for one byte.
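A sketch of the standard GPT-2 mapping under hypothetical names (the module's own definition is not shown here): printable bytes map to themselves, and the remaining bytes are assigned consecutive code points starting at U+0100.

/-- Whether a byte maps to itself under the GPT-2 escape: the printable ASCII
range `!`..`~` plus the Latin-1 ranges `¡`..`¬` and `®`..`ÿ`. -/
def isDirectByte (n : Nat) : Bool :=
  decide ((33 ≤ n ∧ n ≤ 126) ∨ (161 ≤ n ∧ n ≤ 172) ∨ (174 ≤ n ∧ n ≤ 255))

/-- Sketch of the single-byte GPT-2 escape: direct bytes keep their code point,
the remaining bytes get `256 + k`, where `k` counts the non-direct bytes below. -/
def byteToCharSketch (b : UInt8) : Char :=
  let n := b.toNat
  if isDirectByte n then
    Char.ofNat n
  else
    let k := ((List.range n).filter (fun m => !isDirectByte m)).length
    Char.ofNat (256 + k)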
Inverse of byteToChar, used when decoding BPE token strings back to UTF-8.
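Continuing the sketch above, the inverse can simply search the 256 candidate bytes; the module's implementation presumably uses a precomputed table, so this is only illustrative.

/-- Sketch of the inverse escape: find the byte whose escape is `c`, if any. -/
def charToByteSketch (c : Char) : Option UInt8 :=
  (List.range 256).findSome? fun n =>
    if byteToCharSketch (UInt8.ofNat n) = c then some (UInt8.ofNat n) else none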
Reversible GPT-2 byte-to-unicode escape for a string fragment.
Decode GPT-2 byte-to-unicode escaped text back into a UTF-8 string.
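Putting the single-byte sketches together: fragment-level escaping maps every UTF-8 byte of the fragment through the escape, and decoding maps each character back and reassembles the bytes. Again only a sketch, reusing the hypothetical helpers above.

/-- Sketch: escape a fragment by mapping each of its UTF-8 bytes. -/
def escapeSketch (s : String) : String :=
  String.mk (s.toUTF8.toList.map byteToCharSketch)

/-- Sketch: decode an escaped fragment back to UTF-8 text. Characters outside
the escape alphabet are dropped; invalid UTF-8 yields the empty string. -/
def decodeSketch (s : String) : String :=
  let bytes := s.toList.filterMap charToByteSketch
  (String.fromUTF8? (ByteArray.mk bytes.toArray)).getD ""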
Pre-tokenization #
- letter : RegexClass
- number : RegexClass
- other : RegexClass
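The letter, number, and other cases line up with the \p{L}, \p{N}, and catch-all alternatives of the regex. A minimal classification sketch, with the Unicode predicates passed in as parameters rather than assuming the exact names exported by NN.API.Text.Unicode:

inductive CharClassSketch
  | letter | number | other
  deriving Repr, DecidableEq

/-- Sketch: classify one character given `\p{L}` and `\p{N}` predicates. -/
def classifyChar (isLetter isNumber : Char → Bool) (c : Char) : CharClassSketch :=
  if isLetter c then .letter
  else if isNumber c then .number
  else .other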
Consume the GPT-2 branch \s+(?!\S).
Python's regex engine greedily takes the whole whitespace run but backtracks by one character when needed, so that the negative lookahead sees either end-of-input or another whitespace character. For a whitespace run immediately before a non-space token, this branch therefore consumes all but the final whitespace character; a final ASCII space can then attach to the following letter/number/punctuation branch, reproducing GPT-2's standard token boundaries.
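For example, pre-tokenizing "a   b" (three interior spaces) is expected to yield the fragments below: the lookahead branch takes only the first two spaces, and the last space attaches to the letter branch as " b". Illustrative expected output with a hypothetical definition name.

/-- Expected GPT-2 pre-token fragments for "a   b" (three interior spaces). -/
def lookaheadExample : List String := ["a", "  ", " b"]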
Fuel-bounded worker for GPT-2 regex pre-token fragments before byte escaping and BPE merges.
The branch order mirrors GPT-2's tokenizer regex exactly: contractions, optional-space letter runs,
optional-space number runs, optional-space non-space/non-letter/non-number runs, whitespace not
followed by non-space, and finally a plain whitespace run. The fuel argument keeps this definition
total and should never be exhausted when called by pretokenize.
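The fuel idiom looks roughly like the toy worker below: recursion is structural in the fuel argument, so totality is immediate, and a caller that supplies at least one unit of fuel per character never reaches the zero case. The toy version emits one-character fragments; the real worker consumes a whole regex-branch match per step.

/-- Toy fuel-bounded splitter illustrating the totality pattern. -/
def toyWorker (fuel : Nat) (cs : List Char) : List String :=
  match fuel, cs with
  | 0, _ => []                        -- fuel exhausted: give up (never hit in practice)
  | _, [] => []                       -- input consumed: done
  | fuel + 1, c :: rest => String.mk [c] :: toyWorker fuel rest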
Split a string into GPT-2-style pre-token fragments.
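A usage sketch, assuming pretokenize maps a String to its fragments in order (its exact signature is not shown on this page); for a short sentence the expected GPT-2-style fragments are:

/-- Expected fragments for `pretokenize "Hello, world! I'm 42."` (sketch). -/
def expectedFragments : List String :=
  ["Hello", ",", " world", "!", " I", "'m", " 42", "."]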
BPE Merging #
Adapt a loaded GPT-2 BPE tokenizer to the generic text-tokenizer interface.
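The merge step behind this adapter is the standard greedy BPE loop: repeatedly find the adjacent pair with the lowest rank and fuse every occurrence of it. A self-contained sketch of that loop (not the module's actual routine), driven by a pair-to-rank map like mergeMap:

import Std.Data.HashMap

/-- Sketch of the greedy BPE loop: merge the lowest-ranked adjacent pair until
no adjacent pair has a rank. Marked `partial`; each pass shortens the list. -/
partial def bpeMergeSketch (mergeMap : Std.HashMap (String × String) Nat)
    (symbols : List String) : List String :=
  let pairs := symbols.zip symbols.tail
  let ranked := pairs.filterMap fun p => (mergeMap.get? p).map fun r => (p, r)
  match ranked with
  | [] => symbols
  | r :: rs =>
    -- pick the adjacent pair with the smallest merge rank
    let best := rs.foldl (fun acc x => if x.2 < acc.2 then x else acc) r
    -- fuse every occurrence of that pair, scanning left to right
    let rec fuse : List String → List String
      | a :: b :: rest =>
          if (a, b) = best.1 then (a ++ b) :: fuse rest else a :: fuse (b :: rest)
      | xs => xs
    bpeMergeSketch mergeMap (fuse symbols)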
File Loading #
Parse GPT-2 vocab.json as an array of (token, id) entries.
The standard GPT-2 vocab.json is a single flat JSON object from token strings to numeric ids.
Using Lean's fully general JSON object parser is convenient but slow for interactive examples
because it builds a 50k-entry tree before we immediately flatten it again. The small parser below
recognizes exactly the JSON shape used by GPT-2 vocab files and decodes JSON string escapes,
including \uXXXX escapes for byte-to-unicode code points.
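As an illustration of the escape handling, here is a sketch of decoding one 4-digit \uXXXX escape; the helper and its name are hypothetical, and the module's parser handles the full JSON string grammar.

/-- Sketch: decode the four hex digits of a JSON `\uXXXX` escape into a `Char`.
GPT-2 vocab escapes stay below the surrogate range, so no pair handling here. -/
def decodeU4 (h1 h2 h3 h4 : Char) : Option Char := do
  let hex (c : Char) : Option Nat :=
    if c.isDigit then some (c.toNat - '0'.toNat)
    else if 'a' ≤ c ∧ c ≤ 'f' then some (c.toNat - 'a'.toNat + 10)
    else if 'A' ≤ c ∧ c ≤ 'F' then some (c.toNat - 'A'.toNat + 10)
    else none
  let n := (← hex h1) * 4096 + (← hex h2) * 256 + (← hex h3) * 16 + (← hex h4)
  if n < 0xD800 then some (Char.ofNat n) else none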
Parse GPT-2 vocab.json directly from text.
Parse GPT-2 merges.txt. Invalid non-comment lines are ignored conservatively.
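A sketch of the per-line handling (hypothetical helper, not the module's actual parser): comment lines and anything that does not split into exactly two non-empty pieces are dropped.

/-- Sketch: parse one `merges.txt` line into a merge pair, if it is valid. -/
def parseMergeLineSketch (line : String) : Option (String × String) :=
  if line.startsWith "#" then none   -- e.g. the leading "#version: 0.2" comment
  else
    match line.trim.splitOn " " with
    | [a, b] => if a.isEmpty || b.isEmpty then none else some (a, b)
    | _ => none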
Load GPT-2 BPE files directly in Lean.
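On the I/O side, loading amounts to reading both files and handing the text to the pure parsers described above; a sketch with placeholder names:

/-- Sketch: read `vocab.json` and `merges.txt` as text; the pure parsers would
then run on the returned strings. -/
def readBpeFiles (vocabPath mergesPath : System.FilePath) : IO (String × String) := do
  let vocabText  ← IO.FS.readFile vocabPath
  let mergesText ← IO.FS.readFile mergesPath
  return (vocabText, mergesText)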