GPT-2 Byte-Pair Encoding #
Lean-native support for GPT-2-style byte-level BPE tokenizers.
This module lives in NN.API.Text rather than a model file: any Transformer, diffusion LM, or
verifier that wants GPT-2-compatible tokenization should share the same implementation.
The implementation parses the standard vocab.json and merges.txt files directly in Lean.
The pre-tokenizer implements the GPT-2 regex shape:
's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
The Unicode \p{L}, \p{N}, and \s predicates are supplied by NN.API.Text.Unicode, rather
than Lean's ASCII-oriented Char.isAlpha / Char.isDigit helpers.
Data #
One token-to-id entry from GPT-2's vocab.json.
Instances For
Instances For
Instances For
Loaded GPT-2 BPE tokenizer.
- vocab : Array VocabEntry
Token vocabulary as loaded from
vocab.json. Ranked merge table from
merges.txt.- vocabMap : Std.HashMap String ℕ
Fast token-to-id lookup derived from
vocab. - idMap : Std.HashMap ℕ String
Fast id-to-token lookup derived from
vocab. - mergeMap : Std.HashMap (String × String) ℕ
Fast pair-to-rank lookup derived from
merges.
Instances For
Byte Escaping #
Bytes that GPT-2 leaves at their visible Unicode code points.
Instances For
First Latin-1 byte range kept visible by GPT-2 byte escaping.
Instances For
Second Latin-1 byte range kept visible by GPT-2 byte escaping.
Instances For
Bytes that do not need synthetic code points in GPT-2 byte escaping.
Instances For
GPT-2 byte-to-Unicode code-point table for all 256 byte values.
Instances For
GPT-2 byte-to-unicode escape for one byte.
Instances For
Inverse of byteToChar, used when decoding BPE token strings back to UTF-8.
Instances For
Reversible GPT-2 byte-to-unicode escape for a string fragment.
Instances For
Decode GPT-2 byte-to-unicode escaped text back into a UTF-8 string.
Instances For
Pre-tokenization #
Character classes used by the GPT-2 pre-tokenizer branches.
- letter : RegexClass
- number : RegexClass
- other : RegexClass
Instances For
Character predicate for GPT-2's non-whitespace, non-letter, non-number regex branch.
Instances For
Test whether a character belongs to one of the GPT-2 regex classes.
Instances For
Consume one GPT-2 letter/number/other run, allowing a leading ASCII space.
Instances For
Consume the GPT-2 branch \s+(?!\S).
Python's regex engine greedily takes a whitespace run but may backtrack so the negative lookahead sees either end-of-input or another whitespace character. For a whitespace run before a non-space token, this consumes all but the final whitespace; the final ASCII space can then attach to the next letter/number/punctuation branch, matching GPT-2's standard token boundaries.
Instances For
Fuel-bounded worker for GPT-2 regex pre-token fragments before byte escaping and BPE merges.
The branch order mirrors GPT-2's tokenizer regex exactly: contractions, optional-space letter runs,
optional-space number runs, optional-space non-space/non-letter/non-number runs, whitespace not
followed by non-space, and finally a plain whitespace run. The fuel argument keeps this definition
total; pretokenize supplies enough fuel for the whole input.
Instances For
Split a string into GPT-2-style pre-token fragments.
Instances For
BPE Merging #
Adapt a loaded GPT-2 BPE tokenizer to the generic text-tokenizer interface.
Instances For
File Loading #
Parse GPT-2 vocab.json as an array of (token, id) entries.
Instances For
The standard GPT-2 vocab.json is a single flat JSON object from token strings to numeric ids.
Using Lean's fully general JSON object parser is convenient but slow for interactive examples
because it builds a 50k-entry tree before we immediately flatten it again. The small parser below
recognizes exactly the JSON shape used by GPT-2 vocab files and decodes JSON string escapes,
including \uXXXX escapes for byte-to-unicode code points.
Interpret one hexadecimal digit from a JSON unicode escape.
Instances For
Combine a JSON UTF-16 surrogate pair into one Unicode code point.
Instances For
Fuel-bounded loop for the specialized GPT-2 vocab.json object parser.
Instances For
Parse GPT-2 vocab.json directly from text.
Instances For
Build the token-to-id lookup table stored in a loaded GPT-2 BPE tokenizer.
Instances For
Build the id-to-token lookup table stored in a loaded GPT-2 BPE tokenizer.
Instances For
Build the pair-to-rank lookup table stored in a loaded GPT-2 BPE tokenizer.
Instances For
Assemble a tokenizer and its lookup maps from parsed GPT-2 vocabulary and merge tables.
Instances For
Parse GPT-2 merges.txt. Invalid non-comment lines are ignored conservatively.
Instances For
Load GPT-2 BPE files directly in Lean.
Instances For
Load GPT-2 BPE files while printing progress for larger vocab.json / merges.txt assets.
Instances For
Compact vocabulary projections #
Projection from original tokenizer ids to a compact working vocabulary.
Small runnable language-model examples can keep a real tokenizer while restricting the model head to ids observed in the local corpus or prompt.
Original tokenizer id for each local id.
- toLocalMap : Std.HashMap ℕ ℕ
Reverse lookup from original tokenizer id to local id.
Instances For
Number of live entries in a compact BPE projection.
Instances For
Map an original tokenizer id into the compact local vocabulary, using local id 0 as OOV.
Instances For
Map a compact local id back to its original tokenizer id.
Instances For
Build a compact working vocabulary from corpus ids and prompt ids.
Instances For
Apply a compact BPE projection to an array of original tokenizer ids.