TorchLean API

NN.API.Text.Bpe

GPT-2 Byte-Pair Encoding

Lean-native support for GPT-2-style byte-level BPE tokenizers.

This module intentionally lives in NN.API.Text rather than in a model file, so that any Transformer, diffusion LM, or verifier that wants GPT-2-compatible tokenization shares the same implementation. It parses the standard vocab.json and merges.txt files directly in Lean.

The pre-tokenizer implements the GPT-2 regex shape:

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

The Unicode \p{L}, \p{N}, and \s predicates are supplied by NN.API.Text.Unicode, rather than Lean's ASCII-oriented Char.isAlpha / Char.isDigit helpers.

Data

structure NN.API.text.Gpt2Bpe.VocabEntry :

One token-to-id entry from GPT-2's vocab.json.

token : String
Token spelling after GPT-2 byte-to-unicode escaping.
id : ℕ
Token id.

def NN.API.text.Gpt2Bpe.instReprVocabEntry.repr :

VocabEntry → ℕ → Std.Format

@[implicit_reducible]

instance NN.API.text.Gpt2Bpe.instReprVocabEntry :

Repr VocabEntry

@[implicit_reducible]

instance NN.API.text.Gpt2Bpe.instDecidableEqVocabEntry :

DecidableEq VocabEntry

def NN.API.text.Gpt2Bpe.instDecidableEqVocabEntry.decEq (x✝ x✝¹ : VocabEntry) :

Decidable (x✝ = x✝¹)

structure NN.API.text.Gpt2Bpe.MergeRank :

One ranked merge from GPT-2's merges.txt. Lower rank is applied earlier.

left : String
Left symbol.
right : String
Right symbol.
rank : ℕ
Merge priority.

@[implicit_reducible]

instance NN.API.text.Gpt2Bpe.instReprMergeRank :

Repr MergeRank

def NN.API.text.Gpt2Bpe.instReprMergeRank.repr :

MergeRank → ℕ → Std.Format

def NN.API.text.Gpt2Bpe.instDecidableEqMergeRank.decEq (x✝ x✝¹ : MergeRank) :

Decidable (x✝ = x✝¹)

@[implicit_reducible]

instance NN.API.text.Gpt2Bpe.instDecidableEqMergeRank :

DecidableEq MergeRank

structure NN.API.text.Gpt2Bpe.Tokenizer :

Loaded GPT-2 BPE tokenizer.

vocab : Array VocabEntry
Token vocabulary as loaded from vocab.json.
merges : Array MergeRank
Ranked merge table from merges.txt.
vocabMap : Std.HashMap String ℕ
Fast token-to-id lookup derived from vocab.
idMap : Std.HashMap ℕ String
Fast id-to-token lookup derived from vocab.
mergeMap : Std.HashMap (String × String) ℕ
Fast pair-to-rank lookup derived from merges.

Byte Escaping

def NN.API.text.Gpt2Bpe.bytesVisible :

def NN.API.text.Gpt2Bpe.bytesLatin1A :

def NN.API.text.Gpt2Bpe.bytesLatin1B :

def NN.API.text.Gpt2Bpe.baseBytes :

def NN.API.text.Gpt2Bpe.containsNat (xs : List ℕ) (x : ℕ) :

def NN.API.text.Gpt2Bpe.byteCodeTable :

def NN.API.text.Gpt2Bpe.byteToChar (b : UInt8) :

GPT-2 byte-to-unicode escape for one byte.
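
The mapping behind this escape can be sketched in a few lines of plain Lean. This is a minimal illustration of the standard GPT-2 byte-to-unicode table, not the module's own definition: the names `sketchIsBaseByte` and `sketchByteToCodePoint` are hypothetical, and the sketch works on `Nat` values rather than `UInt8` and `Char`.

```lean
/-- Bytes that GPT-2 maps to themselves: `!`..`~`, `¡`..`¬`, and `®`..`ÿ`.
    (Hypothetical helper, not part of the documented API.) -/
def sketchIsBaseByte (b : Nat) : Bool :=
  (33 ≤ b && b ≤ 126) || (161 ≤ b && b ≤ 172) || (174 ≤ b && b ≤ 255)

/-- Code point assigned to byte `b`. Base bytes keep their value; the `k`-th
    remaining byte (counting from 0, in increasing order) maps to `256 + k`. -/
def sketchByteToCodePoint (b : Nat) : Nat :=
  if sketchIsBaseByte b then b
  else 256 + ((List.range b).filter (fun x => !sketchIsBaseByte x)).length

#eval sketchByteToCodePoint 65   -- 65: 'A' is unchanged
#eval sketchByteToCodePoint 32   -- 288: space becomes U+0120 'Ġ'
#eval sketchByteToCodePoint 10   -- 266: newline becomes U+010A 'Ċ'
```

Because every byte gets a distinct printable code point, the mapping is invertible, which is what makes a partial inverse such as charToByte? possible.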

def NN.API.text.Gpt2Bpe.charToByte? (c : Char) :

Inverse of byteToChar, used when decoding BPE token strings back to UTF-8.

def NN.API.text.Gpt2Bpe.byteEncode (s : String) :

Reversible GPT-2 byte-to-unicode escape for a string fragment.

def NN.API.text.Gpt2Bpe.byteDecode? (s : String) :

Decode GPT-2 byte-to-unicode escaped text back into a UTF-8 string.

Pre-tokenization

inductive NN.API.text.Gpt2Bpe.RegexClass :

letter : RegexClass
number : RegexClass
other : RegexClass

@[implicit_reducible]

instance NN.API.text.Gpt2Bpe.instDecidableEqRegexClass :

DecidableEq RegexClass

def NN.API.text.Gpt2Bpe.isRegexOther (c : Char) :

def NN.API.text.Gpt2Bpe.matchesRegexClass (cls : RegexClass) (c : Char) :

def NN.API.text.Gpt2Bpe.takeWhileChars (p : Char → Bool) :

List Char → List Char × List Char

def NN.API.text.Gpt2Bpe.consumeContraction? :

List Char → Option (String × List Char)

def NN.API.text.Gpt2Bpe.consumeClassRun? (cls : RegexClass) :

List Char → Option (String × List Char)

def NN.API.text.Gpt2Bpe.consumeWhitespaceNotFollowedByNonspace? (xs : List Char) :

Option (String × List Char)

Consume the GPT-2 branch \s+(?!\S).

Python's regex engine first takes the whole whitespace run greedily, then backtracks until the negative lookahead sees either end-of-input or another whitespace character. For a whitespace run before a non-space token, this branch therefore consumes all but the final whitespace character; if that character is an ASCII space, it can then attach to the next letter/number/punctuation branch, matching GPT-2's standard token boundaries.
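
The backtracking rule for this branch can be written out directly on a character list. The following is a minimal sketch under stated assumptions: `sketchWsLookahead` is a hypothetical name, it returns character lists instead of a `String`, and it uses Lean's ASCII-only `Char.isWhitespace` rather than the Unicode `\s` predicate the module relies on.

```lean
/-- Sketch of `\s+(?!\S)`: returns the consumed whitespace and the remaining
    input, or `none` when the branch cannot match. -/
def sketchWsLookahead (xs : List Char) : Option (List Char × List Char) :=
  let run := xs.takeWhile Char.isWhitespace
  let rest := xs.drop run.length
  if run.isEmpty then none
  -- The run reaches end-of-input: the lookahead holds, take everything.
  else if rest.isEmpty then some (run, [])
  -- A non-space follows: give back the last whitespace character.
  else if run.length ≥ 2 then
    some (run.dropLast, run.drop (run.length - 1) ++ rest)
  -- A single whitespace before a non-space cannot satisfy the lookahead.
  else none

#eval sketchWsLookahead "   x".toList  -- consumes two spaces, leaves " x"
#eval sketchWsLookahead " x".toList    -- no match
#eval sketchWsLookahead "  ".toList    -- consumes both spaces
```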

def NN.API.text.Gpt2Bpe.consumeWhitespaceRun? (xs : List Char) :

Option (String × List Char)

def NN.API.text.Gpt2Bpe.pretokenizeAux :

ℕ → List Char → List String

Fuel-bounded worker for GPT-2 regex pre-token fragments before byte escaping and BPE merges.

The branch order mirrors GPT-2's tokenizer regex exactly: contractions, optional-space letter runs, optional-space number runs, optional-space runs of characters that are neither whitespace, letters, nor numbers, whitespace not followed by non-space, and finally a plain whitespace run. The fuel argument only keeps this definition total; it should never be exhausted when called from pretokenize.
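
As an illustration of one of the optional-space branches listed above (for example ` ?\p{L}+`), here is a minimal self-contained sketch. The names `sketchRun?` and `sketchClassRun?` are hypothetical, the result is a pair of character lists rather than `String × List Char`, and the demo uses Lean's ASCII-only `Char.isAlpha` and `Char.isDigit` in place of the Unicode predicates used by the module.

```lean
/-- Take a non-empty run of characters satisfying `p` from `body`,
    prefixed by the already-consumed `lead`. -/
def sketchRun? (p : Char → Bool) (lead body : List Char) :
    Option (List Char × List Char) :=
  let run := body.takeWhile p
  if run.isEmpty then none else some (lead ++ run, body.drop run.length)

/-- Sketch of a ` ?<class>+` branch: at most one leading ASCII space,
    then one or more characters of the class. -/
def sketchClassRun? (p : Char → Bool) : List Char → Option (List Char × List Char)
  | ' ' :: tl => sketchRun? p [' '] tl
  | xs => sketchRun? p [] xs

#eval sketchClassRun? Char.isAlpha " world!".toList  -- matches " world", leaves "!"
#eval sketchClassRun? Char.isDigit "abc".toList      -- no match
```

A full pre-tokenizer would try each branch in the order given above and fall through to the next one whenever a branch returns `none`.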

def NN.API.text.Gpt2Bpe.pretokenize (s : String) :

Split a string into GPT-2-style pre-token fragments.

BPE Merging

def NN.API.text.Gpt2Bpe.vocabId? (tok : Tokenizer) (s : String) :

def NN.API.text.Gpt2Bpe.tokenString? (tok : Tokenizer) (id : ℕ) :

def NN.API.text.Gpt2Bpe.mergeRank? (tok : Tokenizer) (a b : String) :

def NN.API.text.Gpt2Bpe.bestMerge? (tok : Tokenizer) :

List String → Option (String × String × ℕ)

def NN.API.text.Gpt2Bpe.applyMerge (target : String × String) :

List String → List String

def NN.API.text.Gpt2Bpe.bpeLoop (tok : Tokenizer) :

ℕ → List String → List String

def NN.API.text.Gpt2Bpe.encodeFragment (tok : Tokenizer) (fragment : String) :

Except String (List ℕ)

Apply BPE to one pre-tokenized fragment.
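
The merge loop for a single fragment can be sketched with plain lists. This is a minimal illustration of rank-ordered BPE merging, not the module's implementation: all `sketch…` names and the three-entry merge table are hypothetical, and an association list stands in for the `Std.HashMap` lookup used by `Tokenizer`.

```lean
/-- Hypothetical merge table: `(left, right)` pairs with their rank. -/
def sketchMerges : List ((String × String) × Nat) :=
  [(("l", "l"), 0), (("h", "e"), 1), (("he", "ll"), 2)]

/-- Lowest-ranked adjacent pair in `xs`, leftmost on ties. -/
def sketchBestPair? (m : List ((String × String) × Nat)) (xs : List String) :
    Option (String × String × Nat) :=
  (xs.zip (xs.drop 1)).foldl
    (fun (best : Option (String × String × Nat)) (p : String × String) =>
      match List.lookup p m, best with
      | some r, some (_, _, r') => if r < r' then some (p.1, p.2, r) else best
      | some r, none => some (p.1, p.2, r)
      | none, _ => best)
    none

/-- Replace every non-overlapping occurrence of `a, b` by `a ++ b`, left to right. -/
def sketchApplyMerge (a b : String) : List String → List String
  | x :: y :: rest =>
    if x == a && y == b then (x ++ y) :: sketchApplyMerge a b rest
    else x :: sketchApplyMerge a b (y :: rest)
  | xs => xs

/-- Fuel-bounded loop: merge the best pair until no ranked pair remains. -/
def sketchBpeLoop (m : List ((String × String) × Nat)) :
    Nat → List String → List String
  | 0, xs => xs
  | fuel + 1, xs =>
    match sketchBestPair? m xs with
    | some (a, b, _) => sketchBpeLoop m fuel (sketchApplyMerge a b xs)
    | none => xs

#eval sketchBpeLoop sketchMerges 5 ["h", "e", "l", "l", "o"]
-- ["hell", "o"]
```

Each successful step shortens the symbol list by at least one, so fuel equal to the initial number of symbols is always sufficient in this sketch. After merging, each remaining symbol would be looked up in the vocabulary to obtain its id.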

def NN.API.text.Gpt2Bpe.encode (tok : Tokenizer) (text : String) :

Except String (List ℕ)

Encode text using the loaded GPT-2 BPE files.

def NN.API.text.Gpt2Bpe.decode? (tok : Tokenizer) (ids : List ℕ) :

Except String String

Decode GPT-2 BPE ids back to text.
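
Since encode and decode? both return `Except String`, they compose naturally in `do` notation. The helper below is a hypothetical usage sketch, not part of the API: it assumes the module can be imported as `NN.API.Text.Bpe` (the name shown at the top of this page), that opening the `NN.API.text.Gpt2Bpe` namespace brings the documented names into scope, and that a `Tokenizer` value has already been obtained elsewhere.

```lean
import NN.API.Text.Bpe  -- assumed module path, taken from the page title

open NN.API.text.Gpt2Bpe

/-- Hypothetical helper: encode `text`, decode the ids again, and report
    whether the result equals the input. Errors from either step propagate. -/
def roundTrips (tok : Tokenizer) (text : String) : Except String Bool := do
  let ids ← encode tok text
  let decoded ← decode? tok ids
  pure (decoded == text)
```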

def NN.API.text.Gpt2Bpe.decodeD (tok : Tokenizer) (ids : List ℕ) :

Total, display-oriented decoder: invalid ids or invalid UTF-8 decode to an empty string instead of an error.

def NN.API.text.Gpt2Bpe.asTextTokenizer (tok : Tokenizer) :

Adapt a loaded GPT-2 BPE tokenizer to the generic text-tokenizer interface.

File Loading

def NN.API.text.Gpt2Bpe.parseVocab (j : Lean.Json) :

Except String (Array VocabEntry)

Parse GPT-2 vocab.json as an array of (token, id) entries.

The standard GPT-2 vocab.json is a single flat JSON object from token strings to numeric ids. Using Lean's fully general JSON object parser is convenient but slow for interactive examples, because it builds a 50k-entry tree that is immediately flattened again. The small parser below recognizes exactly the JSON shape used by GPT-2 vocab files and decodes JSON string escapes, including \uXXXX escapes for byte-to-unicode code points.
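
The `\uXXXX` handling mentioned above reduces to reading four hex digits and, for escapes outside the Basic Multilingual Plane, combining a UTF-16 surrogate pair. The sketch below illustrates both steps in plain Lean; the `sketch…` names are hypothetical, and the functions work on character lists rather than on the `Array Char` plus index used by the documented helpers.

```lean
/-- Value of one hexadecimal digit, if `c` is one. -/
def sketchHexVal? (c : Char) : Option Nat :=
  let n := c.toNat
  if 48 ≤ n && n ≤ 57 then some (n - 48)         -- '0'..'9'
  else if 97 ≤ n && n ≤ 102 then some (n - 87)   -- 'a'..'f'
  else if 65 ≤ n && n ≤ 70 then some (n - 55)    -- 'A'..'F'
  else none

/-- Value of exactly four hex digits, as in the `XXXX` of `\uXXXX`. -/
def sketchHex4? : List Char → Option Nat
  | [c0, c1, c2, c3] => do
    let a ← sketchHexVal? c0
    let b ← sketchHexVal? c1
    let c ← sketchHexVal? c2
    let d ← sketchHexVal? c3
    pure (((a * 16 + b) * 16 + c) * 16 + d)
  | _ => none

/-- Combine a UTF-16 high/low surrogate pair into one code point.
    Assumes `hi` is in D800..DBFF and `lo` is in DC00..DFFF. -/
def sketchCombineSurrogate (hi lo : Nat) : Nat :=
  0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)

#eval sketchHex4? "0120".toList              -- some 288
#eval sketchCombineSurrogate 0xD83D 0xDE00   -- 128512, i.e. U+1F600
```

The first result, 288 (U+0120), is the code point that the GPT-2 byte escape assigns to the space byte, which is why such escapes can appear in vocab.json.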

def NN.API.text.Gpt2Bpe.charAtD (cs : Array Char) (i : ℕ) :

def NN.API.text.Gpt2Bpe.skipJsonWs (cs : Array Char) (i : ℕ) :

def NN.API.text.Gpt2Bpe.hexVal? (c : Char) :

def NN.API.text.Gpt2Bpe.parseHex4? (cs : Array Char) (i : ℕ) :

def NN.API.text.Gpt2Bpe.combineSurrogate (hi lo : ℕ) :

def NN.API.text.Gpt2Bpe.parseJsonStringAux (cs : Array Char) :

ℕ → ℕ → List Char → Except String (String × ℕ)

def NN.API.text.Gpt2Bpe.parseJsonStringAt (cs : Array Char) (i : ℕ) :

Except String (String × ℕ)

def NN.API.text.Gpt2Bpe.parseNatAt (cs : Array Char) (i : ℕ) :

Except String (ℕ × ℕ)

def NN.API.text.Gpt2Bpe.parseVocabTextLoop (cs : Array Char) :

ℕ → ℕ → Array VocabEntry → Except String (Array VocabEntry)

def NN.API.text.Gpt2Bpe.parseVocabText (s : String) :

Except String (Array VocabEntry)

Parse GPT-2 vocab.json directly from text.

def NN.API.text.Gpt2Bpe.vocabMapOf (vocab : Array VocabEntry) :

Std.HashMap String ℕ

def NN.API.text.Gpt2Bpe.idMapOf (vocab : Array VocabEntry) :

Std.HashMap ℕ String

def NN.API.text.Gpt2Bpe.mergeMapOf (merges : Array MergeRank) :

Std.HashMap (String × String) ℕ

def NN.API.text.Gpt2Bpe.parseMergeLine? (rank : ℕ) (line : String) :

Option MergeRank

def NN.API.text.Gpt2Bpe.parseMerges (s : String) :

Array MergeRank

Parse GPT-2 merges.txt. Comment lines and malformed lines are skipped rather than reported as errors.

def NN.API.text.Gpt2Bpe.load (vocabJson mergesTxt : System.FilePath) :

Load GPT-2 BPE files directly in Lean.
