Unicode helpers for text tokenizers #
Small Unicode classification tables used by Lean-native tokenizers.
The GPT-2 pre-tokenizer regex uses the Unicode properties \p{L}, \p{N}, and \s. Lean's built-in
Char.isAlpha/Char.isDigit predicates are ASCII-focused in our toolchain, so parity with the
reference tokenizer requires explicit Unicode tables. The letter/number ranges below were
generated from Python unicodedata 15.1.0 by grouping code points whose general category starts
with L or N.
These tables are runtime data, not proof axioms. They keep the tokenizer deterministic and Lean-native while making the Unicode dependency visible in one small API module.
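A minimal Lean sketch of how such a range table can back a \p{L}-style predicate. The names `letterRanges` and `isUnicodeLetter`, and the excerpted ranges, are illustrative assumptions, not this module's actual API:

```lean
/-- Illustrative excerpt only: the real table covers every code point whose
    general category begins with `L`. -/
def letterRanges : Array (UInt32 × UInt32) :=
  #[(0x41, 0x5A),   -- A-Z
    (0x61, 0x7A),   -- a-z
    (0xC0, 0xD6)]   -- À-Ö (tiny excerpt; the generated table is much larger)

/-- `\p{L}`-style check: test whether `c`'s code point falls in any range.
    A production version would binary-search the sorted table instead of
    scanning it linearly. -/
def isUnicodeLetter (c : Char) : Bool :=
  letterRanges.any fun (lo, hi) => decide (lo ≤ c.val ∧ c.val ≤ hi)
```

Keeping the ranges as plain data (rather than one giant match expression) keeps the generated file small and makes regeneration from a newer Unicode version a data-only change.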
Unicode \p{L} predicate used by GPT-2's regex pre-tokenizer.
Unicode \p{N} predicate used by GPT-2's regex pre-tokenizer.
Unicode regex whitespace predicate for GPT-2 pre-tokenization.
This is the standard Unicode White_Space set used by Python's regex package for \s: the ASCII
space and control whitespace, NEL, NBSP, the Ogham space mark, the U+2000 block spaces, the line
and paragraph separators, narrow NBSP, the medium mathematical space, and the ideographic space.
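As a sketch, the code points named above can be collected into a single list and checked by membership. The names `regexSpaceCodes` and `isRegexSpace` are hypothetical, not the module's actual definitions:

```lean
/-- The Unicode White_Space code points, matching the description above. -/
def regexSpaceCodes : List UInt32 :=
  [0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20,    -- ASCII tab..carriage return, space
   0x85, 0xA0, 0x1680]                     -- NEL, NBSP, Ogham space mark
    ++ (List.range 11).map (fun n => 0x2000 + n.toUInt32)  -- U+2000..U+200A
    ++ [0x2028, 0x2029,                    -- line / paragraph separators
        0x202F, 0x205F, 0x3000]            -- narrow NBSP, MMSP, ideographic space

/-- `\s`-style check against the White_Space list. -/
def isRegexSpace (c : Char) : Bool :=
  regexSpaceCodes.contains c.val
```

With only a few dozen members, a plain list lookup is adequate; the letter and number tables are the ones that benefit from range compression.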