Unicode helpers for text tokenizers #
Small Unicode classification tables used by Lean-native tokenizers.
The GPT-2 pre-tokenizer regex uses the Unicode properties \p{L}, \p{N}, and \s. Lean's built-in
Char.isAlpha/Char.isDigit predicates are ASCII-focused in our toolchain, so parity with the
reference tokenizer requires explicit Unicode tables. The letter/number ranges below were
generated from Python unicodedata 15.1.0 by grouping code points whose general category starts
with L or N.
These tables are runtime data, not proof axioms. They keep the tokenizer deterministic and Lean-native while making the Unicode dependency visible in one small API module.
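A minimal Lean sketch of how such a range table can back a \p{L}-style predicate. The names `letterRanges` and `isUnicodeLetter`, and the excerpted ranges, are illustrative assumptions, not this module's actual API:

```lean
/-- Illustrative excerpt only: the real table covers every code point whose
    general category begins with `L`. -/
def letterRanges : Array (UInt32 × UInt32) :=
  #[(0x41, 0x5A),   -- A-Z
    (0x61, 0x7A),   -- a-z
    (0xC0, 0xD6)]   -- À-Ö (tiny excerpt; the generated table is much larger)

/-- `\p{L}`-style check: test whether `c`'s code point falls in any range.
    A production version would binary-search the sorted table instead of
    scanning it linearly. -/
def isUnicodeLetter (c : Char) : Bool :=
  letterRanges.any fun (lo, hi) => decide (lo ≤ c.val ∧ c.val ≤ hi)
```

Keeping the ranges as plain data (rather than one giant match expression) keeps the generated file small and makes regeneration from a newer Unicode version a data-only change.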
Unicode \p{L} predicate used by GPT-2's regex pre-tokenizer.
Unicode \p{N} predicate used by GPT-2's regex pre-tokenizer.
Unicode regex whitespace predicate for GPT-2 pre-tokenization.
This is the standard Unicode White_Space set used by Python's regex package for \s: the ASCII
space and control whitespace, NEL, NBSP, the Ogham space mark, the U+2000 block spaces, the line
and paragraph separators, narrow NBSP, the medium mathematical space, and the ideographic space.
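As a sketch, the code points named above can be collected into a single list and checked by membership. The names `regexSpaceCodes` and `isRegexSpace` are hypothetical, not the module's actual definitions:

```lean
/-- The Unicode White_Space code points, matching the description above. -/
def regexSpaceCodes : List UInt32 :=
  [0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20,    -- ASCII tab..carriage return, space
   0x85, 0xA0, 0x1680]                     -- NEL, NBSP, Ogham space mark
    ++ (List.range 11).map (fun n => 0x2000 + n.toUInt32)  -- U+2000..U+200A
    ++ [0x2028, 0x2029,                    -- line / paragraph separators
        0x202F, 0x205F, 0x3000]            -- narrow NBSP, MMSP, ideographic space

/-- `\s`-style check against the White_Space list. -/
def isRegexSpace (c : Char) : Bool :=
  regexSpaceCodes.contains c.val
```

With only a few dozen members, a plain list lookup is adequate; the letter and number tables are the ones that benefit from range compression.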