TorchLean API

NN.API.Text.Unicode

Unicode helpers for text tokenizers

Small Unicode classification tables used by Lean-native tokenizers.

The GPT-2 pre-tokenizer regex uses the Unicode properties \p{L}, \p{N}, and \s. Lean's built-in Char.isAlpha/Char.isDigit predicates are ASCII-focused in our toolchain, so tokenizer parity requires explicit Unicode tables. The letter/number ranges below were generated with Python's unicodedata module (Unicode data version 15.1.0) by grouping code points whose general category starts with L or N.

These tables are runtime data, not proof axioms. They keep the tokenizer deterministic and Lean-native while making the Unicode dependency visible in one small API module.

Sorted inclusive code-point ranges for Unicode general categories L* (Unicode 15.1.0).
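The table shape can be illustrated with a hypothetical fragment (the name `letterRangesFragment` and the entries are illustrative only; the real generated table is far larger):

```lean
-- Hypothetical fragment of an L* range table: sorted, disjoint,
-- inclusive (start, end) code-point pairs.
def letterRangesFragment : Array (Nat × Nat) :=
  #[(0x41, 0x5A),  -- 'A'..'Z'
    (0x61, 0x7A),  -- 'a'..'z'
    (0xAA, 0xAA),  -- FEMININE ORDINAL INDICATOR
    (0xB5, 0xB5),  -- MICRO SIGN
    (0xC0, 0xD6)]  -- 'À'..'Ö'
```

Keeping ranges sorted and disjoint is what makes the binary-search membership test below valid.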


Sorted inclusive code-point ranges for Unicode general categories N* (Unicode 15.1.0).

def NN.API.text.Unicode.inSortedRangesAux (ranges : Array (Nat × Nat)) (n lo hi : Nat) : Bool

Binary search over sorted inclusive ranges.
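A minimal sketch of such a search (hypothetical code: `searchAux` mirrors the documented aux signature with an exclusive upper bound and is marked `partial` for brevity; the module's actual implementation may use different conventions and a real termination proof):

```lean
-- Hypothetical sketch: binary search for n in a sorted array of
-- disjoint inclusive ranges. lo is inclusive, hi exclusive.
partial def searchAux (ranges : Array (Nat × Nat)) (n lo hi : Nat) : Bool :=
  if hi ≤ lo then false
  else
    let mid := (lo + hi) / 2
    let (a, b) := ranges[mid]!
    if n < a then searchAux ranges n lo mid        -- n below this range: go left
    else if b < n then searchAux ranges n (mid + 1) hi -- n above: go right
    else true                                      -- a ≤ n ∧ n ≤ b

def inSortedRangesSketch (ranges : Array (Nat × Nat)) (n : Nat) : Bool :=
  searchAux ranges n 0 ranges.size
```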


Test membership in a sorted inclusive range table.


Unicode \p{L} predicate used by GPT-2's regex pre-tokenizer.


Unicode \p{N} predicate used by GPT-2's regex pre-tokenizer.
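For illustration, both predicates reduce to a membership test over their range table. All names and tables below are hypothetical stand-ins for the module's generated tables, and a linear scan stands in for its binary search:

```lean
-- Hypothetical miniature tables; the real generated tables cover all
-- L* and N* code points of Unicode 15.1.0.
def sampleLetterRanges : Array (Nat × Nat) := #[(0x41, 0x5A), (0x61, 0x7A)]
def sampleNumberRanges : Array (Nat × Nat) := #[(0x30, 0x39)]

-- Linear scan over inclusive ranges (stand-in for binary search).
def inRanges (ranges : Array (Nat × Nat)) (n : Nat) : Bool :=
  ranges.any fun r => decide (r.1 ≤ n ∧ n ≤ r.2)

def isLetterSample (c : Char) : Bool := inRanges sampleLetterRanges c.toNat
def isNumberSample (c : Char) : Bool := inRanges sampleNumberRanges c.toNat

#eval isLetterSample 'q'  -- true
#eval isNumberSample '7'  -- true
```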


Unicode regex whitespace predicate for GPT-2 pre-tokenization.

This is the standard Unicode White_Space set used by Python's regex package for \s: the ASCII space and whitespace controls, NEL, NBSP, the Ogham space mark, the U+2000 block spaces, the line and paragraph separators, narrow NBSP, medium mathematical space, and ideographic space.
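Since the set is small and fixed, the predicate can be written as direct comparisons rather than a range table. A hypothetical sketch (`isWhitespaceSketch` is illustrative, with code points taken from the White_Space list above):

```lean
-- Hypothetical sketch of the \s predicate over the Unicode
-- White_Space code points enumerated above.
def isWhitespaceSketch (c : Char) : Bool :=
  let n := c.toNat
  decide (0x09 ≤ n ∧ n ≤ 0x0D)         -- TAB, LF, VT, FF, CR
  || n == 0x20                          -- SPACE
  || n == 0x85                          -- NEL
  || n == 0xA0                          -- NO-BREAK SPACE
  || n == 0x1680                        -- OGHAM SPACE MARK
  || decide (0x2000 ≤ n ∧ n ≤ 0x200A)  -- EN QUAD .. HAIR SPACE
  || n == 0x2028 || n == 0x2029         -- LINE / PARAGRAPH SEPARATOR
  || n == 0x202F                        -- NARROW NO-BREAK SPACE
  || n == 0x205F                        -- MEDIUM MATHEMATICAL SPACE
  || n == 0x3000                        -- IDEOGRAPHIC SPACE
```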
