TorchLean API

NN.API.Text.Bpe

GPT-2 Byte-Pair Encoding #

Lean-native support for GPT-2-style byte-level BPE tokenizers.

This module lives in NN.API.Text rather than a model file: any Transformer, diffusion LM, or verifier that wants GPT-2-compatible tokenization should share the same implementation. The implementation parses the standard vocab.json and merges.txt files directly in Lean.

The pre-tokenizer implements the GPT-2 regex shape:

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

The Unicode \p{L}, \p{N}, and \s predicates are supplied by NN.API.Text.Unicode, rather than Lean's ASCII-oriented Char.isAlpha / Char.isDigit helpers.

Data #

One token-to-id entry from GPT-2's vocab.json.

  • token : String

    Token spelling after GPT-2 byte-to-unicode escaping.

  • id :

    Token id.

Instances For

    One ranked merge from GPT-2's merges.txt. Lower rank is applied earlier.

    • left : String

      Left symbol.

    • right : String

      Right symbol.

    • rank :

      Merge priority.

    Instances For
      Instances For

        Loaded GPT-2 BPE tokenizer.

        Instances For

          Byte Escaping #

          Bytes that GPT-2 leaves at their visible Unicode code points.

          Instances For

            First Latin-1 byte range kept visible by GPT-2 byte escaping.

            Instances For

              Second Latin-1 byte range kept visible by GPT-2 byte escaping.

              Instances For

                Bytes that do not need synthetic code points in GPT-2 byte escaping.

                Instances For

                  Boolean membership test used while constructing the byte escape table.

                  Instances For

                    GPT-2 byte-to-Unicode code-point table for all 256 byte values.

                    Instances For

                      GPT-2 byte-to-unicode escape for one byte.

                      Instances For

                        Inverse of byteToChar, used when decoding BPE token strings back to UTF-8.

                        Instances For

                          Reversible GPT-2 byte-to-unicode escape for a string fragment.

                          Instances For

                            Decode GPT-2 byte-to-unicode escaped text back into a UTF-8 string.

                            Instances For

                              Pre-tokenization #

                              Character classes used by the GPT-2 pre-tokenizer branches.

                              Instances For

                                Character predicate for GPT-2's non-whitespace, non-letter, non-number regex branch.

                                Instances For

                                  Test whether a character belongs to one of the GPT-2 regex classes.

                                  Instances For

                                    Split a character list at the first character that fails p.

                                    Instances For

                                      Consume one GPT-2 contraction token such as 's or 'll, if present.

                                      Instances For

                                        Consume one GPT-2 letter/number/other run, allowing a leading ASCII space.

                                        Instances For

                                          Consume the GPT-2 branch \s+(?!\S).

                                          Python's regex engine greedily takes a whitespace run but may backtrack so the negative lookahead sees either end-of-input or another whitespace character. For a whitespace run before a non-space token, this consumes all but the final whitespace; the final ASCII space can then attach to the next letter/number/punctuation branch, matching GPT-2's standard token boundaries.

                                          Instances For

                                            Consume a plain whitespace run when the lookahead-sensitive branch did not apply.

                                            Instances For

                                              Fuel-bounded worker for GPT-2 regex pre-token fragments before byte escaping and BPE merges.

                                              The branch order mirrors GPT-2's tokenizer regex exactly: contractions, optional-space letter runs, optional-space number runs, optional-space non-space/non-letter/non-number runs, whitespace not followed by non-space, and finally a plain whitespace run. The fuel argument keeps this definition total; pretokenize supplies enough fuel for the whole input.

                                              Instances For

                                                Split a string into GPT-2-style pre-token fragments.

                                                Instances For

                                                  BPE Merging #

                                                  Look up a token id in a loaded tokenizer.

                                                  Instances For

                                                    Look up the token spelling for a token id.

                                                    Instances For

                                                      Look up the merge rank for an adjacent pair of BPE symbols.

                                                      Instances For

                                                        Find the lowest-ranked merge currently available in a symbol list.

                                                        Instances For

                                                          Apply one BPE merge everywhere it appears in the current symbol list.

                                                          Instances For

                                                            Fuel-bounded BPE merge loop for a single escaped pre-token fragment.

                                                            Instances For

                                                              Apply BPE to one pre-tokenized fragment.

                                                              Instances For

                                                                Encode text using the loaded GPT-2 BPE files.

                                                                Instances For

                                                                  Decode GPT-2 BPE ids back to text.

                                                                  Instances For

                                                                    Total display-oriented decoder: invalid ids/UTF-8 decode to an empty string.

                                                                    Instances For

                                                                      Adapt a loaded GPT-2 BPE tokenizer to the generic text-tokenizer interface.

                                                                      Instances For

                                                                        File Loading #

                                                                        Parse GPT-2 vocab.json as an array of (token, id) entries.

                                                                        Instances For

                                                                          The standard GPT-2 vocab.json is a single flat JSON object from token strings to numeric ids. Using Lean's fully general JSON object parser is convenient but slow for interactive examples because it builds a 50k-entry tree before we immediately flatten it again. The small parser below recognizes exactly the JSON shape used by GPT-2 vocab files and decodes JSON string escapes, including \uXXXX escapes for byte-to-unicode code points.

                                                                          Safe character lookup used by the specialized vocab.json parser.

                                                                          Instances For

                                                                            Skip JSON whitespace in the specialized GPT-2 vocabulary parser.

                                                                            Instances For

                                                                              Interpret one hexadecimal digit from a JSON unicode escape.

                                                                              Instances For

                                                                                Parse four hexadecimal digits starting at i.

                                                                                Instances For

                                                                                  Combine a JSON UTF-16 surrogate pair into one Unicode code point.

                                                                                  Instances For

                                                                                    Fuel-bounded worker for JSON string parsing with escape handling.

                                                                                    Instances For

                                                                                      Parse a JSON string beginning at index i.

                                                                                      Instances For

                                                                                        Parse a natural-number literal beginning at index i.

                                                                                        Instances For

                                                                                          Fuel-bounded loop for the specialized GPT-2 vocab.json object parser.

                                                                                          Instances For

                                                                                            Parse GPT-2 vocab.json directly from text.

                                                                                            Instances For

                                                                                              Build the token-to-id lookup table stored in a loaded GPT-2 BPE tokenizer.

                                                                                              Instances For

                                                                                                Build the id-to-token lookup table stored in a loaded GPT-2 BPE tokenizer.

                                                                                                Instances For

                                                                                                  Build the pair-to-rank lookup table stored in a loaded GPT-2 BPE tokenizer.

                                                                                                  Instances For

                                                                                                    Assemble a tokenizer and its lookup maps from parsed GPT-2 vocabulary and merge tables.

                                                                                                    Instances For

                                                                                                      Parse one non-comment merges.txt line with its rank.

                                                                                                      Instances For

                                                                                                        Parse GPT-2 merges.txt. Invalid non-comment lines are ignored conservatively.

                                                                                                        Instances For

                                                                                                          Load GPT-2 BPE files directly in Lean.

                                                                                                          Instances For

                                                                                                            Load GPT-2 BPE files while printing progress for larger vocab.json / merges.txt assets.

                                                                                                            Instances For

                                                                                                              Compact vocabulary projections #

                                                                                                              Projection from original tokenizer ids to a compact working vocabulary.

                                                                                                              Small runnable language-model examples can keep a real tokenizer while restricting the model head to ids observed in the local corpus or prompt.

                                                                                                              • originals : Array

                                                                                                                Original tokenizer id for each local id.

                                                                                                              • toLocalMap : Std.HashMap

                                                                                                                Reverse lookup from original tokenizer id to local id.

                                                                                                              Instances For

                                                                                                                Number of live entries in a compact BPE projection.

                                                                                                                Instances For

                                                                                                                  Map an original tokenizer id into the compact local vocabulary, using local id 0 as OOV.

                                                                                                                  Instances For

                                                                                                                    Map a compact local id back to its original tokenizer id.

                                                                                                                    Instances For
                                                                                                                      def NN.API.text.buildLocalBpeVocab (maxVocab : ) (corpusIds promptIds : Array ) :

                                                                                                                      Build a compact working vocabulary from corpus ids and prompt ids.

                                                                                                                      Instances For

                                                                                                                        Apply a compact BPE projection to an array of original tokenizer ids.

                                                                                                                        Instances For