TorchLean API

NN.API.Text.Bpe

GPT-2 Byte-Pair Encoding #

Lean-native support for GPT-2-style byte-level BPE tokenizers.

This module is intentionally in NN.API.Text rather than a model file: any Transformer, diffusion LM, or verifier that wants GPT-2-compatible tokenization should share the same implementation. The implementation parses the standard vocab.json and merges.txt files directly in Lean.

The pre-tokenizer implements the GPT-2 regex shape:

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

The Unicode \p{L}, \p{N}, and \s predicates are supplied by NN.API.Text.Unicode, rather than Lean's ASCII-oriented Char.isAlpha / Char.isDigit helpers.
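
For example, assuming the module's pretokenize (defined below) follows this regex, a short sentence splits with one fragment per matched branch; the expected output is shown as a list, though the actual return type may differ:

  -- Hedged example: "'re" comes from the contraction branch, "we" and
  -- " tokenizers" from the optional-space letter branch, and "!" from
  -- the punctuation branch.
  #eval pretokenize "we're tokenizers!"
  -- => ["we", "'re", " tokenizers", "!"]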

Data #

One token-to-id entry from GPT-2's vocab.json.

  • token : String

    Token spelling after GPT-2 byte-to-unicode escaping.

  • id :

    Token id.

One ranked merge from GPT-2's merges.txt. Lower rank is applied earlier.

  • left : String

    Left symbol.

  • right : String

    Right symbol.

  • rank :

    Merge priority.
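
The two entry types can be sketched in Lean as follows; the structure names (VocabEntry, MergeEntry) and the Nat field types are assumptions for illustration, not the module's confirmed declarations:

  -- Hypothetical reconstruction of the entry structures described above.
  structure VocabEntry where
    token : String  -- spelling after GPT-2 byte-to-unicode escaping
    id    : Nat     -- token id (Nat is an assumption)

  structure MergeEntry where
    left  : String  -- left symbol
    right : String  -- right symbol
    rank  : Nat     -- merge priority; lower rank is applied earlier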

Loaded GPT-2 BPE tokenizer.

Byte Escaping #

GPT-2 byte-to-unicode escape for one byte.
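
The escape can be sketched as follows (a hedged reimplementation for illustration; isPrintableByte and byteToCharSketch are hypothetical names, and the module's actual byteToChar may be table-driven): bytes GPT-2 considers printable keep their own code point, and each of the remaining 68 bytes is shifted to U+0100 plus its index among the non-printable bytes, so the map stays injective and every byte has a visible spelling.

  -- Hedged sketch, not the module's definition.
  def isPrintableByte (n : Nat) : Bool :=
    (0x21 ≤ n && n ≤ 0x7E) || (0xA1 ≤ n && n ≤ 0xAC) || (0xAE ≤ n && n ≤ 0xFF)

  def byteToCharSketch (b : UInt8) : Char :=
    let n := b.toNat
    if isPrintableByte n then
      Char.ofNat n
    else
      -- shift by the byte's index among non-printable bytes below it
      Char.ofNat (0x100 + (List.range n).countP (fun m => !isPrintableByte m))

  #eval byteToCharSketch 0x20  -- 'Ġ' (U+0120): space is the 33rd shifted byte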

Inverse of byteToChar, used when decoding BPE token strings back to UTF-8.

Reversible GPT-2 byte-to-unicode escape for a string fragment.

Decode GPT-2 byte-to-unicode escaped text back into a UTF-8 string.

Pre-tokenization #

Consume the GPT-2 branch \s+(?!\S).

Python's regex engine greedily takes a whitespace run but may backtrack so that the negative lookahead sees either end-of-input or another whitespace character. For a whitespace run before a non-space token, this consumes all but the final whitespace character; that final character (typically an ASCII space) can then attach to the next letter/number/punctuation branch, matching GPT-2's standard token boundaries.
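
For instance, assuming pretokenize implements this branch faithfully, a three-space run before a word splits into a two-space fragment plus a space-prefixed word:

  -- Hedged example: \s+(?!\S) takes "  ", leaving one space to join "cat".
  #eval pretokenize "hello   cat"
  -- => ["hello", "  ", " cat"]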

Fuel-bounded worker for GPT-2 regex pre-token fragments before byte escaping and BPE merges.

The branch order mirrors GPT-2's tokenizer regex exactly: contractions, optional-space letter runs, optional-space number runs, optional-space non-space/non-letter/non-number runs, whitespace not followed by non-space, and finally a plain whitespace run. The fuel argument keeps this definition total and should never be exhausted when called by pretokenize.

Split a string into GPT-2-style pre-token fragments.

BPE Merging #

Apply BPE to one pre-tokenized fragment.
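
The merge loop can be sketched as follows (a hypothetical reimplementation for illustration, not the module's definition): repeatedly find the adjacent symbol pair with the lowest rank and fuse every occurrence of it, stopping when no adjacent pair appears in the merge table.

  -- Hedged sketch; `bpeMergeSketch` and `rankOf` are hypothetical names.
  -- `rankOf` looks up the merges.txt rank of an adjacent symbol pair.
  partial def bpeMergeSketch (rankOf : String × String → Option Nat)
      (symbols : List String) : List String :=
    -- lowest-ranked adjacent pair, if any
    let best? := (symbols.zip symbols.tail).foldl
      (init := (none : Option (Nat × String × String)))
      fun acc (l, r) =>
        match rankOf (l, r), acc with
        | some k, some (bk, _, _) => if k < bk then some (k, l, r) else acc
        | some k, none            => some (k, l, r)
        | none,   _               => acc
    match best? with
    | none => symbols
    | some (_, l, r) =>
      -- fuse each occurrence of the winning pair, then repeat
      let rec fuse : List String → List String
        | a :: b :: rest =>
          if a == l && b == r then (a ++ b) :: fuse rest
          else a :: fuse (b :: rest)
        | rest => rest
      bpeMergeSketch rankOf (fuse symbols)

Starting from the single-character symbols of an escaped fragment, e.g. ["Ġ", "t", "h", "e"], successive passes fuse the cheapest pair until a fixed point such as ["Ġthe"] is reached.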

Encode text using the loaded GPT-2 BPE files.

Decode GPT-2 BPE ids back to text.

Total display-oriented decoder: invalid ids/UTF-8 decode to an empty string.

Adapt a loaded GPT-2 BPE tokenizer to the generic text-tokenizer interface.

File Loading #

Parse GPT-2 vocab.json as an array of (token, id) entries.

The standard GPT-2 vocab.json is a single flat JSON object from token strings to numeric ids. Using Lean's fully general JSON object parser is convenient but slow for interactive examples because it builds a 50k-entry tree before we immediately flatten it again. The small parser below recognizes exactly the JSON shape used by GPT-2 vocab files and decodes JSON string escapes, including \uXXXX escapes for byte-to-unicode code points.
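
Concretely, the accepted shape is a single flat object literal along these lines (entries abridged; the escaped quote shows why JSON string unescaping is needed):

  {"!": 0, "\"": 1, ..., "Ġthe": 262, ..., "<|endoftext|>": 50256}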

Parse GPT-2 vocab.json directly from text.

Parse GPT-2 merges.txt. Invalid non-comment lines are ignored conservatively.
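
Each non-comment line is a space-separated symbol pair whose position in the file gives its rank; parsing one line can be sketched like this (parseMergeLineSketch is a hypothetical helper, not the module's parser):

  -- Hedged sketch. The "#version" header and malformed lines yield none.
  def parseMergeLineSketch (rank : Nat) (line : String) :
      Option (String × String × Nat) :=
    if line.startsWith "#" || line.trim.isEmpty then none
    else match line.splitOn " " with
      | [l, r] => some (l, r, rank)
      | _      => none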

Load GPT-2 BPE files directly in Lean.

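Putting it together, an end-to-end round trip might look like the following; every name here (loadBpe, encode, decode) is an assumption standing in for the module's actual API, whose exact signatures this page does not show:

  -- Hedged usage sketch; names and signatures are assumptions.
  def demo : IO Unit := do
    let bpe ← loadBpe "vocab.json" "merges.txt"  -- hypothetical loader
    let ids := encode bpe "we're tokenizers!"    -- hypothetical encoder
    IO.println s!"ids  = {ids}"
    IO.println s!"text = {decode bpe ids}"       -- hypothetical decoder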