TorchLean API

NN.Examples.BugZoo.TokenizerBoundary

BugZoo: tokenizer/import boundary

Tokenization usually happens outside the tensor graph, so a model import can silently disagree with the tokenizer about vocabulary size, padding, EOS, or special-token IDs. Empirical studies of LLM inference-engine bugs count tokenizer/config mismatches among real production failure classes:

https://arxiv.org/abs/2506.09713

TorchLean's current contract is intentionally small: once tokens enter the verified fragment, token IDs are represented as Fin vocabSize, making out-of-vocabulary IDs unrepresentable.
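As a minimal illustration of the idea (toy vocabulary size, not TorchLean's API): a Fin value packages a Nat together with a proof that it is below the bound, so an out-of-range ID is rejected when the term is elaborated.

```lean
-- Toy vocabulary of 4 tokens: the valid IDs are 0, 1, 2, 3.
def tok : Fin 4 := ⟨3, by omega⟩
-- def bad : Fin 4 := ⟨4, by omega⟩  -- rejected at elaboration: 4 < 4 is false
```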

The tokenizer metadata that must agree with the model's embedding table.

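The declaration itself is not shown on this page. A plausible shape for such a metadata structure, sketched here with hypothetical field names (TokenizerMeta, padId, and eosId are assumptions, not TorchLean's actual identifiers):

```lean
-- Hypothetical sketch: the special-token fields depend on vocabSize,
-- so each special token is in range by construction.
structure TokenizerMeta where
  vocabSize : Nat
  padId : Fin vocabSize  -- padding token ID, provably < vocabSize
  eosId : Fin vocabSize  -- end-of-sequence token ID, provably < vocabSize
```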

    A token sequence whose IDs are statically bounded by the vocabulary size.

    • tokenAt : Fin seqLen → Fin vocabSize

      The padding token is in range by construction.

      theorem NN.Examples.BugZoo.TokenizerBoundary.tokenAt_in_vocab {vocabSize seqLen : Nat} (seq : TokenSeq vocabSize seqLen) (i : Fin seqLen) :
      ↑(seq.tokenAt i) < vocabSize
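Since tokenAt returns Fin vocabSize, the value carries its own bound, so a plausible proof is just the isLt field of the result (the actual proof term is not shown on this page; the structure declaration below is reconstructed from the field listed above):

```lean
structure TokenSeq (vocabSize seqLen : Nat) where
  tokenAt : Fin seqLen → Fin vocabSize

-- The bound is immediate: every `Fin vocabSize` value is < vocabSize.
theorem tokenAt_in_vocab {vocabSize seqLen : Nat}
    (seq : TokenSeq vocabSize seqLen) (i : Fin seqLen) :
    (seq.tokenAt i : Nat) < vocabSize :=
  (seq.tokenAt i).isLt
```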

      Every imported token ID is in range by construction.