TorchLean API

NN.Examples.BugZoo.TokenizerBoundary

BugZoo: tokenizer/import boundary

Tokenization usually happens outside the tensor graph, so a model import can silently disagree with the tokenizer about vocabulary size, padding, EOS, or special-token IDs. Empirical studies of LLM inference-engine bugs count tokenizer/config mismatches among real production failure classes:

https://arxiv.org/abs/2506.09713

TorchLean's current contract is intentionally small: once tokens enter the verified fragment, token IDs are represented as Fin vocabSize, making out-of-vocabulary IDs unrepresentable.
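As a minimal illustration of the idea (toy vocabulary size, not TorchLean's API): a Fin value packages a Nat together with a proof that it is below the bound, so an out-of-range ID is rejected when the term is elaborated.

```lean
-- Toy vocabulary of 4 tokens: the valid IDs are 0, 1, 2, 3.
def tok : Fin 4 := ⟨3, by omega⟩
-- def bad : Fin 4 := ⟨4, by omega⟩  -- rejected at elaboration: 4 < 4 is false
```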

The tokenizer metadata that must agree with the model's embedding table.

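The declaration itself is not shown on this page. A plausible shape for such a metadata structure, sketched here with hypothetical field names (TokenizerMeta, padId, and eosId are assumptions, not TorchLean's actual identifiers):

```lean
-- Hypothetical sketch: the special-token fields depend on vocabSize,
-- so each special token is in range by construction.
structure TokenizerMeta where
  vocabSize : Nat
  padId : Fin vocabSize  -- padding token ID, provably < vocabSize
  eosId : Fin vocabSize  -- end-of-sequence token ID, provably < vocabSize
```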

    A token sequence whose IDs are statically bounded by the vocabulary size.

    • tokenAt : Fin seqLen → Fin vocabSize

      The padding token is in range by construction.

      theorem NN.Examples.BugZoo.TokenizerBoundary.tokenAt_in_vocab {vocabSize seqLen : Nat} (seq : TokenSeq vocabSize seqLen) (i : Fin seqLen) :
      ↑(seq.tokenAt i) < vocabSize
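Since tokenAt returns Fin vocabSize, the value carries its own bound, so a plausible proof is just the isLt field of the result (the actual proof term is not shown on this page; the structure declaration below is reconstructed from the field listed above):

```lean
structure TokenSeq (vocabSize seqLen : Nat) where
  tokenAt : Fin seqLen → Fin vocabSize

-- The bound is immediate: every `Fin vocabSize` value is < vocabSize.
theorem tokenAt_in_vocab {vocabSize seqLen : Nat}
    (seq : TokenSeq vocabSize seqLen) (i : Fin seqLen) :
    (seq.tokenAt i : Nat) < vocabSize :=
  (seq.tokenAt i).isLt
```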

      Every imported token ID is in range by construction.