BugZoo: tokenizer/import boundary
Tokenization usually happens outside the tensor graph, which makes it easy for a model import to silently disagree with the tokenizer about vocabulary size, padding, EOS, or special-token IDs. Empirical studies of LLM inference-engine bugs count tokenizer/config mismatches among their real production failure classes:
https://arxiv.org/abs/2506.09713
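As a minimal sketch of why this bug class is easy to write, assume the tokenizer's view and the imported model's view of the vocabulary live in two separate records (the names below are illustrative, not TorchLean's): nothing in untyped glue code forces the two to agree, and the consistency check is an extra function someone has to remember to call.

```lean
-- Illustrative types only: a tokenizer's view and an imported model's view of
-- the same vocabulary, with nothing tying them together.
structure TokenizerInfo where
  vocabSize : Nat
  padId     : Nat
  eosId     : Nat

structure ImportedModel where
  embeddingRows : Nat   -- rows of the token-embedding matrix
  padId         : Nat
  eosId         : Nat

-- The import-time check that has to be written, and remembered, by hand.
def consistent (t : TokenizerInfo) (m : ImportedModel) : Bool :=
  t.vocabSize == m.embeddingRows && t.padId == m.padId && t.eosId == m.eosId
```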
TorchLean's current contract is intentionally small: once tokens enter the verified fragment, token IDs are represented as `Fin vocabSize`, so out-of-vocabulary IDs are unrepresentable.
The padding token is in range by construction.
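Concretely, and as a sketch rather than the actual TorchLean definitions, the boundary reduces to one validation step: raw IDs coming from the tokenizer are `Nat`, and they become `Fin vocabSize` exactly once, at import; the padding ID is a field whose type already puts it in range. The names `BoundaryConfig`, `checkToken`, and `checkTokens` are assumptions for illustration.

```lean
-- Sketch of the token boundary, not TorchLean's real API.
-- Inside the verified fragment every token ID is a `Fin vocabSize`,
-- so an out-of-vocabulary ID simply has no representation.
structure BoundaryConfig (vocabSize : Nat) where
  padId : Fin vocabSize   -- in range by construction; EOS and other special
                          -- tokens would follow the same pattern

-- The single boundary check: untrusted `Nat` IDs from the tokenizer become
-- `Fin vocabSize` here, or the import fails.
def checkToken (vocabSize : Nat) (raw : Nat) : Option (Fin vocabSize) :=
  if h : raw < vocabSize then some ⟨raw, h⟩ else none

def checkTokens (vocabSize : Nat) (raw : List Nat) : Option (List (Fin vocabSize)) :=
  raw.mapM (checkToken vocabSize)
```

Returning `Option` here is also an assumption; an error monad carrying a diagnostic would serve the same purpose. The design point is only that the check happens once, at the boundary, and is never needed again inside the verified fragment.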