BugZoo: KV-cache contracts #
LLM inference-engine bug reports include cache-shift bugs, RoPE/position mismatches, shape mistakes, and resource/configuration errors. TorchLean does not verify a full serving engine, a paged-attention allocator, or a multi-GPU scheduler. The useful first step is still a precise one: represent the cache update as a typed tensor operation and prove the append invariant we rely on.
Reference:
- Liu et al., "A First Look at Bugs in LLM Inference Engines", 2025.
This file proves the cache-append boundary. A stronger future theorem should connect cached decode to full-sequence attention:
`decodeWithCache prefix newToken = fullAttention (prefix ++ [newToken])`
under the same mask, the same RoPE/position encoding, and the same numeric semantics.
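One concrete shape that statement could take, sketched with hypothetical `decodeWithCache` and `fullAttention` signatures over plain token lists (the mask, RoPE, and numeric-semantics parameters are elided; `ctx` stands in for `prefix`, which is a Lean keyword):

```lean
/-- A sketch of the future end-to-end property. `Token`, `Out`, and both
function arguments are placeholders, not definitions from this file. -/
def DecodeMatchesFull {Token Out : Type}
    (decodeWithCache : List Token → Token → Out)
    (fullAttention : List Token → Out) : Prop :=
  ∀ (ctx : List Token) (newToken : Token),
    decodeWithCache ctx newToken = fullAttention (ctx ++ [newToken])
```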
A key/value cache with an explicit sequence length and head dimension.
- keys : Spec.Tensor α (Spec.Shape.dim seqLen (Spec.Shape.dim headDim Spec.Shape.scalar))
Cached key vectors, indexed by time.
- values : Spec.Tensor α (Spec.Shape.dim seqLen (Spec.Shape.dim headDim Spec.Shape.scalar))
Cached value vectors, indexed by time.
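Spelled out as a Lean structure, assuming the `Spec` vocabulary used in the field signatures above (the name `KVCache` is illustrative, not necessarily the file's own):

```lean
structure KVCache (α : Type) (seqLen headDim : Nat) where
  /-- Cached key vectors, indexed by time. -/
  keys : Spec.Tensor α (Spec.Shape.dim seqLen (Spec.Shape.dim headDim Spec.Shape.scalar))
  /-- Cached value vectors, indexed by time. -/
  values : Spec.Tensor α (Spec.Shape.dim seqLen (Spec.Shape.dim headDim Spec.Shape.scalar))
```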
View one token vector as a length-one sequence.
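In a self-contained toy model where a length-`n` sequence of `d`-vectors is a function `Fin n → Fin d → α`, the length-one view ignores its single time index (a sketch; `asSeq` is an illustrative name):

```lean
/-- View one token vector as a length-one sequence: the only time index
returns the vector itself. -/
def asSeq {α : Type} {d : Nat} (v : Fin d → α) : Fin 1 → Fin d → α :=
  fun _ => v
```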
Append one token vector to a cached sequence along the time axis.
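In the same toy model, the time-axis append sends indices below `n` to the old sequence and the final index `n` to the new vector (a sketch; `snocSeq` is an illustrative name):

```lean
/-- Append `v` at time `n`: old indices read the old sequence, the new
final index reads the appended vector. -/
def snocSeq {α : Type} {n d : Nat}
    (s : Fin n → Fin d → α) (v : Fin d → α) : Fin (n + 1) → Fin d → α :=
  fun t => if h : t.val < n then s ⟨t.val, h⟩ else v
```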
Append both key and value vectors to the KV cache.
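Combining the two components in the toy model (a sketch; `Cache` and `Cache.append` are illustrative stand-ins for this file's definitions):

```lean
/-- Toy key/value cache: both components share the sequence length `n`. -/
structure Cache (α : Type) (n d : Nat) where
  keys : Fin n → Fin d → α
  values : Fin n → Fin d → α

/-- Append key `k` and value `v` at time `n`, growing both components. -/
def Cache.append {α : Type} {n d : Nat} (c : Cache α n d)
    (k v : Fin d → α) : Cache α (n + 1) d where
  keys := fun t => if h : t.val < n then c.keys ⟨t.val, h⟩ else k
  values := fun t => if h : t.val < n then c.values ⟨t.val, h⟩ else v
```

The dependent `if h : t.val < n` makes the time-index bound check explicit in the term, which is exactly where the cache-shift and off-by-one bugs named in the introduction live.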
The newly appended key is exactly the final key in the updated cache.
The newly appended value is exactly the final value in the updated cache.
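Continuing the toy `Cache` sketch above, both invariants reduce to the guard `n < n` failing at the last index, so the `else` branch returns the freshly appended vector:

```lean
/-- The appended key is the final key: at `Fin.last n` the bound check
`n < n` fails, so the append returns `k`. -/
theorem Cache.append_keys_last {α : Type} {n d : Nat}
    (c : Cache α n d) (k v : Fin d → α) :
    (c.append k v).keys (Fin.last n) = k :=
  dif_neg (Nat.lt_irrefl n)

/-- The symmetric invariant for the appended value. -/
theorem Cache.append_values_last {α : Type} {n d : Nat}
    (c : Cache α n d) (k v : Fin d → α) :
    (c.append k v).values (Fin.last n) = v :=
  dif_neg (Nat.lt_irrefl n)
```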