Releases: back2matching/turboquant
v0.1.0 - First release
First open-source implementation of Google's TurboQuant KV cache compression (ICLR 2026).
What's included
- TurboQuant algorithms (MSE + inner-product optimal) from the paper
- HuggingFace DynamicCache drop-in with KIVI-style residual window
- OpenAI-compatible inference server (`turboquant-server`)
- Benchmarks on RTX 4080 (first consumer-GPU TurboQuant results anywhere)
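To illustrate the KIVI-style residual window mentioned above: older KV entries are stored quantized to 4 bits, while the most recent `window` tokens are kept in full precision. The NumPy sketch below is illustrative only, not the library's actual implementation; the function names and the simple per-channel asymmetric uniform quantizer are assumptions made for the example.

```python
import numpy as np

def quantize_4bit(x):
    """Per-channel asymmetric uniform 4-bit quantization (illustrative)."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / 15.0                  # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant channels
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

def compress_kv(keys, window=32):
    """KIVI-style split: quantize old tokens, keep a full-precision residual window."""
    old, recent = keys[:-window], keys[-window:]
    return quantize_4bit(old), recent

keys = np.random.randn(128, 64).astype(np.float32)  # (tokens, head_dim)
(q, scale, lo), residual = compress_kv(keys, window=32)
# Reconstruct the cache: dequantized old tokens + exact recent tokens
recon = np.vstack([dequantize_4bit(q, scale, lo), residual])
max_err = np.abs(recon - keys).max()
```

The residual window is exact by construction, so quantization error only affects tokens outside the window.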
Install
```
pip install turboquant
```
Quick start
```python
from turboquant import TurboQuantCache

cache = TurboQuantCache(bits=4)
outputs = model.generate(..., past_key_values=cache)
```
Benchmarks (Qwen2.5-3B on RTX 4080 16GB)
| KV Mode | Peak VRAM | Speed | Quality |
|---|---|---|---|
| FP16 (baseline) | 6,922 MB | 28 tok/s | Perfect |
| TurboQuant 4-bit | 6,448 MB (-474 MB) | 17 tok/s | Good |
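Because the bundled server is OpenAI-compatible, any OpenAI-style client should work against it. The sketch below builds a standard chat-completions request; the host, port, and model name are assumptions for illustration, so check your `turboquant-server` startup output for the actual values.

```python
import json
import urllib.request

# Assumed defaults below: adjust host, port, and model to match your setup.
payload = {
    "model": "Qwen/Qwen2.5-3B-Instruct",  # assumption: whichever model the server loaded
    "messages": [
        {"role": "user", "content": "Summarize TurboQuant in one sentence."}
    ],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumption: default host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and read the reply:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```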