Compress your LLM's KV cache by 5–7× with near-zero accuracy loss. Run longer contexts, serve more users, use less GPU memory.
First open-source implementation of Google's TurboQuant (ICLR 2026). 3.5 bits/value = near-identical quality to FP16. Provably within 2.7× of information-theoretic optimal.
📄 Paper Results (Llama-3.1-8B-Instruct, LongBench — from the paper)
| Method | KV Bits | LongBench Avg | Needle-in-Haystack |
|---|---|---|---|
| Full Precision | 16 | 50.06 | 0.997 |
| TurboQuant | 3.5 | 50.06 | 0.997 |
| TurboQuant | 2.5 | 49.44 | 0.997 |
| PolarQuant | 3.9 | 49.78 | 0.995 |
| KIVI | 3 | 48.50 | 0.981 |
| SnapKV | — | 44.57 | 0.858 |
📊 Measured results (this implementation; see BENCHMARKS.md for details):
| Mode | Logit Cosine | Top-1 Match | KV Key Cosine | KV Value Cosine | Compression |
|---|---|---|---|---|---|
| 3.5-bit (default) | 0.963 | 80% (4/5) | 0.992 | 0.988 | 4.9× |
| 2.5-bit | 0.956 | 80% (4/5) | 0.973 | 0.961 | 7.1× |
Both modes use two independent rotations for outlier/regular channel subsets (Section 2.3) and online codebooks from actual data (Section 4.1).
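The online codebooks can be illustrated with a minimal scalar Lloyd-Max fit. This is a sketch only: function names are ours, and it fits a codebook offline from a batch of samples, whereas the repo builds codebooks online from streaming data (Section 4.1).

```python
import torch

def lloyd_max_1d(samples, bits=2, iters=25):
    # Fit a 2**bits-level scalar codebook to 1-D samples with Lloyd's algorithm.
    k = 2 ** bits
    # Start centroids at evenly spaced quantiles of the observed data.
    centroids = torch.quantile(samples, torch.linspace(0.5 / k, 1 - 0.5 / k, k))
    for _ in range(iters):
        # Assignment step: nearest centroid per sample.
        idx = (samples[:, None] - centroids[None, :]).abs().argmin(dim=1)
        # Update step: each centroid moves to the mean of its cell.
        for j in range(k):
            cell = samples[idx == j]
            if cell.numel() > 0:
                centroids[j] = cell.mean()
    return centroids

def assign(y, centroids):
    # b-bit code: index of the nearest centroid; decode with centroids[idx].
    return (y[..., None] - centroids).abs().argmin(dim=-1)
```

Fitting on Gaussian-like samples (what rotated coordinates approximately look like) gives the per-coordinate codebook used in the scalar stage.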
Rotation modes: rotation_mode="hadamard" (default, O(d log d)) or rotation_mode="dense" (full random orthogonal via QR decomposition, O(d²)). Both satisfy P^T P = I exactly.
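Both rotation modes can be sketched as follows. This is illustrative, not the repo's API: the function names are ours, and the Hadamard path uses a plain matmul for clarity where a real implementation would use an O(d log d) fast Walsh-Hadamard transform.

```python
import torch

def hadamard(d):
    # Sylvester construction of an orthonormal d x d Hadamard matrix (d a power of 2).
    H = torch.ones(1, 1)
    while H.shape[0] < d:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / d ** 0.5

def rotate(x, seed=0, mode="hadamard"):
    # y = P @ x with P orthogonal, so norms and inner products are preserved.
    d = x.shape[-1]
    g = torch.Generator().manual_seed(seed)
    if mode == "hadamard":
        # Randomized Hadamard: diag(random signs) then Hadamard matrix.
        signs = torch.randint(0, 2, (d,), generator=g).float() * 2 - 1
        return (x * signs) @ hadamard(d)
    # Dense mode: full random orthogonal matrix via QR decomposition.
    Q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return x @ Q.T
```

Either way the transform is exactly orthogonal, so quantizing the rotated vector introduces no error beyond the quantizer itself.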
TurboQuant is a two-stage vector quantizer that achieves near-optimal compression:
```
Input KV vector (FP16, d=128)
          │
          ▼
┌─────────────────────┐
│  Random Rotation Π  │  Hadamard + random signs
│      y = Π · x      │  O(d log d), preserves norms
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Scalar Lloyd-Max   │  Each coordinate independently
│  idx = quantize(y)  │  b bits per coordinate
│                     │  Beta dist ≈ N(0, 1/d)
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│    QJL Residual     │  1-bit sign quantization
│ sign(S · residual)  │  Unbiased inner products
└─────────┬───────────┘
          │
          ▼
Compressed: (b+1) bits/coord + FP16 norm
          = ~3.25 bits/value at b=2
          = ~4.9× compression vs FP16
```
Key insight: After random rotation, each coordinate follows a Beta distribution that's near-independent of other coordinates. This means scalar quantization per coordinate is near-optimal — no coupling, no error compounding through deep models.
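A compact end-to-end sketch of the pipeline above. This is illustrative only: a uniform quantizer stands in for the fitted Lloyd-Max codebook, a scaled sign code stands in for the QJL projection, and the Hadamard multiply is a plain matmul rather than a fast transform.

```python
import torch

def encode_decode(x, b=2, seed=0):
    # Sketch: rotate -> b-bit scalar quantize -> 1-bit sign code on the residual.
    # NOTE: uniform quantizer + scaled-sign residual approximate Lloyd-Max + QJL.
    d = x.shape[-1]
    g = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (d,), generator=g).float() * 2 - 1
    H = torch.ones(1, 1)
    while H.shape[0] < d:                      # Sylvester Hadamard construction
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    H = H / d ** 0.5                           # orthonormal and symmetric: H @ H = I
    y = (x * signs) @ H                        # stage 1: random rotation
    levels = 2 ** b
    step = 6 * y.std() / levels                # cover +/- 3 sigma with 2^b levels
    q = (y / step).round().clamp(-(levels // 2), levels // 2 - 1) * step
    r = y - q                                  # residual left by the scalar stage
    y_hat = q + r.abs().mean() * r.sign()      # stage 2: 1-bit sign code for residual
    return (y_hat @ H) * signs                 # invert the rotation exactly
```

Even this crude stand-in recovers vectors with high cosine similarity, because the orthogonal rotation spreads energy evenly and the sign code mops up most of the scalar quantizer's residual.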
```bash
# Install
git clone https://github.com/OnlyTerp/turboquant.git
cd turboquant
pip install -e .

# Run demo (synthetic vectors, no GPU needed)
python src/demo.py

# Run real model validation (downloads TinyLlama or Nemotron-Nano-4B)
python src/test_real_model.py
```

Known limitations:
- Reference implementation — Pure PyTorch, not optimized for production throughput. Triton kernels are experimental.
- CPU attention is slow — The demo runs on CPU (~25× slower than FP16). GPU kernels needed for competitive speed.
- Mixed-precision is approximate — Our outlier channel detection differs from the paper's theoretically optimal two-independent-instances approach (see IMPLEMENTATION_NOTES.md).
- Tested on 2 models — Mistral-7B-Instruct and Nemotron-Nano-4B. More model validation needed.
- vLLM plugin is a scaffold — Not yet tested with actual vLLM serving.
TurboQuant implements two algorithms from the paper:

TurboQuant_mse (distortion-optimal quantizer):
- Random rotation: multiply by a randomized Hadamard matrix Π
- Scalar quantization: Lloyd-Max codebook for the Beta distribution, applied per coordinate
- Store: b-bit index per coordinate + FP16 norm
- Distortion bound: MSE ≤ √(3π/2) · 4^(-b)

Inner-product variant (TurboQuant_mse + QJL residual):
- Apply TurboQuant_mse with (b-1) bits
- Compute the residual: r = x - DeQuant(Quant(x))
- QJL: sign(S · r) where S has i.i.d. N(0,1) entries
- Unbiased: E[⟨y, x̂⟩] = ⟨y, x⟩ (no systematic bias)
- Total: b bits per coordinate
The related PolarQuant paper uses recursive polar coordinates, but TurboQuant deliberately avoids this. Recursive polar transforms couple coordinates through sin/cos operations at each level, causing errors to compound through deep models (7 levels for d=128). TurboQuant's scalar approach quantizes each coordinate independently — zero coupling, zero compounding.
```
turboquant/
├── src/
│   ├── cache.py             # Core algorithm (encode/decode/cache/attention)
│   ├── demo.py              # Synthetic benchmark
│   ├── test_real_model.py   # Real transformer model validation
│   ├── test_turboquant.py   # Unit tests (33 tests)
│   ├── kernels.py           # Triton GPU kernels (experimental)
│   └── lut_attention.py     # LUT-based attention (experimental)
├── vllm_plugin/             # vLLM integration scaffold
├── deploy/                  # Docker deployment assets
├── BENCHMARKS.md            # Detailed measurements
├── IMPLEMENTATION_NOTES.md  # Implementation decisions
└── setup.py                 # Package installation
```
```bibtex
@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

This is an independent open-source implementation of the TurboQuant algorithm. All credit for the algorithm design, theoretical analysis, and original research belongs to the paper authors.
- Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Published at ICLR 2026
- Authors: Amir Zandieh (Google Research), Majid Daliri (NYU), Majid Hadian (Google DeepMind), Vahab Mirrokni (Google Research)
- Related work: PolarQuant and QJL, by overlapping authors
- Implementation: Terp AI Labs
This implementation is not affiliated with or endorsed by Google Research, Google DeepMind, or NYU. We built it from the public paper to make TurboQuant accessible to the open-source community.
MIT — see LICENSE for details.
