Near-optimal vector quantization for LLM KV cache compression.
3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
A Python implementation of the TurboQuant algorithm (ICLR 2026) by Zandieh et al.
PolarQuant (Lloyd-Max on rotated coordinates) + QJL (1-bit residual correction for unbiased inner products).
- 3-bit KV cache compression with near-optimal distortion rate
- Two-stage pipeline: PolarQuant (MSE-optimal scalar quantization) + QJL (unbiased inner product estimation)
- Lloyd-Max quantizer built on the exact Beta(d/2, d/2) distribution that arises from random rotation
- No calibration data required — works online, token-by-token
- Simple API — compress, decompress, and estimate inner products in a few lines
- KV cache simulator included for attention score benchmarking
- Pure Python + NumPy/SciPy — no custom CUDA kernels, easy to read and extend
TurboQuant compresses high-dimensional vectors (e.g., KV cache entries in transformer attention) in two stages:

Stage 1 (PolarQuant):
- Random rotation: Multiply the input vector by a fixed random orthogonal matrix. This makes each coordinate follow a known Beta(d/2, d/2) distribution.
- Lloyd-Max scalar quantization: Quantize each rotated coordinate independently using a Lloyd-Max quantizer optimized for that Beta distribution. This achieves near-optimal MSE distortion.

Stage 2 (QJL):
- Compute residual: Take the difference between the rotated vector and its quantized reconstruction.
- 1-bit projection: Project the residual onto random Rademacher vectors and store only the signs (1 bit each).
- Inner product correction: At query time, use the stored signs to correct the inner product estimate, making it unbiased.
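The Lloyd-Max step can be sketched as the classic alternation: cell boundaries are midpoints between centroids, and centroids are the conditional means of their cells. The sketch below is illustrative only, run on samples rather than the exact pdf the package fits against; the rescaling of Beta samples to [-1, 1] is an assumption about how the coordinate distribution is parameterized.

```python
import numpy as np

def lloyd_max(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Sample-based Lloyd-Max: alternate boundary and centroid updates."""
    levels = 2 ** bits
    # Initialize the codebook at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2  # nearest-neighbor cells
        cells = np.searchsorted(boundaries, samples)
        for k in range(levels):
            members = samples[cells == k]
            if members.size:
                centroids[k] = members.mean()  # MSE-optimal centroid = cell mean
    return centroids

rng = np.random.default_rng(0)
d = 256
# Coordinate distribution per the description above, rescaled to [-1, 1] (assumption).
coords = rng.beta(d / 2, d / 2, size=100_000) * 2 - 1
codebook = lloyd_max(coords, bits=3)
recon = codebook[np.searchsorted((codebook[:-1] + codebook[1:]) / 2, coords)]
print(f"3-bit MSE: {np.mean((coords - recon) ** 2):.2e}")
```

The resulting per-coordinate MSE is a small fraction of the coordinate variance (about 1/(d+1) for this rescaled Beta), which is what makes a 3-bit budget viable.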
Input vector x ∈ R^d
│
▼
┌─────────────────────┐
│ Rotate: y = Π·x │ (random orthogonal matrix Π)
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Lloyd-Max quantize │ (B bits per coordinate)
│ indices = Q(y) │
└─────────┬───────────┘
▼
┌─────────────────────┐
│ QJL residual signs │ (1 bit per projection dimension)
│ s = sign(S·(y-ŷ)) │
└─────────┬───────────┘
▼
CompressedVector(indices, signs)
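The diagram above maps to a few lines of NumPy. This is a hedged stand-in, not the package's internals: it uses a QR factorization for the random rotation and a simple uniform codebook in place of the Lloyd-Max one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits, m = 256, 3, 256                  # dimension, bit-width, QJL projection dim

Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
S = rng.choice([-1.0, 1.0], size=(m, d))           # Rademacher projection rows

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                             # unit-normalize the input

y = Pi @ x                                         # Stage 1: rotate
# Uniform stand-in codebook spanning ~±3 coordinate standard deviations (1/sqrt(d)).
codebook = np.linspace(-3 / np.sqrt(d), 3 / np.sqrt(d), 2 ** bits)
indices = np.abs(y[:, None] - codebook[None, :]).argmin(axis=1)
y_hat = codebook[indices]                          # quantized reconstruction

signs = np.sign(S @ (y - y_hat))                   # Stage 2: 1-bit residual signs

x_hat = Pi.T @ y_hat                               # decompress (Stage 1 only)
print(f"cosine: {np.dot(x, x_hat) / np.linalg.norm(x_hat):.3f}")
```

Even with the crude uniform codebook, the Stage 1 reconstruction already preserves direction well; the Lloyd-Max codebook and QJL correction tighten this further.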
| Variant | What it stores | Best for |
|---|---|---|
| TurboQuant_mse | Stage 1 only (indices) | Reconstruction tasks, lowest memory |
| TurboQuant_prod | Stage 1 + Stage 2 (indices + signs) | Attention/inner product tasks |
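The storage difference between the two variants is simple bit accounting. A sketch, assuming float32 inputs (32 bits per coordinate) and a QJL projection dimension equal to d, so the Prod variant stores one extra bit per coordinate:

```python
def compression_ratio(bits: int, qjl: bool) -> float:
    """Original bits per coordinate over stored bits per coordinate."""
    stored = bits + (1 if qjl else 0)   # quantizer bits, plus 1 sign bit if QJL is on
    return 32 / stored

for b in (2, 3, 4):
    print(f"{b}-bit  MSE: {compression_ratio(b, False):.1f}x  "
          f"Prod: {compression_ratio(b, True):.1f}x")
# → 16.0x/10.7x, 10.7x/8.0x, 8.0x/6.4x
```

These ratios match the compression column in the benchmark table further down.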
git clone https://github.com/ryuketsukami/turboquant-compression.git
cd turboquant-compression
pip install -e .

Or install dependencies directly:

pip install numpy scipy

- Python 3.10+
- NumPy >= 1.24
- SciPy >= 1.10
import numpy as np
from turboquant import TurboQuant, TurboQuantConfig
# Configure: 256-dimensional vectors, 3-bit quantization
config = TurboQuantConfig(dimension=256, bits=3, qjl_enabled=True)
tq = TurboQuant(config)
# Compress a vector
x = np.random.randn(256)
x = x / np.linalg.norm(x) # unit normalize
compressed = tq.compress(x)
reconstructed = tq.decompress(compressed)
print(f"Compression ratio: {tq.compression_ratio():.1f}x")
print(f"MSE: {np.mean((x - reconstructed) ** 2):.6f}")
print(f"Cosine similarity: {np.dot(x, reconstructed) / np.linalg.norm(reconstructed):.4f}")

import numpy as np
from turboquant import TurboQuantKVCache, TurboQuantConfig
config = TurboQuantConfig(dimension=128, bits=3, qjl_enabled=True)
cache = TurboQuantKVCache(config)
# Simulate a sequence of KV pairs
for t in range(64):
key = np.random.randn(128)
value = np.random.randn(128)
cache.append(key / np.linalg.norm(key), value / np.linalg.norm(value))
# Compute attention scores against a query
query = np.random.randn(128)
query = query / np.linalg.norm(query)
scores = cache.attention_scores(query)

# TurboQuant_prod gives unbiased inner product estimates
key = np.random.randn(256)
key = key / np.linalg.norm(key)
compressed_key = tq.compress(key)
query = np.random.randn(256)
query = query / np.linalg.norm(query)
estimated_ip = tq.inner_product(query, compressed_key)
true_ip = np.dot(query, key)
print(f"True: {true_ip:.4f}, Estimated: {estimated_ip:.4f}")

Results on random unit vectors (d=256, n=64, seed=0). Measured on Python 3.14, Windows 11.
| Bits | Variant | MSE | Cosine Sim | IP Correlation | Compression |
|---|---|---|---|---|---|
| 2 | MSE | 0.000461 | 0.9394 | — | 16.0x |
| 2 | Prod (+QJL) | — | — | 0.8402 | 10.7x |
| 3 | MSE | 0.000133 | 0.9829 | — | 10.7x |
| 3 | Prod (+QJL) | — | — | 0.8605 | 8.0x |
| 4 | MSE | 0.000037 | 0.9952 | — | 8.0x |
| 4 | Prod (+QJL) | — | — | 0.8604 | 6.4x |
Note: These are results for d=256. The paper reports near-zero accuracy loss at production dimensions (d=4096+), where the Beta distribution concentrates more tightly and quantization error drops significantly.
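A quick check of why higher dimension helps, using nothing beyond NumPy: coordinates of a random unit vector have standard deviation about 1/sqrt(d), so the range a fixed bit budget must cover shrinks as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
stds = {}
for d in (256, 4096):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)          # uniform random unit vector in R^d
    stds[d] = v.std()
    print(f"d={d:5d}  coord std={stds[d]:.4f}  1/sqrt(d)={1 / np.sqrt(d):.4f}")
```

At d=4096 each coordinate lives in a range four times tighter than at d=256, so the same number of quantization levels yields a proportionally smaller error.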
Run the self-test to reproduce these numbers:
python -m turboquant # as installed package
python turboquant/turboquant.py # directly

Run the test suite:

pytest tests/ -v # 27 tests

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dimension` | `int` | required | Vector dimension (d) |
| `bits` | `int` | `4` | Bit-width for PolarQuant (2-4 typical) |
| `qjl_enabled` | `bool` | `True` | Enable QJL residual correction |
| `qjl_dim` | `int` | `0` | QJL projection dimension (0 = auto = d) |
| `seed` | `int` | `42` | Random seed for reproducibility |
| Method | Description |
|---|---|
| `compress(x)` | Compress a vector; returns a `CompressedVector` |
| `decompress(compressed)` | Reconstruct from PolarQuant (Stage 1) |
| `inner_product(query, compressed_key)` | Unbiased inner product estimate (Stage 1 + 2) |
| `compression_ratio()` | Original bits / compressed bits |
| `compressed_size_bytes()` | Size of one compressed vector in bytes |
| Method | Description |
|---|---|
| `append(key, value)` | Add a KV pair to the compressed cache |
| `attention_scores(query)` | Compute attention logits for all cached keys |
turboquant-compression/
├── turboquant/ # Package
│ ├── __init__.py # Core implementation + public API
│ └── __main__.py # python -m turboquant entry point
├── turboquant.py # Standalone script (self-test + demo)
├── tests/
│ └── test_turboquant.py # 27 tests (pytest)
├── examples/
│ ├── basic_usage.py # Getting started example
│ └── kv_cache_demo.py # KV cache simulation
├── .github/workflows/
│ └── tests.yml # CI — tests on Python 3.10-3.13
├── pyproject.toml # Package configuration
├── requirements.txt # Dependencies
├── LICENSE # MIT License
├── CONTRIBUTING.md # Contribution guidelines
├── SECURITY.md # Security policy
├── CITATION.cff # Citation metadata
└── CHANGELOG.md # Version history
Looking for the agent skill version? See turboquant-skill for a ready-to-use Claude Code / AI agent skill that wraps this implementation.
This implementation is based on:
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 • arXiv:2504.19874
Key theoretical contributions of the paper:
- PolarQuant achieves the rate-distortion bound for unit vectors using random rotation + optimal scalar quantization
- QJL (Quantized Johnson-Lindenstrauss) provides unbiased inner product estimation from 1-bit projections of quantization residuals
- The combination enables 3-bit KV cache compression with zero accuracy loss on standard LLM benchmarks
- Demonstrated 6x memory reduction and 8x faster attention on NVIDIA H100 GPUs
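The QJL idea can be checked numerically. The sketch below uses Gaussian projections rather than the Rademacher ones described above, because the Gaussian identity E[sign(⟨s, r⟩) · ⟨s, q⟩] = sqrt(2/π) · ⟨r, q⟩ / ‖r‖ is exact; it also assumes the residual norm ‖r‖ is stored as a side scalar. This is a self-contained illustration, not this package's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 100_000                 # m is large only to show the estimate converges

r = rng.standard_normal(d)         # stands in for the quantization residual
q = rng.standard_normal(d)         # query vector
S = rng.standard_normal((m, d))    # Gaussian projection rows

signs = np.sign(S @ r)             # stored: 1 bit per projection
# Debias: each term sign(<s_j, r>) * <s_j, q> has mean sqrt(2/pi) * <r, q> / ||r||.
est = np.linalg.norm(r) * np.sqrt(np.pi / 2) * np.mean(signs * (S @ q))
print(f"true <r,q> = {r @ q:.3f}, 1-bit estimate = {est:.3f}")
```

In practice m is small (one bit per projection dimension), so each individual estimate is noisy but unbiased, and errors average out across the many inner products in an attention row.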
| Project | Focus | Comparison |
|---|---|---|
| GPTQ | Weight quantization | Compresses model weights, not KV cache |
| AWQ | Activation-aware weight quantization | Weight-focused, different technique |
| KIVI | KV cache quantization | TurboQuant improves upon KIVI's approach |
| kvpress | KV cache compression toolkit | Multi-method toolkit, different algorithms |
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Areas where contributions are especially valued:
- GPU-accelerated kernels (CUDA/Triton)
- Integration with inference frameworks (vLLM, llama.cpp, TensorRT-LLM)
- Additional quantization bit-widths and mixed-precision support
- Benchmarks on real LLM KV cache data
If you use this implementation in your research, please cite:
@inproceedings{zandieh2026turboquant,
title = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
url = {https://arxiv.org/abs/2504.19874}
}

This project is licensed under the MIT License. See LICENSE for details.
Built with NumPy and SciPy. Inspired by Google Research.