Near-optimal vector quantization for LLM KV cache compression.
3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
A Python implementation of the TurboQuant algorithm (ICLR 2026) by Zandieh et al.
PolarQuant (Lloyd-Max on rotated coordinates) + QJL (1-bit residual correction for unbiased inner products).
- 3-bit KV cache compression with near-optimal distortion rate
- Two-stage pipeline: PolarQuant (MSE-optimal scalar quantization) + QJL (unbiased inner product estimation)
- Lloyd-Max quantizer built on the exact Beta(d/2, d/2) distribution that arises from random rotation
- No calibration data required — works online, token-by-token
- Simple API — compress, decompress, and estimate inner products in a few lines
- KV cache simulator included for attention score benchmarking
- Pure Python + NumPy/SciPy — no custom CUDA kernels, easy to read and extend
TurboQuant compresses high-dimensional vectors (e.g., KV cache entries in transformer attention) in two stages:

Stage 1 (PolarQuant):
- Random rotation: Multiply the input vector by a fixed random orthogonal matrix. This makes each coordinate follow a known Beta(d/2, d/2) distribution.
- Lloyd-Max scalar quantization: Quantize each rotated coordinate independently using a Lloyd-Max quantizer optimized for that Beta distribution. This achieves near-optimal MSE distortion.

Stage 2 (QJL):
- Compute residual: Take the difference between the rotated vector and its quantized reconstruction.
- 1-bit projection: Project the residual onto random Rademacher vectors and store only the signs (1 bit each).
- Inner product correction: At query time, use the stored signs to correct the inner product estimate, making it unbiased.
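The Lloyd-Max step can be sketched as the classic alternation: cell boundaries are midpoints between centroids, and centroids are the conditional means of their cells. The sketch below is illustrative only, run on samples rather than the exact pdf the package fits against; the rescaling of Beta samples to [-1, 1] is an assumption about how the coordinate distribution is parameterized.

```python
import numpy as np

def lloyd_max(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Sample-based Lloyd-Max: alternate boundary and centroid updates."""
    levels = 2 ** bits
    # Initialize the codebook at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2  # nearest-neighbor cells
        cells = np.searchsorted(boundaries, samples)
        for k in range(levels):
            members = samples[cells == k]
            if members.size:
                centroids[k] = members.mean()  # MSE-optimal centroid = cell mean
    return centroids

rng = np.random.default_rng(0)
d = 256
# Coordinate distribution per the description above, rescaled to [-1, 1] (assumption).
coords = rng.beta(d / 2, d / 2, size=100_000) * 2 - 1
codebook = lloyd_max(coords, bits=3)
recon = codebook[np.searchsorted((codebook[:-1] + codebook[1:]) / 2, coords)]
print(f"3-bit MSE: {np.mean((coords - recon) ** 2):.2e}")
```

The resulting per-coordinate MSE is a small fraction of the coordinate variance (about 1/(d+1) for this rescaled Beta), which is what makes a 3-bit budget viable.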
Input vector x ∈ R^d
│
▼
┌─────────────────────┐
│ Rotate: y = Π·x │ (random orthogonal matrix Π)
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Lloyd-Max quantize │ (B bits per coordinate)
│ indices = Q(y) │
└─────────┬───────────┘
▼
┌─────────────────────┐
│ QJL residual signs │ (1 bit per projection dimension)
│ s = sign(S·(y-ŷ)) │
└─────────┬───────────┘
▼
CompressedVector(indices, signs)
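The diagram above maps to a few lines of NumPy. This is a hedged stand-in, not the package's internals: it uses a QR factorization for the random rotation and a simple uniform codebook in place of the Lloyd-Max one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits, m = 256, 3, 256                  # dimension, bit-width, QJL projection dim

Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
S = rng.choice([-1.0, 1.0], size=(m, d))           # Rademacher projection rows

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                             # unit-normalize the input

y = Pi @ x                                         # Stage 1: rotate
# Uniform stand-in codebook spanning ~±3 coordinate standard deviations (1/sqrt(d)).
codebook = np.linspace(-3 / np.sqrt(d), 3 / np.sqrt(d), 2 ** bits)
indices = np.abs(y[:, None] - codebook[None, :]).argmin(axis=1)
y_hat = codebook[indices]                          # quantized reconstruction

signs = np.sign(S @ (y - y_hat))                   # Stage 2: 1-bit residual signs

x_hat = Pi.T @ y_hat                               # decompress (Stage 1 only)
print(f"cosine: {np.dot(x, x_hat) / np.linalg.norm(x_hat):.3f}")
```

Even with the crude uniform codebook, the Stage 1 reconstruction already preserves direction well; the Lloyd-Max codebook and QJL correction tighten this further.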
| Variant | What it stores | Best for |
|---|---|---|
| TurboQuant_mse | Stage 1 only (indices) | Reconstruction tasks, lowest memory |
| TurboQuant_prod | Stage 1 + Stage 2 (indices + signs) | Attention/inner product tasks |
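The storage difference between the two variants is simple bit accounting. A sketch, assuming float32 inputs (32 bits per coordinate) and a QJL projection dimension equal to d, so the Prod variant stores one extra bit per coordinate:

```python
def compression_ratio(bits: int, qjl: bool) -> float:
    """Original bits per coordinate over stored bits per coordinate."""
    stored = bits + (1 if qjl else 0)   # quantizer bits, plus 1 sign bit if QJL is on
    return 32 / stored

for b in (2, 3, 4):
    print(f"{b}-bit  MSE: {compression_ratio(b, False):.1f}x  "
          f"Prod: {compression_ratio(b, True):.1f}x")
# → 16.0x/10.7x, 10.7x/8.0x, 8.0x/6.4x
```

These ratios match the compression column in the benchmark table further down.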
git clone https://github.com/ryuketsukami/turboquant-compression.git
cd turboquant-compression
pip install -e .

Or install dependencies directly:

pip install numpy scipy

- Python 3.10+
- NumPy >= 1.24
- SciPy >= 1.10
import numpy as np
from turboquant import TurboQuant, TurboQuantConfig
# Configure: 256-dimensional vectors, 3-bit quantization
config = TurboQuantConfig(dimension=256, bits=3, qjl_enabled=True)
tq = TurboQuant(config)
# Compress a vector
x = np.random.randn(256)
x = x / np.linalg.norm(x) # unit normalize
compressed = tq.compress(x)
reconstructed = tq.decompress(compressed)
print(f"Compression ratio: {tq.compression_ratio():.1f}x")
print(f"MSE: {np.mean((x - reconstructed) ** 2):.6f}")
print(f"Cosine similarity: {np.dot(x, reconstructed) / np.linalg.norm(reconstructed):.4f}")

import numpy as np
from turboquant import TurboQuantKVCache, TurboQuantConfig
config = TurboQuantConfig(dimension=128, bits=3, qjl_enabled=True)
cache = TurboQuantKVCache(config)
# Simulate a sequence of KV pairs
for t in range(64):
key = np.random.randn(128)
value = np.random.randn(128)
cache.append(key / np.linalg.norm(key), value / np.linalg.norm(value))
# Compute attention scores against a query
query = np.random.randn(128)
query = query / np.linalg.norm(query)
scores = cache.attention_scores(query)

# TurboQuant_prod gives unbiased inner product estimates
key = np.random.randn(256)
key = key / np.linalg.norm(key)
compressed_key = tq.compress(key)
query = np.random.randn(256)
query = query / np.linalg.norm(query)
estimated_ip = tq.inner_product(query, compressed_key)
true_ip = np.dot(query, key)
print(f"True: {true_ip:.4f}, Estimated: {estimated_ip:.4f}")

Results on random unit vectors (d=256, n=64, seed=0). Measured on Python 3.14, Windows 11.
| Bits | Variant | MSE | Cosine Sim | IP Correlation | Compression |
|---|---|---|---|---|---|
| 2 | MSE | 0.000461 | 0.9394 | — | 16.0x |
| 2 | Prod (+QJL) | — | — | 0.8402 | 10.7x |
| 3 | MSE | 0.000133 | 0.9829 | — | 10.7x |
| 3 | Prod (+QJL) | — | — | 0.8605 | 8.0x |
| 4 | MSE | 0.000037 | 0.9952 | — | 8.0x |
| 4 | Prod (+QJL) | — | — | 0.8604 | 6.4x |
Note: These are results for d=256. The paper reports near-zero accuracy loss at production dimensions (d=4096+), where the Beta distribution concentrates more tightly and quantization error drops significantly.
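A quick check of why higher dimension helps, using nothing beyond NumPy: coordinates of a random unit vector have standard deviation about 1/sqrt(d), so the range a fixed bit budget must cover shrinks as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
stds = {}
for d in (256, 4096):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)          # uniform random unit vector in R^d
    stds[d] = v.std()
    print(f"d={d:5d}  coord std={stds[d]:.4f}  1/sqrt(d)={1 / np.sqrt(d):.4f}")
```

At d=4096 each coordinate lives in a range four times tighter than at d=256, so the same number of quantization levels yields a proportionally smaller error.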
Run the self-test to reproduce these numbers:
python -m turboquant # as installed package
python turboquant/turboquant.py # directly

Run the test suite:

pytest tests/ -v # 27 tests

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dimension` | `int` | required | Vector dimension (d) |
| `bits` | `int` | `4` | Bit-width for PolarQuant (2-4 typical) |
| `qjl_enabled` | `bool` | `True` | Enable QJL residual correction |
| `qjl_dim` | `int` | `0` | QJL projection dimension (0 = auto = d) |
| `seed` | `int` | `42` | Random seed for reproducibility |
| Method | Description |
|---|---|
| `compress(x)` | Compress a vector; returns a `CompressedVector` |
| `decompress(compressed)` | Reconstruct from PolarQuant (Stage 1) |
| `inner_product(query, compressed_key)` | Unbiased inner product estimate (Stage 1 + 2) |
| `compression_ratio()` | Original bits / compressed bits |
| `compressed_size_bytes()` | Size of one compressed vector in bytes |
| Method | Description |
|---|---|
| `append(key, value)` | Add a KV pair to the compressed cache |
| `attention_scores(query)` | Compute attention logits for all cached keys |
turboquant-compression/
├── turboquant/ # Package
│ ├── __init__.py # Core implementation + public API
│ └── __main__.py # python -m turboquant entry point
├── turboquant.py # Standalone script (self-test + demo)
├── tests/
│ └── test_turboquant.py # 27 tests (pytest)
├── examples/
│ ├── basic_usage.py # Getting started example
│ └── kv_cache_demo.py # KV cache simulation
├── .github/workflows/
│ └── tests.yml # CI — tests on Python 3.10-3.13
├── pyproject.toml # Package configuration
├── requirements.txt # Dependencies
├── LICENSE # MIT License
├── CONTRIBUTING.md # Contribution guidelines
├── SECURITY.md # Security policy
├── CITATION.cff # Citation metadata
└── CHANGELOG.md # Version history
Looking for the agent skill version? See turboquant-skill for a ready-to-use Claude Code / AI agent skill that wraps this implementation.
This implementation is based on:
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 • arXiv:2504.19874
Key theoretical contributions of the paper:
- PolarQuant achieves the rate-distortion bound for unit vectors using random rotation + optimal scalar quantization
- QJL (Quantized Johnson-Lindenstrauss) provides unbiased inner product estimation from 1-bit projections of quantization residuals
- The combination enables 3-bit KV cache compression with zero accuracy loss on standard LLM benchmarks
- Demonstrated 6x memory reduction and 8x faster attention on NVIDIA H100 GPUs
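The QJL idea can be checked numerically. The sketch below uses Gaussian projections rather than the Rademacher ones described above, because the Gaussian identity E[sign(⟨s, r⟩) · ⟨s, q⟩] = sqrt(2/π) · ⟨r, q⟩ / ‖r‖ is exact; it also assumes the residual norm ‖r‖ is stored as a side scalar. This is a self-contained illustration, not this package's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 100_000                 # m is large only to show the estimate converges

r = rng.standard_normal(d)         # stands in for the quantization residual
q = rng.standard_normal(d)         # query vector
S = rng.standard_normal((m, d))    # Gaussian projection rows

signs = np.sign(S @ r)             # stored: 1 bit per projection
# Debias: each term sign(<s_j, r>) * <s_j, q> has mean sqrt(2/pi) * <r, q> / ||r||.
est = np.linalg.norm(r) * np.sqrt(np.pi / 2) * np.mean(signs * (S @ q))
print(f"true <r,q> = {r @ q:.3f}, 1-bit estimate = {est:.3f}")
```

In practice m is small (one bit per projection dimension), so each individual estimate is noisy but unbiased, and errors average out across the many inner products in an attention row.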
| Project | Focus | Comparison |
|---|---|---|
| GPTQ | Weight quantization | Compresses model weights, not KV cache |
| AWQ | Activation-aware weight quantization | Weight-focused, different technique |
| KIVI | KV cache quantization | TurboQuant improves upon KIVI's approach |
| kvpress | KV cache compression toolkit | Multi-method toolkit, different algorithms |
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Areas where contributions are especially valued:
- GPU-accelerated kernels (CUDA/Triton)
- Integration with inference frameworks (vLLM, llama.cpp, TensorRT-LLM)
- Additional quantization bit-widths and mixed-precision support
- Benchmarks on real LLM KV cache data
If you use this implementation in your research, please cite:
@inproceedings{zandieh2026turboquant,
title = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
url = {https://arxiv.org/abs/2504.19874}
}

This project is licensed under the MIT License. See LICENSE for details.
Built with NumPy and SciPy. Inspired by Google Research.