TurboQuant

Your LLM runs up to 3x faster at long context. When the KV cache fills your VRAM, generation slows to a crawl. TurboQuant compresses the cache, and the speed comes back.

	FP16 (baseline)	TurboQuant 4-bit
Qwen 3B @ 4K context	2.5 tok/s (thrashing)	7.4 tok/s
VRAM saved (3B @ 4K)	--	1 GB
Qwen 7B @ 2K context	1.0 tok/s (swapping)	1.4 tok/s

Drop-in for any HuggingFace model:

from turboquant import TurboQuantCache

# Symmetric: 4-bit keys + 4-bit values
cache = TurboQuantCache(bits=4)

# Asymmetric: 4-bit keys + 2-bit values (less memory; keys keep the higher precision)
cache = TurboQuantCache(key_bits=4, value_bits=2)

# Protect sensitive layers at full FP16 precision
cache = TurboQuantCache(key_bits=4, value_bits=2, protected_layers=[0, 1, -1, -2])

outputs = model(**inputs, past_key_values=cache, use_cache=True)

pip install turboquant

Why this matters

When LLMs generate text, they store key-value pairs for every token. This KV cache grows with context length and eats your VRAM. On a 16 GB GPU running a 3B model, the KV cache alone hits 1.2 GB at 4K tokens — and FP16 starts thrashing.

TurboQuant compresses the cache to 4 bits (from 16) using Google's TurboQuant algorithm (ICLR 2026). No training data, no calibration, works with any model. The result: your GPU has room to breathe, and inference stays fast where it used to choke.
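
As a rough illustration of why the cache grows with context, the sketch below estimates KV cache size from the model shape. The function name and the model shape are hypothetical and chosen only for illustration; the estimate ignores per-vector norms, the FP16 residual window, and allocator overhead, so real measurements will differ.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits=16):
    """Approximate KV cache size in bytes for one sequence."""
    # Two tensors (K and V) per layer, each of shape (num_kv_heads, seq_len, head_dim).
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits // 8

# Hypothetical model shape, not taken from any benchmarked model.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(seq_len=4096, **shape)
q4 = kv_cache_bytes(seq_len=4096, bits=4, **shape)
print(f"FP16: {fp16 / 2**20:.0f} MiB, 4-bit: {q4 / 2**20:.0f} MiB")
```

The size is linear in `seq_len`, which is why doubling the context roughly doubles the cache, and storing 4 bits instead of 16 cuts the index storage to a quarter.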

Install

pip install turboquant

Or from source:

git clone https://github.com/back2matching/turboquant
cd turboquant
pip install -e .

Quick Start

Drop into any HuggingFace model

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Create compressed cache
cache = TurboQuantCache(bits=4)

# Use it like normal
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)

Run the inference server

TurboQuant ships with an OpenAI-compatible inference server. Point any OpenAI client at it.

turboquant-server --model Qwen/Qwen2.5-3B-Instruct --bits 4 --port 8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
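
The same request can be built from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; `build_chat_request` and `send` are hypothetical helper names, and the shape of the server's JSON reply is not assumed here.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=100):
    """Build (but do not send) a chat completion request for the local server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

request = build_chat_request("Hello!")
# reply = send(request)  # requires turboquant-server to be running on port 8000
```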

Use the core algorithms directly

import torch
from turboquant import TurboQuantMSE

# Quantize any vectors (KV cache heads, embeddings, etc.)
tq = TurboQuantMSE(dim=128, bits=4, device='cuda')

# Example input: N vectors of dimension 128 on the same device
vectors = torch.randn(1024, 128, device='cuda')

# Quantize
indices, norms = tq.quantize(vectors)  # vectors: (N, 128)

# Dequantize
vectors_hat = tq.dequantize(indices, norms)

Benchmarks (RTX 4080 16GB)

Independent benchmarks on NVIDIA RTX 4080 (16 GB VRAM), PyTorch 2.5.1, CUDA 12.1. 45 data points across 4 models.

Reproduce:

python benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-3B-Instruct --context "512,1024,2048,4096"
python benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-7B-Instruct --quick  # fast sanity check

Results are saved per-model (benchmarks/results_*.json) and combined (benchmarks/benchmark_results.json).

Qwen2.5-7B-Instruct (14.5 GB model weights)

Context	KV Mode	Peak VRAM	VRAM Saved	Speed (tok/s)	Output Quality
460	FP16	14,833 MB	--	17.7	Coherent
460	TQ 4-bit	14,758 MB	75 MB	23.8	Coherent
460	TQ 3-bit	14,758 MB	75 MB	20.6	Minor artifacts
1860	FP16	16,659 MB	--	1.0	Coherent
1860	TQ 4-bit	16,215 MB	444 MB	1.4	Coherent
1860	TQ 3-bit	16,217 MB	442 MB	1.4	Coherent

At 7B with 1.8K context, FP16 exceeds physical VRAM (16,659 > 16,376 MB) and drops to 1 tok/s from swapping. TQ-4bit saves 444 MB and runs 40% faster in this regime.

Qwen2.5-3B-Instruct — Context Length Sweep (5.9 GB model weights)

Context	KV Mode	Peak VRAM	VRAM Saved	Speed (tok/s)
460	FP16	6,126 MB	--	14.6
460	TQ 4-bit	6,075 MB	51 MB	7.8
930	FP16	6,451 MB	--	14.1
930	TQ 4-bit	6,260 MB	191 MB	7.4
1860	FP16	7,359 MB	--	15.4
1860	TQ 4-bit	6,835 MB	524 MB	15.5
3720	FP16	10,222 MB	--	2.5
3720	TQ 4-bit	9,174 MB	1,048 MB	7.4

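VRAM savings scale with context length: 51 MB at 460 tokens up to 1,048 MB at 3,720 tokens. At 460 and 930 tokens, TQ-4bit runs at roughly half the FP16 speed (7.8 vs 14.6 tok/s at 460 tokens), and the two are on par at 1,860 tokens. At 3,720 tokens, FP16 hits memory pressure (2.5 tok/s) while TQ-4bit with nibble packing runs at 7.4 tok/s, which is 196% faster.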

Qwen2.5-0.5B-Instruct — Long Context (942 MB model weights)

Context	FP16 Peak	TQ 4-bit Peak	VRAM Saved	FP16 Speed	TQ 4-bit Speed
460	1,144 MB	1,104 MB	40 MB	44.3	30.5
930	1,417 MB	1,262 MB	155 MB	46.1	30.3
1860	2,189 MB	1,669 MB	520 MB	41.7	29.1
3720	4,654 MB	3,621 MB	1,033 MB	31.9	26.5
7440	13,265 MB	11,195 MB	2,070 MB	17.8	19.8

At 8K context, TQ-4bit saves 2 GB of VRAM and is 11% faster than FP16. 16K OOM'd for all modes on 16 GB.

StableLM-2-1.6B — Cross-Architecture (3.1 GB model weights)

Context	FP16 Peak	TQ 4-bit Peak	VRAM Diff	FP16 Speed	TQ 4-bit Speed
460	3,433 MB	3,488 MB	+55 MB	68.9	36.7
930	3,724 MB	3,894 MB	+170 MB	68.2	34.8
1860	4,302 MB	4,700 MB	+398 MB	61.4	34.7
3720	5,459 MB	6,318 MB	+859 MB	56.1	33.1

On StableLM, TQ uses more VRAM than FP16 at every context length. The StableLM results were collected with v0.1.0 (dequantized storage). v0.2.0 stores compressed indices and may show different results on StableLM.

Key Takeaways

VRAM savings scale linearly with context length. At short contexts (<512 tokens), savings are minimal. At 4K tokens, savings exceed 1 GB. At 8K, savings reach 2 GB.
Under memory pressure, TQ is significantly faster than FP16. At 4K context on 3B, FP16 drops to 2.5 tok/s while TQ-4bit runs at 7.4 tok/s (196% faster). At 8K on 0.5B, TQ is 11% faster. Without memory pressure, TQ is slower than FP16 (7.8 vs 14.6 tok/s on 3B at 460 tokens).
v0.2.0 stores compressed indices. Cache uses uint8 indices + float32 norms instead of dequantized FP16. Real compression with on-the-fly dequantization.
Output quality is good at 4-bit on 3B+ models. Qwen 3B and 7B produce coherent code. On 0.5B, TQ output sometimes degrades to filler repetition — small models are more sensitive to quantization noise.

Algorithm Verification

Bits	MSE	Theoretical Bound	Compression
1	0.362	0.680	12.8x
2	0.129	0.170	7.1x
3	0.049	0.043	4.9x
4	0.020	0.011	3.8x
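
The bound and compression columns can be reproduced with a few lines of arithmetic. This sketch rests on two assumptions that are not stated in the table itself: that the Theoretical Bound column is the MSE upper bound (sqrt(3) * pi / 2) * 4^(-bits) for unit-norm vectors, and that compression is measured per 128-dimensional vector as FP16 bytes divided by bit-packed indices plus one float32 norm. The function names are hypothetical.

```python
import math

def mse_upper_bound(bits):
    """Assumed upper bound: (sqrt(3) * pi / 2) * 4**(-bits), for unit-norm vectors."""
    return math.sqrt(3) * math.pi / 2 * 4.0 ** (-bits)

def compression_ratio(dim, bits, norm_bytes=4, fp16_bytes=2):
    """FP16 bytes per vector divided by packed-index bytes plus one norm."""
    return (dim * fp16_bytes) / (dim * bits / 8 + norm_bytes)

for bits in (1, 2, 3, 4):
    print(bits, round(mse_upper_bound(bits), 3), round(compression_ratio(128, bits), 1))
```

Under these assumptions the loop matches both columns of the table. The measured MSE sits below the bound at 1 and 2 bits and above it at 3 and 4 bits.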

How It Works

TurboQuant uses three ideas from the paper, plus community-validated optimizations:

Random rotation: Multiply each KV vector by a random orthogonal matrix. This spreads the information evenly across all coordinates, making them nearly independent.
Optimal codebook: Each coordinate now follows a predictable Beta distribution. We compute the mathematically optimal quantization levels for this distribution. No training data needed.
Residual window: The most recent 128 tokens stay in full FP16 precision. Only older tokens get compressed. This preserves quality for the tokens attention focuses on most.

v0.3.0 additions (adopted from community findings across 11 TurboQuant implementations):

Asymmetric K/V allocation: Keys need more bits than values — K/V norm disparity can exceed 1000x. Default: 4-bit keys + 2-bit values for the best quality/memory tradeoff.
Layer-adaptive precision: First and last transformer layers are most sensitive. protected_layers=[0, 1, -1, -2] keeps them at full FP16 while compressing middle layers.
MSE-only quantization: Six independent teams confirmed QJL (Algorithm 2 from the paper) hurts attention quality. We use MSE-optimal quantization only (Algorithm 1). TurboQuantIP is deprecated.
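
To get a feel for what asymmetric bits and protected layers cost in memory, the sketch below computes the mean number of bits per cached element. `average_bits` is a hypothetical helper, not part of the package; it assumes negative layer indices count from the end (as in Python lists) and ignores per-vector norms and the FP16 residual window.

```python
def average_bits(num_layers, key_bits, value_bits, protected_layers=(), full_bits=16):
    """Mean bits per cached element across layers (norms and FP16 window ignored)."""
    # Resolve negative indices the way Python list indexing does (assumption).
    protected = {i % num_layers for i in protected_layers}
    # K and V hold the same number of elements, so a compressed layer averages the two.
    compressed_bits = (key_bits + value_bits) / 2
    total = sum(full_bits if layer in protected else compressed_bits
                for layer in range(num_layers))
    return total / num_layers

# Hypothetical 32-layer model with the settings quoted above.
print(average_bits(32, key_bits=4, value_bits=2, protected_layers=[0, 1, -1, -2]))
```

For this hypothetical 32-layer model the result is 4.625 bits per element: 28 layers at an average of 3 bits plus 4 protected layers at 16 bits. Protecting layers is therefore cheap on deep models and proportionally more expensive on shallow ones.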

The rotation is computed once (not per-token) and the codebook is derived analytically. No calibration, no fine-tuning, works with any model out of the box.
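
The rotation and codebook steps can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: it uses a small dimension, builds the rotation with Gram-Schmidt, and swaps in an empirical 1-D Lloyd iteration on sampled coordinates in place of the analytically derived codebook. All function names are hypothetical.

```python
import math
import random

def random_rotation(dim, rng):
    """Random orthogonal matrix (as a list of rows) via Gram-Schmidt."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # subtract the components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def nearest(centers, x):
    """Index of the codebook level closest to x."""
    return min(range(len(centers)), key=lambda k: abs(centers[k] - x))

def lloyd_codebook(samples, levels, iters=15):
    """1-D Lloyd iteration on samples (empirical stand-in for the analytic codebook)."""
    s = sorted(samples)
    centers = [s[(2 * i + 1) * len(s) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for x in s:
            k = nearest(centers, x)
            sums[k] += x
            counts[k] += 1
        centers = [sums[k] / counts[k] if counts[k] else centers[k]
                   for k in range(levels)]
    return centers

def quantize(v, rotation, centers):
    """Return (codebook indices, norm) for one vector."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    rotated = [sum(r * a for r, a in zip(row, v)) / norm for row in rotation]
    return [nearest(centers, x) for x in rotated], norm

def dequantize(indices, norm, rotation, centers):
    """Look up levels, undo the rotation (transpose), and restore the norm."""
    y = [centers[k] for k in indices]
    dim = len(y)
    return [norm * sum(rotation[i][j] * y[i] for i in range(dim)) for j in range(dim)]

rng = random.Random(0)
dim, bits = 32, 4
rotation = random_rotation(dim, rng)  # computed once, reused for every vector

# Fit the codebook on coordinates of random unit vectors.
samples = []
for _ in range(200):
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(a * a for a in g))
    samples.extend(a / n for a in g)
centers = lloyd_codebook(samples, levels=2 ** bits)

v = [float(i % 5) - 1.5 for i in range(dim)]
indices, norm = quantize(v, rotation, centers)
v_hat = dequantize(indices, norm, rotation, centers)
rel_mse = sum((a - b) ** 2 for a, b in zip(v, v_hat)) / sum(a * a for a in v)
print(f"relative squared error at {bits} bits: {rel_mse:.4f}")
```

Only the indices and one norm per vector need to be stored; the rotation and the codebook are shared across all vectors, which is what keeps the per-token overhead small.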

When to Use This

Good fit:

You're running long contexts (8K+ tokens) on a VRAM-constrained GPU
You're serving multiple users and need to fit more KV caches in memory
You want to run a bigger model by freeing VRAM from KV cache
Standard transformer models (Llama, Mistral, Qwen2.5)

Not a good fit:

Very short contexts (< 1K tokens) where KV cache is tiny anyway
Hybrid architectures with recurrent layers (Qwen3.5, Mamba) that already have small KV caches
Tasks requiring exact bit-level precision (use FP16)
3-bit on models smaller than 8B (quality degrades noticeably)

Comparison with Alternatives

Method	Where It Runs	Bits	Setup
TurboQuant	Any HuggingFace model	3-4	`pip install turboquant`
Ollama q8_0 KV	Ollama only	8	`OLLAMA_KV_CACHE_TYPE=q8_0`
Ollama q4_0 KV	Ollama only	4	`OLLAMA_KV_CACHE_TYPE=q4_0`
vLLM FP8 KV	vLLM only	8	`kv_cache_dtype="fp8"`
KIVI	Research code	2	Not pip-installable

TurboQuant is the only pip-installable sub-8-bit KV cache compression that works with any HuggingFace model.

llama.cpp Integration

A TQ4_0 KV cache type was proposed for llama.cpp:

PR: ggml-org/llama.cpp#20995 (closed — premature, multiple competing implementations in progress)
Usage (if built from branch): --cache-type-k tq4_0 --cache-type-v f16 --no-kv-offload
Status: Multiple community implementations in progress. Google's official code expected Q2 2026.

Paper

This implements the algorithm from:

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 | arXiv:2504.19874

This is an independent implementation, not affiliated with Google Research.

License

Apache 2.0
