Releases: back2matching/turboquant
v0.1.0 - First release
First open-source implementation of Google's TurboQuant KV cache compression (ICLR 2026).
What's included
- TurboQuant algorithms (MSE + inner-product optimal) from the paper
- HuggingFace DynamicCache drop-in with KIVI-style residual window
- OpenAI-compatible inference server (`turboquant-server`)
- Benchmarks on RTX 4080 (first consumer-GPU TurboQuant results anywhere)
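To illustrate the KIVI-style residual window mentioned above: older KV entries are stored quantized to 4 bits, while the most recent `window` tokens are kept in full precision. The NumPy sketch below is illustrative only, not the library's actual implementation; the function names and the simple per-channel asymmetric uniform quantizer are assumptions made for the example.

```python
import numpy as np

def quantize_4bit(x):
    """Per-channel asymmetric uniform 4-bit quantization (illustrative)."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / 15.0                  # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant channels
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

def compress_kv(keys, window=32):
    """KIVI-style split: quantize old tokens, keep a full-precision residual window."""
    old, recent = keys[:-window], keys[-window:]
    return quantize_4bit(old), recent

keys = np.random.randn(128, 64).astype(np.float32)  # (tokens, head_dim)
(q, scale, lo), residual = compress_kv(keys, window=32)
# Reconstruct the cache: dequantized old tokens + exact recent tokens
recon = np.vstack([dequantize_4bit(q, scale, lo), residual])
max_err = np.abs(recon - keys).max()
```

The residual window is exact by construction, so quantization error only affects tokens outside the window.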
Install
```
pip install turboquant
```
Quick start
```python
from turboquant import TurboQuantCache

cache = TurboQuantCache(bits=4)
outputs = model.generate(..., past_key_values=cache)
```
Benchmarks (Qwen2.5-3B on RTX 4080 16GB)
| KV Mode | Peak VRAM | Speed | Quality |
|---|---|---|---|
| FP16 (baseline) | 6,922 MB | 28 tok/s | Perfect |
| TurboQuant 4-bit | 6,448 MB (-474 MB) | 17 tok/s | Good |
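Because the bundled server is OpenAI-compatible, any OpenAI-style client should work against it. The sketch below builds a standard chat-completions request; the host, port, and model name are assumptions for illustration, so check your `turboquant-server` startup output for the actual values.

```python
import json
import urllib.request

# Assumed defaults below: adjust host, port, and model to match your setup.
payload = {
    "model": "Qwen/Qwen2.5-3B-Instruct",  # assumption: whichever model the server loaded
    "messages": [
        {"role": "user", "content": "Summarize TurboQuant in one sentence."}
    ],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumption: default host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and read the reply:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```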