Releases: back2matching/turboquant

v0.1.0 - First release

25 Mar 07:04

First open-source implementation of Google's TurboQuant KV cache compression (ICLR 2026).

What's included

  • TurboQuant algorithms (MSE + inner-product optimal) from the paper
  • HuggingFace DynamicCache drop-in with KIVI-style residual window
  • OpenAI-compatible inference server (turboquant-server)
  • Benchmarks on an RTX 4080 (first published TurboQuant results on a consumer GPU)
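To make the "KIVI-style residual window" concrete: the most recent tokens are kept in full precision while older KV entries are quantized. The sketch below illustrates that split with a plain uniform 4-bit quantizer; it is not the package's internals and not the paper's MSE-optimal scheme, and all function names are illustrative.

```python
import numpy as np

def quantize_4bit(x):
    # Uniform per-row 4-bit quantization (illustrative only; TurboQuant
    # itself uses the MSE / inner-product-optimal schemes from the paper).
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0  # 16 levels for 4 bits
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

def residual_window_split(keys, window=2):
    # KIVI-style split: quantize everything except the last `window`
    # tokens, which stay in full precision as the residual window.
    old, recent = keys[:-window], keys[-window:]
    q, scale, lo = quantize_4bit(old)
    return dequantize_4bit(q, scale, lo), recent

keys = np.random.randn(8, 64).astype(np.float32)  # (tokens, head_dim)
deq_old, recent = residual_window_split(keys, window=2)
```

The recent tokens round-trip exactly (they are never quantized), while the older tokens come back with a small per-element quantization error.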

Install

pip install turboquant

Quick start

from turboquant import TurboQuantCache

# 4-bit quantized KV cache; drop-in replacement for HuggingFace's DynamicCache
cache = TurboQuantCache(bits=4)
outputs = model.generate(..., past_key_values=cache)

Benchmarks (Qwen2.5-3B on RTX 4080 16GB)

| KV Mode          | Peak VRAM          | Speed    | Quality |
| ---------------- | ------------------ | -------- | ------- |
| FP16 (baseline)  | 6,922 MB           | 28 tok/s | Perfect |
| TurboQuant 4-bit | 6,448 MB (-474 MB) | 17 tok/s | Good    |
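As a rough sanity check on the VRAM numbers, the KV cache occupies 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes-per-value. The sketch below uses illustrative shapes, not Qwen2.5-3B's actual config:

```python
def kv_cache_mb(layers, kv_heads, head_dim, seq_len, bits):
    # 2 tensors (K and V) per layer; bits/8 bytes per stored value
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bits / 8
    return total_bytes / (1024 ** 2)

# Illustrative GQA config (assumed, not the real Qwen2.5-3B shapes)
fp16_mb = kv_cache_mb(layers=36, kv_heads=2, head_dim=128, seq_len=4096, bits=16)
q4_mb = kv_cache_mb(layers=36, kv_heads=2, head_dim=128, seq_len=4096, bits=4)
```

The ideal ratio is 4× on the cache itself; the measured saving (474 MB out of 6,922 MB peak) is smaller because only the KV cache shrinks, not the weights, and the quantization scales plus the FP16 residual window add overhead.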

Paper: https://arxiv.org/abs/2504.19874