OpenAI-compatible LLM inference with TurboQuant KV cache compression.
Default: Llama-3.1-70B-Instruct at 3.5-bit compression (~4.57× KV memory reduction,
quality-neutral per the paper's LongBench results).
Built by Eviox Tech — HPC & GPU infrastructure.
| | Full-precision | TurboQuant 3.5-bit | TurboQuant 2.5-bit |
|---|---|---|---|
| KV cache per token | 16 bits/ch | ~3.5 bits/ch | ~2.5 bits/ch |
| Compression ratio | 1× | 4.57× | 6.4× |
| Quality (LongBench) | baseline | ≈ baseline | −0.6 avg pts |
| Equiv. context @ 80GB A100 | ~32k | ~146k | ~205k |
| Indexing time (d=1536) | — | 0.001s (vs 240s PQ) | 0.001s |
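The ratio and equivalent-context columns follow directly from the bit widths. A quick arithmetic check (assumes the ~32k full-precision baseline above and that the KV cache dominates long-context memory):

```python
# Compression ratio = full-precision bits / quantized bits; equivalent context
# scales linearly with the ratio.
FP_BITS = 16.0
BASELINE_CTX = 32_000  # ~32k tokens at full precision on an 80GB A100

for bits in (3.5, 2.5):
    ratio = FP_BITS / bits
    print(f"{bits} bits: {ratio:.2f}x compression, ~{BASELINE_CTX * ratio / 1000:.0f}k tokens")
# → 3.5 bits: 4.57x compression, ~146k tokens
# → 2.5 bits: 6.40x compression, ~205k tokens
```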
```
turboquant-server/
├── app/
│   ├── server.py               # FastAPI — all HTTP endpoints
│   ├── engine.py               # TurboQuantEngine — model load, generation
│   └── config.py               # ServerConfig (env-driven)
├── docker/
│   ├── Dockerfile
│   ├── docker-compose.yml
│   ├── prometheus.yml
│   └── grafana/
│       └── provisioning/       # auto-provisioned dashboard
├── scripts/
│   └── run_benchmark.py        # CLI benchmark tool
├── tests/
│   └── smoke_test.py
├── turboquant_corrected.py     # TurboQuant core (corrected implementation)
└── requirements.txt
```
- Docker + NVIDIA Container Toolkit
- HuggingFace token with access to `meta-llama/Llama-3.1-70B-Instruct`
- 2× A100 80GB (or 4× A100 40GB) for 70B in float16
```bash
export HF_TOKEN=hf_your_token_here
export API_KEY=your_secret_key   # optional bearer auth
cd docker
docker compose up -d
```

First startup: ~10–15 min (model download + codebook precomputation).
Subsequent starts: ~3–5 min (model load from cache).
```bash
# Health
curl http://localhost:8000/health

# Ready (waits for model)
curl http://localhost:8000/ready

# Models
curl http://localhost:8000/v1/models
```

OpenAI-compatible. Drop-in replacement — change only the base URL.
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-turbo",
    "messages": [
      {"role": "system", "content": "You are a GPFS expert."},
      {"role": "user", "content": "Explain stripe width in GPFS."}
    ],
    "max_tokens": 256,
    "kv_bits": 3.5
  }'
```

TurboQuant extension: `kv_bits` (3.5 or 2.5) overrides the server default per request.
The response includes `x_turboquant.kv_compression_ratio` in the body.
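Since the API is OpenAI-compatible, the official Python SDK works unchanged. A minimal sketch (assumes `pip install openai` and a running server; the API key is a placeholder, and `kv_bits` travels via `extra_body` because it is a TurboQuant extension rather than a standard field):

```python
from openai import OpenAI

# Point the standard OpenAI client at the TurboQuant server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_secret_key")

resp = client.chat.completions.create(
    model="llama-3.1-70b-turbo",
    messages=[{"role": "user", "content": "Explain stripe width in GPFS."}],
    max_tokens=256,
    extra_body={"kv_bits": 2.5},  # per-request TurboQuant override
)
print(resp.choices[0].message.content)
```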
```bash
curl http://localhost:8000/v1/completions \
  -d '{"model":"llama-3.1-70b-turbo","prompt":"GPFS stripe width is","max_tokens":50}'
```

Run a structured compression benchmark:
```bash
curl http://localhost:8000/v1/benchmark \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_tokens": 32768,
    "max_new_tokens": 128,
    "bits_to_compare": [2.5, 3.5],
    "runs_per_config": 3,
    "task": "needle"
  }'
```

Tasks: `needle` (retrieval), `summarize`, `qa`.
Or use the CLI:

```bash
python scripts/run_benchmark.py \
  --url http://localhost:8000 \
  --task needle \
  --prompt-tokens 32768 \
  --bits 2.5 3.5 \
  --runs 3 \
  --output results.json
```

Prometheus metrics:
| Metric | Description |
|---|---|
| `turboquant_requests_total` | Requests by endpoint / model / bits |
| `turboquant_request_latency_seconds` | End-to-end latency histogram |
| `turboquant_tokens_per_second` | Generation throughput histogram |
| `turboquant_tokens_generated_total` | Total tokens generated |
| `turboquant_kv_compression_ratio` | Static compression ratio by bits |
| `turboquant_gpu_memory_used_gb` | Live GPU memory per device |
| `turboquant_active_requests` | Concurrency gauge |
| `turboquant_context_length_tokens` | Input context distribution |
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)

Pre-provisioned dashboard: throughput, latency, GPU memory, context length.
All settings via environment variables (see docker-compose.yml):
| Variable | Default | Notes |
|---|---|---|
| `MODEL_NAME` | `meta-llama/Llama-3.1-70B-Instruct` | HF model ID or local path |
| `MODEL_DTYPE` | `float16` | `float16` or `bfloat16` |
| `DEVICE_MAP` | `auto` | `auto`, `balanced`, `cuda:0` |
| `DEFAULT_BITS` | `3.5` | `2.5` or `3.5` |
| `N_OUTLIERS` | `32` | Outlier channels per head |
| `MAX_CONTEXT_LENGTH` | `131072` | Truncation limit |
| `API_KEY` | (empty) | Bearer token auth |
| `HF_TOKEN` | (required) | HuggingFace token |
```bash
python tests/smoke_test.py --url http://localhost:8000
```

Why single worker?
The model is loaded once into GPU VRAM. Multiple uvicorn workers would each
load a separate copy, exhausting memory. Concurrency is handled via asyncio
with a generation lock that queues requests — appropriate for GPU inference
where the bottleneck is compute, not I/O.
Calibration
On startup, a short GPFS-domain calibration pass identifies outlier channels
per layer and head. These are fixed for all subsequent inference, ensuring
consistent quant/dequant channel assignments across sequence positions.
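A minimal sketch of what such a selection could look like (assumed criterion: top-`N_OUTLIERS` channels by mean absolute activation; the actual logic lives in `turboquant_corrected.py`):

```python
def select_outlier_channels(activations, n_outliers=32):
    """Pick the channels with the largest mean |activation| from calibration data.

    activations: one list of sampled values per channel.
    Returns a fixed, sorted index list reused for all subsequent inference.
    """
    magnitudes = [sum(abs(v) for v in ch) / len(ch) for ch in activations]
    ranked = sorted(range(len(magnitudes)), key=magnitudes.__getitem__, reverse=True)
    return sorted(ranked[:n_outliers])

# Toy example: channel 2 has by far the largest magnitude, so it is kept aside.
print(select_outlier_channels([[0.1, -0.2], [0.3, 0.1], [5.0, -4.0], [0.2, 0.2]],
                              n_outliers=1))
# → [2]
```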
Codebook caching
Lloyd-Max codebooks are computed once (several minutes, CPU-bound) and saved
to /data/turbo_codebooks.pt. Subsequent starts load from disk in <1s.
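The caching itself is the standard compute-or-load idiom; a stdlib sketch with `pickle` (the server uses `torch.save` to `/data/turbo_codebooks.pt`):

```python
import pickle
from pathlib import Path

def load_or_build_codebooks(cache_path, build_fn):
    """Load precomputed codebooks from disk, or build and cache them once."""
    cache = Path(cache_path)
    if cache.exists():
        return pickle.loads(cache.read_bytes())   # fast path: load in <1s
    codebooks = build_fn()                        # slow path: minutes, CPU-bound
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_bytes(pickle.dumps(codebooks))
    return codebooks
```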
Built on TurboQuant (Zandieh et al., 2025).
Infrastructure by Eviox Tech — HPC & GPU infrastructure.