
# TurboQuant Inference Server

OpenAI-compatible LLM inference with TurboQuant KV cache compression.
Default: Llama-3.1-70B-Instruct at 3.5-bit compression (~4.57× KV memory reduction, quality-neutral per the paper's LongBench results).

Built by Eviox Tech — HPC & GPU infrastructure.


## What it does

|                             | Full-precision | TurboQuant 3.5-bit    | TurboQuant 2.5-bit |
|-----------------------------|----------------|-----------------------|--------------------|
| KV cache per token          | 16 bits/ch     | ~3.5 bits/ch          | ~2.5 bits/ch       |
| Compression ratio           | 1×             | 4.57×                 | 6.4×               |
| Quality (LongBench)         | baseline       | ≈ baseline            | −0.6 avg pts       |
| Equiv. context @ 80GB A100  | ~32k           | ~146k                 | ~205k              |
| Indexing time (d=1536)      |                | 0.001s (vs 240s PQ)   | 0.001s             |
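The ratio and context columns follow directly from the bit widths; a quick sanity check in Python (the ~32k fp16 baseline is taken from the table above):

```python
# Back-of-the-envelope check of the table above: the compression ratio
# is simply fp16 bits divided by quantized bits per channel, and the
# equivalent context scales linearly with that ratio.

FP16_BITS = 16.0

def compression_ratio(kv_bits: float) -> float:
    """Ratio of fp16 KV-cache size to quantized KV-cache size."""
    return FP16_BITS / kv_bits

def equivalent_context(base_context: int, kv_bits: float) -> int:
    """Context length that fits in the same memory after compression."""
    return round(base_context * compression_ratio(kv_bits))

print(round(compression_ratio(3.5), 2))   # 4.57
print(compression_ratio(2.5))             # 6.4
print(equivalent_context(32_000, 3.5))    # 146286  (~146k)
print(equivalent_context(32_000, 2.5))    # 204800  (~205k)
```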

## Stack

```
turboquant-server/
├── app/
│   ├── server.py          # FastAPI — all HTTP endpoints
│   ├── engine.py          # TurboQuantEngine — model load, generation
│   └── config.py          # ServerConfig (env-driven)
├── docker/
│   ├── Dockerfile
│   ├── docker-compose.yml
│   ├── prometheus.yml
│   └── grafana/
│       └── provisioning/  # auto-provisioned dashboard
├── scripts/
│   └── run_benchmark.py   # CLI benchmark tool
├── tests/
│   └── smoke_test.py
├── turboquant_corrected.py # TurboQuant core (corrected implementation)
└── requirements.txt
```

## Quick Start

### 1. Prerequisites

- Docker + NVIDIA Container Toolkit
- HuggingFace token with access to meta-llama/Llama-3.1-70B-Instruct
- 2× A100 80GB (or 4× A100 40GB) for 70B in float16

### 2. Launch

```bash
export HF_TOKEN=hf_your_token_here
export API_KEY=your_secret_key        # optional bearer auth

cd docker
docker compose up -d
```

First startup: ~10–15 min (model download + codebook precomputation).
Subsequent starts: ~3–5 min (model load from cache).

### 3. Verify

```bash
# Health
curl http://localhost:8000/health

# Ready (waits for model)
curl http://localhost:8000/ready

# Models
curl http://localhost:8000/v1/models
```

## API Endpoints

### POST /v1/chat/completions

OpenAI-compatible. Drop-in replacement — change only the base URL.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-turbo",
    "messages": [
      {"role": "system",  "content": "You are a GPFS expert."},
      {"role": "user",    "content": "Explain stripe width in GPFS."}
    ],
    "max_tokens": 256,
    "kv_bits": 3.5
  }'
```

TurboQuant extension: `kv_bits` (3.5 or 2.5) overrides the server default per request.
The response includes `x_turboquant.kv_compression_ratio` in the body.
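The same request can be made from Python with only the standard library; this is a minimal sketch mirroring the curl example above (add an `Authorization: Bearer` header if `API_KEY` is set):

```python
# Minimal stdlib client for /v1/chat/completions. The kv_bits field
# and x_turboquant response block are the TurboQuant extensions
# described above; URL and model name are the server defaults.
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_chat_request(messages, kv_bits=3.5, max_tokens=256,
                       model="llama-3.1-70b-turbo"):
    """Assemble an OpenAI-style request body with the per-request
    kv_bits override."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "kv_bits": kv_bits,
    }

def chat(messages, **kwargs):
    """POST the request and return the parsed JSON response."""
    body = json.dumps(build_chat_request(messages, **kwargs)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    out = chat([{"role": "user", "content": "Explain stripe width in GPFS."}],
               kv_bits=2.5)
    print(out["choices"][0]["message"]["content"])
    print(out["x_turboquant"]["kv_compression_ratio"])
```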

### POST /v1/completions

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.1-70b-turbo","prompt":"GPFS stripe width is","max_tokens":50}'
```

### POST /v1/benchmark

Run a structured compression benchmark:

```bash
curl http://localhost:8000/v1/benchmark \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_tokens":    32768,
    "max_new_tokens":   128,
    "bits_to_compare":  [2.5, 3.5],
    "runs_per_config":  3,
    "task":             "needle"
  }'
```

Tasks: `needle` (retrieval), `summarize`, `qa`.

Or use the CLI:

```bash
python scripts/run_benchmark.py \
  --url http://localhost:8000 \
  --task needle \
  --prompt-tokens 32768 \
  --bits 2.5 3.5 \
  --runs 3 \
  --output results.json
```

### GET /metrics

Prometheus metrics:

| Metric | Description |
|---|---|
| `turboquant_requests_total` | Requests by endpoint / model / bits |
| `turboquant_request_latency_seconds` | End-to-end latency histogram |
| `turboquant_tokens_per_second` | Generation throughput histogram |
| `turboquant_tokens_generated_total` | Total tokens generated |
| `turboquant_kv_compression_ratio` | Static compression ratio by bits |
| `turboquant_gpu_memory_used_gb` | Live GPU memory per device |
| `turboquant_active_requests` | Concurrency gauge |
| `turboquant_context_length_tokens` | Input context distribution |
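For ad-hoc checks without a Prometheus server, the `turboquant_*` series can be pulled straight out of the `/metrics` text. A small stdlib parser, assuming the standard Prometheus exposition format (one sample per line, `# HELP`/`# TYPE` comments):

```python
# Filter TurboQuant series out of a Prometheus text-format /metrics
# dump. Fetching the text itself is left to curl or urllib.

def turboquant_samples(metrics_text: str) -> dict[str, str]:
    """Map each turboquant_* sample line (name plus labels) to its value."""
    samples = {}
    for line in metrics_text.splitlines():
        if line.startswith("turboquant_") and " " in line:
            name, value = line.rsplit(" ", 1)
            samples[name] = value
    return samples

example = """\
# HELP turboquant_active_requests Concurrency gauge
# TYPE turboquant_active_requests gauge
turboquant_active_requests 2
turboquant_kv_compression_ratio{bits="3.5"} 4.57
"""
print(turboquant_samples(example))
```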

## Observability

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)
  Pre-provisioned dashboard: throughput, latency, GPU memory, context length.

## Configuration

All settings via environment variables (see docker-compose.yml):

| Variable | Default | Notes |
|---|---|---|
| `MODEL_NAME` | meta-llama/Llama-3.1-70B-Instruct | HF model ID or local path |
| `MODEL_DTYPE` | float16 | float16 or bfloat16 |
| `DEVICE_MAP` | auto | auto, balanced, cuda:0 |
| `DEFAULT_BITS` | 3.5 | 2.5 or 3.5 |
| `N_OUTLIERS` | 32 | Outlier channels per head |
| `MAX_CONTEXT_LENGTH` | 131072 | Truncation limit |
| `API_KEY` | (empty) | Bearer token auth |
| `HF_TOKEN` | (required) | HuggingFace token |
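A hedged sketch of what an env-driven config along these lines can look like (field names and defaults come from the table above; the actual `ServerConfig` in app/config.py may differ in detail):

```python
# Illustrative env-driven config mirroring the table above; NOT the
# real app/config.py. Values are read from the environment at
# instantiation time, falling back to the documented defaults.
import os
from dataclasses import dataclass, field

@dataclass
class ServerConfig:
    model_name: str = field(default_factory=lambda: os.environ.get(
        "MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct"))
    model_dtype: str = field(default_factory=lambda: os.environ.get(
        "MODEL_DTYPE", "float16"))
    device_map: str = field(default_factory=lambda: os.environ.get(
        "DEVICE_MAP", "auto"))
    default_bits: float = field(default_factory=lambda: float(
        os.environ.get("DEFAULT_BITS", "3.5")))
    n_outliers: int = field(default_factory=lambda: int(
        os.environ.get("N_OUTLIERS", "32")))
    max_context_length: int = field(default_factory=lambda: int(
        os.environ.get("MAX_CONTEXT_LENGTH", "131072")))
    api_key: str = field(default_factory=lambda: os.environ.get("API_KEY", ""))

cfg = ServerConfig()
print(cfg.default_bits, cfg.max_context_length)  # 3.5 131072 with no overrides
```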

## Smoke Tests

```bash
python tests/smoke_test.py --url http://localhost:8000
```

## Architecture notes

### Why single worker?

The model is loaded once into GPU VRAM. Multiple uvicorn workers would each load a separate copy, exhausting memory. Concurrency is handled via asyncio with a generation lock that queues requests, which is appropriate for GPU inference where the bottleneck is compute, not I/O.

### Calibration

On startup, a short GPFS-domain calibration pass identifies outlier channels per layer and head. These are fixed for all subsequent inference, ensuring consistent quant/dequant channel assignments across sequence positions.
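Outlier channels of this kind are commonly selected by activation magnitude over a calibration batch; a hedged, pure-Python sketch (the real statistic used in engine.py may differ):

```python
# Hedged sketch of outlier-channel selection: rank channels by mean
# absolute activation over calibration positions and keep the top
# n_outliers. Illustrative only, not the actual calibration code.

def pick_outlier_channels(activations: list[list[float]],
                          n_outliers: int) -> list[int]:
    """activations[t][c] = value of channel c at sequence position t.
    Returns the indices of the n_outliers channels with the largest
    mean absolute value, sorted ascending for stable assignment."""
    n_channels = len(activations[0])
    mean_abs = [
        sum(abs(row[c]) for row in activations) / len(activations)
        for c in range(n_channels)
    ]
    ranked = sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)
    return sorted(ranked[:n_outliers])

acts = [[0.1, 9.0, 0.2, 5.0],
        [0.2, 8.0, 0.1, 6.0]]
print(pick_outlier_channels(acts, 2))  # [1, 3]
```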

### Codebook caching

Lloyd-Max codebooks are computed once (several minutes, CPU-bound) and saved to /data/turbo_codebooks.pt. Subsequent starts load from disk in <1s.
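The compute-once, load-from-disk pattern looks roughly like this (sketched with pickle for self-containment; the server itself writes a .pt file, presumably via torch.save):

```python
# Compute-once-then-cache pattern behind /data/turbo_codebooks.pt,
# illustrated with pickle and a stand-in for the slow computation.
import os
import pickle

def compute_codebooks() -> dict:
    """Stand-in for the Lloyd-Max computation (minutes, CPU-bound)."""
    return {"bits_3.5": [0.0, 1.0], "bits_2.5": [0.0]}

def load_or_build(path: str) -> dict:
    """Warm start: load cached codebooks from disk in under a second.
    Cold start: compute them once and persist for future runs."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    codebooks = compute_codebooks()
    with open(path, "wb") as f:
        pickle.dump(codebooks, f)
    return codebooks
```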


## Eviox Tech

Built on TurboQuant (Zandieh et al., 2025).
Infrastructure by Eviox Tech — HPC & GPU infrastructure.