
# TurboQuant Inference Server

OpenAI-compatible LLM inference with TurboQuant KV cache compression.
Default: Llama-3.1-70B-Instruct at 3.5-bit compression (~4.57× KV memory reduction, quality-neutral per the paper's LongBench results).

Built by Eviox Tech — HPC & GPU infrastructure.


## What it does

|                             | Full-precision | TurboQuant 3.5-bit    | TurboQuant 2.5-bit |
|-----------------------------|----------------|-----------------------|--------------------|
| KV cache per token          | 16 bits/ch     | ~3.5 bits/ch          | ~2.5 bits/ch       |
| Compression ratio           | 1×             | 4.57×                 | 6.4×               |
| Quality (LongBench)         | baseline       | ≈ baseline            | −0.6 avg pts       |
| Equiv. context @ 80GB A100  | ~32k           | ~146k                 | ~205k              |
| Indexing time (d=1536)      |                | 0.001s (vs 240s PQ)   | 0.001s             |
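The ratio and context columns follow directly from the bit widths; a quick sanity check in Python (the ~32k fp16 baseline is taken from the table above):

```python
# Back-of-the-envelope check of the table above: the compression ratio
# is simply fp16 bits divided by quantized bits per channel, and the
# equivalent context scales linearly with that ratio.

FP16_BITS = 16.0

def compression_ratio(kv_bits: float) -> float:
    """Ratio of fp16 KV-cache size to quantized KV-cache size."""
    return FP16_BITS / kv_bits

def equivalent_context(base_context: int, kv_bits: float) -> int:
    """Context length that fits in the same memory after compression."""
    return round(base_context * compression_ratio(kv_bits))

print(round(compression_ratio(3.5), 2))   # 4.57
print(compression_ratio(2.5))             # 6.4
print(equivalent_context(32_000, 3.5))    # 146286  (~146k)
print(equivalent_context(32_000, 2.5))    # 204800  (~205k)
```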

## Stack

```
turboquant-server/
├── app/
│   ├── server.py          # FastAPI — all HTTP endpoints
│   ├── engine.py          # TurboQuantEngine — model load, generation
│   └── config.py          # ServerConfig (env-driven)
├── docker/
│   ├── Dockerfile
│   ├── docker-compose.yml
│   ├── prometheus.yml
│   └── grafana/
│       └── provisioning/  # auto-provisioned dashboard
├── scripts/
│   └── run_benchmark.py   # CLI benchmark tool
├── tests/
│   └── smoke_test.py
├── turboquant_corrected.py # TurboQuant core (corrected implementation)
└── requirements.txt
```

## Quick Start

### 1. Prerequisites

- Docker + NVIDIA Container Toolkit
- HuggingFace token with access to meta-llama/Llama-3.1-70B-Instruct
- 2× A100 80GB (or 4× A100 40GB) for 70B in float16

### 2. Launch

```bash
export HF_TOKEN=hf_your_token_here
export API_KEY=your_secret_key        # optional bearer auth

cd docker
docker compose up -d
```

First startup: ~10–15 min (model download + codebook precomputation).
Subsequent starts: ~3–5 min (model load from cache).

### 3. Verify

```bash
# Health
curl http://localhost:8000/health

# Ready (waits for model)
curl http://localhost:8000/ready

# Models
curl http://localhost:8000/v1/models
```

## API Endpoints

### POST /v1/chat/completions

OpenAI-compatible. Drop-in replacement — change only the base URL.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-turbo",
    "messages": [
      {"role": "system",  "content": "You are a GPFS expert."},
      {"role": "user",    "content": "Explain stripe width in GPFS."}
    ],
    "max_tokens": 256,
    "kv_bits": 3.5
  }'
```

TurboQuant extension: `kv_bits` (3.5 or 2.5) overrides the server default per request.
The response includes `x_turboquant.kv_compression_ratio` in the body.
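The same request can be made from Python with only the standard library; this is a minimal sketch mirroring the curl example above (add an `Authorization: Bearer` header if `API_KEY` is set):

```python
# Minimal stdlib client for /v1/chat/completions. The kv_bits field
# and x_turboquant response block are the TurboQuant extensions
# described above; URL and model name are the server defaults.
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_chat_request(messages, kv_bits=3.5, max_tokens=256,
                       model="llama-3.1-70b-turbo"):
    """Assemble an OpenAI-style request body with the per-request
    kv_bits override."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "kv_bits": kv_bits,
    }

def chat(messages, **kwargs):
    """POST the request and return the parsed JSON response."""
    body = json.dumps(build_chat_request(messages, **kwargs)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    out = chat([{"role": "user", "content": "Explain stripe width in GPFS."}],
               kv_bits=2.5)
    print(out["choices"][0]["message"]["content"])
    print(out["x_turboquant"]["kv_compression_ratio"])
```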

### POST /v1/completions

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.1-70b-turbo","prompt":"GPFS stripe width is","max_tokens":50}'
```

### POST /v1/benchmark

Run a structured compression benchmark:

```bash
curl http://localhost:8000/v1/benchmark \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_tokens":    32768,
    "max_new_tokens":   128,
    "bits_to_compare":  [2.5, 3.5],
    "runs_per_config":  3,
    "task":             "needle"
  }'
```

Tasks: `needle` (retrieval), `summarize`, `qa`.

Or use the CLI:

```bash
python scripts/run_benchmark.py \
  --url http://localhost:8000 \
  --task needle \
  --prompt-tokens 32768 \
  --bits 2.5 3.5 \
  --runs 3 \
  --output results.json
```

### GET /metrics

Prometheus metrics:

| Metric | Description |
|---|---|
| `turboquant_requests_total` | Requests by endpoint / model / bits |
| `turboquant_request_latency_seconds` | End-to-end latency histogram |
| `turboquant_tokens_per_second` | Generation throughput histogram |
| `turboquant_tokens_generated_total` | Total tokens generated |
| `turboquant_kv_compression_ratio` | Static compression ratio by bits |
| `turboquant_gpu_memory_used_gb` | Live GPU memory per device |
| `turboquant_active_requests` | Concurrency gauge |
| `turboquant_context_length_tokens` | Input context distribution |
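For ad-hoc checks without a Prometheus server, the `turboquant_*` series can be pulled straight out of the `/metrics` text. A small stdlib parser, assuming the standard Prometheus exposition format (one sample per line, `# HELP`/`# TYPE` comments):

```python
# Filter TurboQuant series out of a Prometheus text-format /metrics
# dump. Fetching the text itself is left to curl or urllib.

def turboquant_samples(metrics_text: str) -> dict[str, str]:
    """Map each turboquant_* sample line (name plus labels) to its value."""
    samples = {}
    for line in metrics_text.splitlines():
        if line.startswith("turboquant_") and " " in line:
            name, value = line.rsplit(" ", 1)
            samples[name] = value
    return samples

example = """\
# HELP turboquant_active_requests Concurrency gauge
# TYPE turboquant_active_requests gauge
turboquant_active_requests 2
turboquant_kv_compression_ratio{bits="3.5"} 4.57
"""
print(turboquant_samples(example))
```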

## Observability

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)
  Pre-provisioned dashboard: throughput, latency, GPU memory, context length.

## Configuration

All settings via environment variables (see docker-compose.yml):

| Variable | Default | Notes |
|---|---|---|
| `MODEL_NAME` | meta-llama/Llama-3.1-70B-Instruct | HF model ID or local path |
| `MODEL_DTYPE` | float16 | float16 or bfloat16 |
| `DEVICE_MAP` | auto | auto, balanced, cuda:0 |
| `DEFAULT_BITS` | 3.5 | 2.5 or 3.5 |
| `N_OUTLIERS` | 32 | Outlier channels per head |
| `MAX_CONTEXT_LENGTH` | 131072 | Truncation limit |
| `API_KEY` | (empty) | Bearer token auth |
| `HF_TOKEN` | (required) | HuggingFace token |
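A hedged sketch of what an env-driven config along these lines can look like (field names and defaults come from the table above; the actual `ServerConfig` in app/config.py may differ in detail):

```python
# Illustrative env-driven config mirroring the table above; NOT the
# real app/config.py. Values are read from the environment at
# instantiation time, falling back to the documented defaults.
import os
from dataclasses import dataclass, field

@dataclass
class ServerConfig:
    model_name: str = field(default_factory=lambda: os.environ.get(
        "MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct"))
    model_dtype: str = field(default_factory=lambda: os.environ.get(
        "MODEL_DTYPE", "float16"))
    device_map: str = field(default_factory=lambda: os.environ.get(
        "DEVICE_MAP", "auto"))
    default_bits: float = field(default_factory=lambda: float(
        os.environ.get("DEFAULT_BITS", "3.5")))
    n_outliers: int = field(default_factory=lambda: int(
        os.environ.get("N_OUTLIERS", "32")))
    max_context_length: int = field(default_factory=lambda: int(
        os.environ.get("MAX_CONTEXT_LENGTH", "131072")))
    api_key: str = field(default_factory=lambda: os.environ.get("API_KEY", ""))

cfg = ServerConfig()
print(cfg.default_bits, cfg.max_context_length)  # 3.5 131072 with no overrides
```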

## Smoke Tests

```bash
python tests/smoke_test.py --url http://localhost:8000
```

## Architecture notes

### Why single worker?

The model is loaded once into GPU VRAM. Multiple uvicorn workers would each load a separate copy, exhausting memory. Concurrency is handled via asyncio with a generation lock that queues requests, which is appropriate for GPU inference where the bottleneck is compute, not I/O.

### Calibration

On startup, a short GPFS-domain calibration pass identifies outlier channels per layer and head. These are fixed for all subsequent inference, ensuring consistent quant/dequant channel assignments across sequence positions.
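Outlier channels of this kind are commonly selected by activation magnitude over a calibration batch; a hedged, pure-Python sketch (the real statistic used in engine.py may differ):

```python
# Hedged sketch of outlier-channel selection: rank channels by mean
# absolute activation over calibration positions and keep the top
# n_outliers. Illustrative only, not the actual calibration code.

def pick_outlier_channels(activations: list[list[float]],
                          n_outliers: int) -> list[int]:
    """activations[t][c] = value of channel c at sequence position t.
    Returns the indices of the n_outliers channels with the largest
    mean absolute value, sorted ascending for stable assignment."""
    n_channels = len(activations[0])
    mean_abs = [
        sum(abs(row[c]) for row in activations) / len(activations)
        for c in range(n_channels)
    ]
    ranked = sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)
    return sorted(ranked[:n_outliers])

acts = [[0.1, 9.0, 0.2, 5.0],
        [0.2, 8.0, 0.1, 6.0]]
print(pick_outlier_channels(acts, 2))  # [1, 3]
```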

### Codebook caching

Lloyd-Max codebooks are computed once (several minutes, CPU-bound) and saved to /data/turbo_codebooks.pt. Subsequent starts load from disk in <1s.
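The compute-once, load-from-disk pattern looks roughly like this (sketched with pickle for self-containment; the server itself writes a .pt file, presumably via torch.save):

```python
# Compute-once-then-cache pattern behind /data/turbo_codebooks.pt,
# illustrated with pickle and a stand-in for the slow computation.
import os
import pickle

def compute_codebooks() -> dict:
    """Stand-in for the Lloyd-Max computation (minutes, CPU-bound)."""
    return {"bits_3.5": [0.0, 1.0], "bits_2.5": [0.0]}

def load_or_build(path: str) -> dict:
    """Warm start: load cached codebooks from disk in under a second.
    Cold start: compute them once and persist for future runs."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    codebooks = compute_codebooks()
    with open(path, "wb") as f:
        pickle.dump(codebooks, f)
    return codebooks
```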


## Eviox Tech

Built on TurboQuant (Zandieh et al., 2025).
Infrastructure by Eviox Tech — HPC & GPU infrastructure.