Note
Based on the paper: KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (ICLR 2024).
vLLM-KIVI is a high-performance quantization plugin for vLLM that implements the KIVI algorithm in OpenAI Triton. It cuts KV cache memory consumption by 75% (4-bit vs. FP16), enabling roughly 4x larger batch sizes or 4x longer context lengths.
Tested on DeepSeek-R1-8B (GQA: 32 heads, 8 KV heads).
Context length: 2048 tokens, GPU memory utilization: 80%
| Configuration | KV Cache Size (Tokens) | Capacity Gain |
|---|---|---|
| vLLM FP16 (Baseline) | 385,488 | 1.00x |
| vLLM FP8 (Standard) | 770,992 | 2.00x |
| vLLM + KIVI 4-bit | 1,413,040 | 🏆 3.67x |
Batch size: 16, decode length: 50 tokens
| Configuration | Throughput (tok/s) | Speedup (vs FP16) | TPOT (Latency) |
|---|---|---|---|
| vLLM FP16 | 597.48 | 1.00x | 1.67 ms |
| vLLM FP8 | 727.29 | 1.22x | 1.37 ms |
| vLLM + KIVI 4-bit | 728.92 | 🚀 1.22x | ✅ 1.37 ms |
Accuracy measured as perplexity (PPL) on a text sample; lower is better.
| Configuration | Perplexity (PPL) | PPL Increase |
|---|---|---|
| vLLM FP16 (Baseline) | 6.2861 | 0.00% |
| vLLM FP8 | 6.3007 | 0.23% |
| vLLM + KIVI 4-bit | 6.4489 | 2.59% |
To achieve performance and capacity surpassing vLLM's native FP8, we implemented several advanced optimizations:
Standard vLLM allocators support elements no smaller than 1 byte (fp8/int8). To reach 4 bits (0.5 bytes) per element, we implement Dimension Folding: we report a halved head_dim to vLLM (e.g., 128 -> 64) while using uint8 storage, so each byte carries two 4-bit values. This tricks the engine into allocating 4x more tokens in the same HBM footprint as FP16. Our Triton kernels then transparently unpack the folded dimension during computation.
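The folding idea can be sketched in plain Python (a simplified scalar model; the actual packing happens inside the Triton kernels, and the function names below are illustrative, not part of the plugin's API):

```python
# Dimension Folding sketch: two 4-bit values share one uint8 byte, so a
# head_dim of 128 is reported to the allocator as 64 uint8 elements.

def fold(values):
    """Pack pairs of 4-bit integers (0..15) into single bytes."""
    assert len(values) % 2 == 0
    return [((hi & 0x0F) << 4) | (lo & 0x0F)
            for lo, hi in zip(values[0::2], values[1::2])]

def unfold(packed):
    """Recover the original 4-bit values from folded bytes."""
    values = []
    for b in packed:
        values.append(b & 0x0F)         # low nibble
        values.append((b >> 4) & 0x0F)  # high nibble
    return values

head_dim = 128
folded_dim = head_dim // 2   # what the allocator sees: 64 uint8 slots
vals = list(range(16)) * (head_dim // 16)
assert unfold(fold(vals)) == vals
```

Because FP16 uses 2 bytes per element and folded 4-bit storage uses 0.5 bytes, the same HBM budget holds 4x as many tokens.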
vLLM v1 uses a spawn-based multi-process architecture. We implemented a module-level hijacking mechanism that injects KIVI logic directly into the distributed Worker processes, ensuring that our custom kernels are active across all GPU ranks without requiring vLLM source code modification.
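The injection pattern boils down to replacing a symbol on a module object before worker code resolves it. A minimal self-contained sketch (the module and function names here are invented for illustration; they are not vLLM's real internals):

```python
# Sketch of module-level hijacking: register a module in sys.modules and
# swap one of its attributes so later importers see the patched symbol.
import sys
import types

# Stand-in for a backend module that workers would normally import.
mod = types.ModuleType("fake_attention_backend")
def original_attention(q, k, v):
    return "fp16-attention"
mod.attention_fn = original_attention
sys.modules["fake_attention_backend"] = mod

def kivi_attention(q, k, v):
    return "kivi-4bit-attention"

# The "hijack": replace the symbol before any caller binds to it.
sys.modules["fake_attention_backend"].attention_fn = kivi_attention

import fake_attention_backend
result = fake_attention_backend.attention_fn(None, None, None)
```

With spawn-based workers the patch must run during each worker's import phase (not only in the parent process), which is why the hook lives at module level.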
Optimized for modern 8B+ models (like Llama-3 and DeepSeek-R1) which use Grouped Query Attention. The backend automatically discovers model parameters and handles asymmetrical head counts (e.g., 32 Q heads vs 8 KV heads) during 4-bit quantization and reconstruction.
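The GQA head mapping itself is simple arithmetic: consecutive query heads are grouped onto one shared KV head. A minimal sketch (assuming the common contiguous grouping used by Llama-style GQA):

```python
# GQA head mapping: 32 query heads share 8 KV heads, so each KV head
# serves a contiguous group of 4 query heads.
num_q_heads = 32
num_kv_heads = 8
group_size = num_q_heads // num_kv_heads  # 4 query heads per KV head

def kv_head_for(q_head):
    """Return the KV head index that query head `q_head` attends with."""
    return q_head // group_size

mapping = [kv_head_for(q) for q in range(num_q_heads)]
# Query heads 0-3 -> KV head 0, heads 4-7 -> KV head 1, and so on.
```

During 4-bit quantization only the 8 KV heads are stored and dequantized; the mapping above determines which reconstructed KV head each of the 32 query heads reads.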
A specialized A100 Triton kernel that performs on-the-fly 4-bit dequantization fused with paged attention. By performing the unpacking in registers during data loading, we hide the compute latency and leverage A100's HBM bandwidth more efficiently than standard FP16/FP8 backends.
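The asymmetric quantization that the kernel inverts on the fly can be modeled in scalar Python (a simplified per-group sketch of KIVI-style asymmetric 4-bit quantization, not the Triton kernel itself):

```python
# Asymmetric 4-bit quantization of one group of values:
#   q = round((x - zero) / scale), zero = min(x), scale = (max - min) / 15.
# Dequantization (what the fused kernel does in registers) is the inverse.

def quantize_4bit(xs):
    zero = min(xs)
    scale = (max(xs) - zero) / 15 or 1.0  # avoid div-by-zero for flat groups
    q = [round((x - zero) / scale) for x in xs]
    return q, scale, zero

def dequantize_4bit(q, scale, zero):
    return [v * scale + zero for v in q]

q, scale, zero = quantize_4bit([0.0, 0.5, 1.0, 1.5])
recovered = dequantize_4bit(q, scale, zero)
```

In the fused kernel, the unpacking and this multiply-add happen while tiles are being loaded, so dequantization cost is hidden behind HBM access latency.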
Requires vLLM v0.16.0+, Torch 2.4+, and Triton 3.0+.
export PYTHONPATH=.:$PYTHONPATH

# 1. Evaluate Accuracy (PPL)
python3.12 bench/eval_ppl.py --model /path/to/DeepSeek-R1-8B
# 2. Evaluate Capacity (Tokens)
python3.12 bench/simulate_concurrency_limit.py --model /path/to/DeepSeek-R1-8B
# 3. Evaluate Throughput (tok/s)
python3.12 bench/bench_e2e_showdown.py --model /path/to/DeepSeek-R1-8B

Apache License 2.0