LLAA178/vllm-kivi
# vLLM-KIVI: Production-Ready 4-bit KV Cache Quantization for vLLM

vLLM-KIVI is a high-performance quantization plugin for vLLM, implementing the KIVI algorithm using OpenAI Triton. It reduces KV cache memory consumption by 75% (4-bit vs FP16), enabling 4x larger batch sizes or 4x longer context lengths.

## 🚀 A100 80GB Benchmark (DeepSeek-R1-Distill-Llama-8B)

Tested on DeepSeek-R1-Distill-Llama-8B (GQA: 32 query heads, 8 KV heads).

### 1. Concurrency & Capacity (Memory Limit)

Context length: 2048, GPU memory utilization: 80%

| Configuration | KV Cache Size (Tokens) | Capacity Gain |
|---|---|---|
| vLLM FP16 (baseline) | 385,488 | 1.00x |
| vLLM FP8 (standard) | 770,992 | 2.00x |
| vLLM + KIVI 4-bit | 1,413,040 | 🏆 3.67x |
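The capacity gains follow directly from the token counts above; a quick sanity check (the metadata-overhead explanation in the comment is our assumption, not something the benchmark states):

```python
# Reproduce the capacity-gain column from the table above.
fp16_tokens = 385_488
fp8_tokens = 770_992
kivi_tokens = 1_413_040

print(f"FP8  gain: {fp8_tokens / fp16_tokens:.2f}x")   # 2.00x
print(f"KIVI gain: {kivi_tokens / fp16_tokens:.2f}x")  # 3.67x

# The 4-bit gain lands below the ideal 4.00x; plausibly because each
# quantization group also stores scale/zero-point metadata alongside
# the packed codes (assumption -- the exact layout is not given here).
```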

### 2. Throughput & Latency

Batch size: 16, decode length: 50 tokens

| Configuration | Throughput (tok/s) | Speedup (vs FP16) | TPOT (Latency) |
|---|---|---|---|
| vLLM FP16 | 597.48 | 1.00x | 1.67 ms |
| vLLM FP8 | 727.29 | 1.22x | 1.37 ms |
| vLLM + KIVI 4-bit | 728.92 | 🚀 1.22x | ✅ 1.37 ms |

### 3. Precision (Perplexity)

Text-sample evaluation (lower PPL is better)

| Configuration | Perplexity (PPL) | PPL Increase |
|---|---|---|
| vLLM FP16 (baseline) | 6.2861 | 0.00% |
| vLLM FP8 | 6.3007 | 0.23% |
| vLLM + KIVI 4-bit | 6.4489 | 2.59% |

## 💡 Technical Solution

To achieve performance and capacity surpassing vLLM's native FP8, we implemented several advanced optimizations:

### 1. Dimension Folding (4x Capacity)

Standard vLLM cache allocators support elements no smaller than 1 byte (fp8/int8). To reach 4-bit (0.5 bytes per element), we implement Dimension Folding: we report a halved head_dim to vLLM (e.g., 128 -> 64) while using uint8 storage, so the engine allocates 4x more tokens in the same HBM footprint. Our Triton kernels then transparently unpack the folded dimensions during computation.
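The folding trick can be sketched in NumPy: two 4-bit codes share one uint8 byte along the head dimension, so the stored head_dim is half the logical one. The function names below are illustrative, not the project's actual API:

```python
import numpy as np

def fold_pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack two 4-bit codes per uint8 along the last axis, halving
    head_dim (e.g. 128 -> 64) so a 1-byte allocator can back 4-bit
    storage. Hypothetical helper, not vLLM-KIVI's real function."""
    assert codes.shape[-1] % 2 == 0 and codes.max() < 16
    lo = codes[..., 0::2].astype(np.uint8)
    hi = codes[..., 1::2].astype(np.uint8)
    return lo | (hi << 4)

def unfold_unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Inverse: recover the logical head_dim from the folded cache."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out

q = np.random.randint(0, 16, size=(8, 128), dtype=np.uint8)  # 4-bit codes
packed = fold_pack_4bit(q)   # shape (8, 64): head_dim appears halved
assert packed.shape == (8, 64)
assert np.array_equal(unfold_unpack_4bit(packed), q)  # round-trips exactly
```

In the real kernels this unpacking happens in registers at load time rather than as a separate pass.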

### 2. Aggressive Cross-Process Injection

vLLM v1 uses a spawn-based multi-process architecture. We implemented a module-level hijacking mechanism that injects KIVI logic directly into the distributed Worker processes, ensuring that our custom kernels are active across all GPU ranks without requiring vLLM source code modification.
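The general pattern is import-time monkey-patching: rebind a symbol in `sys.modules` before worker code resolves it. A minimal, library-agnostic sketch (the module and function names are stand-ins, not vLLM's real internals; with spawn-based workers the patch must re-run at import time in every process):

```python
import sys
import types

# Stand-in for a third-party module that a worker process would import.
fake_backend = types.ModuleType("fake_backend")
fake_backend.attention = lambda q, k, v: "fp16-attention"
sys.modules["fake_backend"] = fake_backend

def kivi_attention(q, k, v):
    """Replacement kernel entry point (illustrative only)."""
    return "kivi-4bit-attention"

# Injection: rebind the symbol before any worker code looks it up.
sys.modules["fake_backend"].attention = kivi_attention

import fake_backend  # resolves to the patched module in sys.modules
print(fake_backend.attention(None, None, None))  # kivi-4bit-attention
```

In practice this runs from a package `__init__` that is guaranteed to be imported before each spawned worker constructs its attention backend.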

### 3. GQA & BFloat16 Support

Optimized for modern 8B+ models (like Llama-3 and DeepSeek-R1) which use Grouped Query Attention. The backend automatically discovers model parameters and handles asymmetrical head counts (e.g., 32 Q heads vs 8 KV heads) during 4-bit quantization and reconstruction.
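The GQA bookkeeping amounts to mapping each group of query heads onto its shared KV head, conceptually a repeat along the head axis. A reference sketch (in the real kernel this is an index computation, not a materialized copy; the helper name is hypothetical):

```python
import numpy as np

def expand_kv_heads(kv: np.ndarray, num_q_heads: int) -> np.ndarray:
    """Repeat each KV head for its group of query heads, e.g.
    32 Q heads / 8 KV heads -> group size 4. Hypothetical helper;
    the real backend resolves this mapping inside the kernel."""
    num_kv_heads = kv.shape[0]
    assert num_q_heads % num_kv_heads == 0
    group = num_q_heads // num_kv_heads
    return np.repeat(kv, group, axis=0)

k = np.zeros((8, 16, 128), dtype=np.float32)  # (kv_heads, seq, head_dim)
k_expanded = expand_kv_heads(k, num_q_heads=32)
assert k_expanded.shape == (32, 16, 128)
```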

### 4. Triton-Fused Attention Kernel

A specialized A100 Triton kernel performs on-the-fly 4-bit dequantization fused with paged attention. By unpacking in registers during data loading, we hide the dequantization cost and leverage A100's HBM bandwidth more efficiently than standard FP16/FP8 backends.
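A plain-Python reference of the per-tile math the fused kernel performs in registers, using group-wise asymmetric 4-bit quantization (`x ~= code * scale + group_min`). The group size of 32 is our assumption for illustration; KIVI quantizes keys per-channel and values per-token:

```python
import numpy as np

GROUP = 32  # elements per quantization group (illustrative choice)

def quantize_4bit(x: np.ndarray):
    """Group-wise asymmetric 4-bit quantization: codes in [0, 15]
    plus one (scale, min) pair per group."""
    g = x.reshape(-1, GROUP)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8          # avoid div-by-zero
    code = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return code, scale, lo

def dequantize_4bit(code, scale, lo):
    """What the fused kernel computes on-the-fly per loaded tile."""
    return (code.astype(np.float32) * scale + lo).reshape(-1)

x = np.random.randn(4 * GROUP).astype(np.float32)
code, scale, lo = quantize_4bit(x)
x_hat = dequantize_4bit(code, scale, lo)
assert np.max(np.abs(x - x_hat)) <= np.max(scale)  # within one step
```

Fusing this into the attention load means the 4-bit codes never round-trip through HBM as FP16, which is where the bandwidth win comes from.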


## 🛠️ Usage

### Environment Setup

Requires vLLM v0.16.0+, Torch 2.4+, and Triton 3.0+.

```bash
export PYTHONPATH=.:$PYTHONPATH
```

### Running Benchmarks

```bash
# 1. Evaluate accuracy (PPL)
python3.12 bench/eval_ppl.py --model /path/to/DeepSeek-R1-8B

# 2. Evaluate capacity (tokens)
python3.12 bench/simulate_concurrency_limit.py --model /path/to/DeepSeek-R1-8B

# 3. Evaluate throughput (tok/s)
python3.12 bench/bench_e2e_showdown.py --model /path/to/DeepSeek-R1-8B
```

## 📜 License

Apache License 2.0
