GitHub - netpreme/vllm_xmem: vLLM with X-Mem support for KV offloading

vLLM meets Netpreme's scale-up GPU memory expansion

🔥 We have built a prototype platform to enable developers and researchers to explore the use cases of scale-up GPU memory expansion. Please contact us to get access to it.

About

This repository is a fork of vLLM that integrates Netpreme’s scale-up GPU memory expansion system, X-Mem, as a dedicated tier for KV cache storage. By replacing traditional CPU DRAM with X-Mem in the KV offloading module, we leverage ~10x higher bandwidth to bypass standard memory bottlenecks. This allows an inference engine to reduce Time to First Token (TTFT) and achieve higher throughput and concurrency for KV-intensive workloads, such as multi-turn coding agents.

Getting Started

Install vLLM+X-Mem from source

uv venv --python 3.12
git clone <url-to-repo>
cd vllm_xmem
uv pip install -e .

Note

vLLM+X-Mem will only work on Netpreme's X-Mem VMs. Please contact us to get access to it.

Usage

To replace CPU DRAM with X-Mem:

vllm serve CLI API: add the --kv-offloading-mtier flag.

LLM() Python API: set the kv_offload_mtier argument to True. For example:

from vllm import LLM
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="OffloadingConnector",
    kv_role="kv_both",
    # The config below is applied to X-Mem
    # instead of CPU DRAM if X-Mem is enabled.
    kv_connector_extra_config={
        "block_size": CPU_BLOCK_SIZE,
        "cpu_bytes_to_use": cpu_bytes_to_use,
        "num_cpu_blocks": num_cpu_blocks,
    },
)
llm = LLM(
    model=MODEL,
    block_size=GPU_BLOCK_SIZE,
    # ... (other arguments elided)
    enable_prefix_caching=True,
    kv_transfer_config=ktc,
    kv_offload_mtier=True,  # route KV offloading to X-Mem
)
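For the CLI path, the same connector settings can be passed as JSON. The following is a minimal sketch that assembles a `vllm serve` command line in Python. It assumes that upstream vLLM's `--kv-transfer-config` and `--enable-prefix-caching` flags behave unchanged in this fork and that they combine with `--kv-offloading-mtier` as described above. The helper `build_serve_command`, the model ID, and the numeric values are illustrative, and only two of the three extra-config keys are shown.

```python
import json
import shlex


def build_serve_command(model, extra_config):
    """Assemble an argv list for `vllm serve` with KV offloading enabled.

    Hypothetical helper for illustration; it only builds strings and does
    not import or launch vLLM.
    """
    kv_transfer_config = {
        "kv_connector": "OffloadingConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": extra_config,
    }
    return [
        "vllm", "serve", model,
        "--enable-prefix-caching",
        "--kv-transfer-config", json.dumps(kv_transfer_config),
        # Flag named in this README; switches the offload tier to X-Mem.
        "--kv-offloading-mtier",
    ]


argv = build_serve_command(
    "Qwen/Qwen3-30B-A3B-Instruct-2507",  # assumed Hugging Face model ID
    {"block_size": 128, "cpu_bytes_to_use": 128 * 1024**3},  # example values
)
# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(argv))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the nested config on the command line.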

📊 Performance Benchmarks

We evaluated the integration of X-Mem into vLLM's KV offload connector against a CPU DRAM baseline, using the benchmark code from the vLLM blog post on KV offloading (blog, code).

Setup

Hardware: Single H200 GPU with 128GB of dedicated KV offloading memory (either Host DRAM or X-Mem).
Model: Qwen3-30B-A3B-Instruct-2507 (96 KB per-token KV cache).
Configuration: GPU KV cache disabled (to isolate offloading), block size = 128 tokens. The throughput experiment uses 10,000 prefill requests of 512 tokens each.
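To make the setup concrete, the sizes above can be turned into per-block and per-prompt figures. This is a back-of-the-envelope sketch, not part of the benchmark script. It assumes "KB" and "GB" mean KiB and GiB, and that the whole 128 GB is available for KV blocks with no metadata overhead.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Figures taken from the setup description (binary units are an assumption).
kv_bytes_per_token = 96 * KIB
block_size_tokens = 128
offload_capacity_bytes = 128 * GIB
prompt_tokens = 512

# One offloaded block holds the KV cache for 128 tokens.
bytes_per_block = kv_bytes_per_token * block_size_tokens

# Whole blocks that fit in the offload tier, ignoring any overhead.
num_blocks = offload_capacity_bytes // bytes_per_block

# Ceiling division: blocks needed to cover one 512-token prompt.
blocks_per_prompt = -(-prompt_tokens // block_size_tokens)
bytes_per_prompt = blocks_per_prompt * bytes_per_block

print(f"block size:        {bytes_per_block // MIB} MiB")   # 12 MiB
print(f"blocks in tier:    {num_blocks}")                   # 10922
print(f"blocks per prompt: {blocks_per_prompt}")            # 4
print(f"KV per prompt:     {bytes_per_prompt // MIB} MiB")  # 48 MiB
```

Under these assumptions, every cache hit on a 512-token prompt moves about 48 MiB from the offload tier back to the GPU.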

Key Results

X-Mem shows significant gains over CPU DRAM for high cache-hit rate workloads, where data copying dominates computation:

🚀 Time to First Token (TTFT): ~4× faster than CPU DRAM.

⚡ Throughput: ~3× higher input token throughput.
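One way to see why copy bandwidth matters at high cache-hit rates is a bandwidth-only lower bound on the time to load one prompt's KV cache. The sketch below is illustrative and is not a measurement. It takes the ~450 GB/s theoretical X-Mem limit from the Roadmap, assumes a hypothetical 45 GB/s baseline (implied by the "~10x" figure, not measured here), and uses 48 MiB per 512-token prompt (96 KiB per token, assuming binary units). It ignores latency, kernel launch overhead, and compute, so real TTFT ratios will differ.

```python
def min_copy_time_ms(num_bytes, bandwidth_gb_per_s):
    """Lower bound on transfer time in ms, assuming full bandwidth is sustained."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# 512 tokens * 96 KiB per token = 48 MiB of KV cache per prompt.
kv_bytes_per_prompt = 512 * 96 * 1024

xmem_ms = min_copy_time_ms(kv_bytes_per_prompt, 450.0)  # theoretical limit
dram_ms = min_copy_time_ms(kv_bytes_per_prompt, 45.0)   # hypothetical baseline

print(f"X-Mem lower bound:    {xmem_ms:.3f} ms")  # about 0.112 ms
print(f"Baseline lower bound: {dram_ms:.3f} ms")  # about 1.118 ms
```

The measured gains (~4× TTFT, ~3× throughput) are smaller than the raw bandwidth ratio, which is consistent with the Roadmap's aim of closing the gap to the theoretical limit.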

Running the benchmark script

To reproduce the results, we provide a benchmark script (kvcache_benchmark.py) based on the benchmark used in the vLLM blog (blog, code).

TTFT: python kvcache_benchmark.py --ttft
Throughput: python kvcache_benchmark.py --tput

Roadmap

Optimize implementation to close the gap with the ~450 GB/s theoretical bandwidth limit.
Expand benchmark across various block sizes and prompt lengths.
Benchmark using production traces (e.g., Mooncake Trace) with the GPU KV cache enabled to represent real-world workloads.
