🔥 We have built a prototype platform to enable developers and researchers to explore the use cases of scale-up GPU memory expansion. Please contact us to get access to it.
This repository is a fork of SGLang that integrates Netpreme's scale-up GPU memory expansion system, X-Mem, as a dedicated tier for KV cache storage. By replacing traditional CPU DRAM with X-Mem in the hierarchical KV cache (HiCache), we leverage ~10x higher bandwidth to reduce Time to First Token (TTFT) and achieve higher throughput for KV-intensive workloads, such as multi-turn coding agents.
```bash
uv venv --python 3.12
source .venv/bin/activate
git clone <url-to-repo>
cd sglang_xmem

# Install sglang
cd python
uv pip install -e .

# Install sgl-kernel (requires CUDA toolkit)
cd ../sgl-kernel
apt-get install -y libnuma-dev
export PATH=/usr/local/cuda/bin:$PATH
MAX_JOBS=10 CMAKE_ARGS="-DCMAKE_POLICY_VERSION_MINIMUM=3.5" uv pip install -e . --no-build-isolation

# Install X-Mem
uv pip install xmem-mtier==0.1.2
```

> **Note**
> SGLang+X-Mem will only work on Netpreme's X-Mem VMs. Please contact us to get access.
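After installing, a quick sanity check can confirm that all three packages resolve. This snippet is a hypothetical convenience, not part of the repository; the package names are taken from the install steps above:

```python
from importlib import metadata

# Hypothetical sanity check: print the installed version of each package
# from the steps above, or flag it as missing.
for pkg in ("sglang", "sgl-kernel", "xmem-mtier"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "NOT INSTALLED")
```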
To replace CPU DRAM with X-Mem, add `--hicache-use-xmem` to the launch command:

```bash
python -m sglang.launch_server \
  --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --enable-hierarchical-cache \
  --hicache-size 60 \
  --hicache-write-policy write_through \
  --hicache-io-backend kernel \
  --hicache-mem-layout layer_first \
  --page-size 64 \
  --mem-fraction-static 0.5 \
  --hicache-use-xmem
```

Or via the Python API:
```python
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
    page_size=64,
    mem_fraction_static=0.5,
    enable_hierarchical_cache=True,
    hicache_size=60,
    hicache_write_policy="write_through",
    hicache_io_backend="kernel",
    hicache_mem_layout="layer_first",
    hicache_use_xmem=True,
)
```

Use `--no-hicache-use-xmem` to fall back to CPU DRAM.
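The keyword arguments mirror the CLI flags one-to-one: snake_case names become `--kebab-case` flags, and booleans become bare flags (or their `--no-` form). A small sketch of that mapping, assuming a hypothetical helper `to_cli_args` that is not part of SGLang:

```python
def to_cli_args(**kwargs):
    """Hypothetical helper: render Engine kwargs as launch_server flags.

    snake_case names map to --kebab-case flags; True becomes a bare flag,
    False becomes the --no- prefixed form; other values are appended
    after the flag.
    """
    args = []
    for name, value in kwargs.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)
        elif value is False:
            args.append("--no-" + name.replace("_", "-"))
        else:
            args.extend([flag, str(value)])
    return args

print(to_cli_args(hicache_size=60, hicache_use_xmem=True))
# → ['--hicache-size', '60', '--hicache-use-xmem']
```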
We evaluated the integration of X-Mem into SGLang's hierarchical KV cache, comparing it against a CPU DRAM baseline. We used a benchmark methodology based on the vLLM blog post on KV offloading (blog, code).
- Hardware: Single H100 GPU with 60GB of dedicated KV offloading memory (either Host DRAM or X-Mem).
- Model: Qwen3-30B-A3B-Instruct-2507-FP8 (96 KB per-token KV cache).
- Configuration: page_size=64 tokens, layer_first layout, write_through policy.
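These numbers fix the offload tier's geometry: at 96 KB per token, each 64-token page is 6 MB, and the 60 GB tier holds roughly 655K tokens of KV cache. A back-of-the-envelope check (pure arithmetic, no SGLang dependency):

```python
# Figures taken from the benchmark setup above.
KV_PER_TOKEN_KB = 96   # per-token KV cache size for the model
PAGE_TOKENS = 64       # --page-size
HICACHE_GB = 60        # --hicache-size (dedicated offload memory)

page_mb = KV_PER_TOKEN_KB * PAGE_TOKENS / 1024
tokens_capacity = HICACHE_GB * 1024 * 1024 // KV_PER_TOKEN_KB

print(f"{page_mb:.0f} MB per page")            # 6 MB
print(f"{tokens_capacity:,} token capacity")   # 655,360 tokens
```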
X-Mem shows significant gains over CPU DRAM for high cache-hit rate workloads, where data copying dominates computation:
- 🚀 Time to First Token (TTFT): ~6.7× faster than CPU DRAM at 80K tokens.
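An Amdahl-style model shows why a ~10× bandwidth gain can translate into a ~6.7× end-to-end TTFT gain: only the copy portion of TTFT speeds up, so the observed ratio implies that copying dominated the baseline. The copy fraction below is a back-solved illustration, not a measured figure:

```python
# Illustrative model (assumed numbers, not measurements): if a fraction f
# of baseline TTFT is KV copy time and X-Mem speeds copies up ~10x, the
# end-to-end speedup is 1 / ((1 - f) + f / 10).
def ttft_speedup(copy_fraction, bw_gain=10.0):
    return 1.0 / ((1.0 - copy_fraction) + copy_fraction / bw_gain)

# A ~6.7x end-to-end gain corresponds to copies taking roughly 95% of
# the baseline TTFT, consistent with "data copying dominates computation".
print(round(ttft_speedup(0.945), 1))  # → 6.7
```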
```bash
# TTFT benchmark
python kvcache_benchmark.py --ttft --backend both

# X-Mem only
python kvcache_benchmark.py --ttft --backend xmem
```