[Bug]: --cpu-offload-gb broken on Windows (AssertionError in v0.16.0) #42

@ghbaud

Description

Your current environment

Environment

collect_env.py output

Collecting environment information...

  System Info

==============================
OS : Microsoft Windows 11 Pro
GCC version : Could not collect
Clang version : Could not collect
CMake version : Could not collect
Libc version : N/A

==============================
PyTorch Info

PyTorch version : 2.11.0.dev20260216+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A

==============================
Python Environment

Python version : 3.12.10 (tags/v3.12.10:0cc8128, Apr 8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform : Windows-11-10.0.26200-SP0

==============================
CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : 12.6.20
CUDA_MODULE_LOADING set to :
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version : 591.74
cuDNN version : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\cudnn_ops64_9.dll
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

==============================
CPU Info

Architecture=9
CurrentClockSpeed=2100
DeviceID=CPU0
Family=198
L2CacheSize=12288
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2100
Name=12th Gen Intel(R) Core(TM) i7-12700F
ProcessorType=3
Revision=

==============================
Versions of relevant libraries

[pip3] flashinfer-python==0.6.3
[pip3] numpy==2.2.6
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-ml-py==13.590.48
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0.dev20260216+cu126
[pip3] torchaudio==2.11.0.dev20260216+cu126
[pip3] torchvision==0.26.0.dev20260216+cu126
[pip3] transformers==4.57.6
[pip3] triton-windows==3.6.0.post25
[conda] Could not collect

==============================
vLLM Info

ROCM Version : Could not collect
vLLM Version : 0.16.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect

==============================
Environment Variables

CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6
CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=C:\Users\dhite\AppData\Local\Temp\torchinductor_dhite

🐛 Describe the bug

The --cpu-offload-gb flag, which offloads part of the model weights to system RAM to free VRAM for the KV cache, does not
work on Windows: initialization fails with an AssertionError.

This makes it impractical to run tightly-fitting models (like Voxtral Mini 4B, whose weights take ~8.4 GiB) on 12 GB GPUs.
Without CPU offload, the model fills nearly all VRAM, leaving only ~0.82 GiB of KV cache (~400 tokens). With working CPU
offload, moving even 1-2 GiB of weights to RAM would raise the KV cache to ~1120+ tokens and make the model reliable.
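The VRAM arithmetic behind these numbers can be sketched as follows. This is only a back-of-the-envelope check: the 1.7 GiB "overhead" term is an assumption back-solved from the figures reported above, not something vLLM reports directly, and the tokens-per-GiB ratio is likewise derived from the reported ~400 tokens in ~0.82 GiB.

```python
# Rough VRAM/KV-cache budget for the figures in this report (a sketch, not
# vLLM's actual accounting; OVERHEAD_GIB is inferred, not measured).
TOTAL_VRAM_GIB = 12.0   # RTX 3060
GPU_MEM_UTIL = 0.91     # --gpu-memory-utilization
WEIGHTS_GIB = 8.4       # Voxtral Mini 4B weights (as reported)
OVERHEAD_GIB = 1.70     # implied by the ~0.82 GiB KV cache seen without offload

def kv_cache_gib(offload_gib: float) -> float:
    """GiB left for KV cache after on-GPU weights and overhead."""
    budget = TOTAL_VRAM_GIB * GPU_MEM_UTIL
    return budget - (WEIGHTS_GIB - offload_gib) - OVERHEAD_GIB

TOKENS_PER_GIB = 400 / 0.82  # ~488 tokens/GiB, from the reported budget

print(kv_cache_gib(0.0))                   # ~0.82 GiB (~400 tokens, as observed)
print(kv_cache_gib(2.0))                   # ~2.82 GiB with --cpu-offload-gb 2
print(kv_cache_gib(2.0) * TOKENS_PER_GIB)  # ~1375 tokens of headroom
```

Under these assumptions, a 2 GiB offload roughly triples the KV-cache headroom, which is consistent with the ~1120+ token estimate above.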

Steps to Reproduce

  1. Windows 11 with NVIDIA RTX 3060 (12GB VRAM)
  2. Install vllm-windows v0.16.0
  3. Run:
    vllm serve D:/STT/vllm/models/voxtral-mini-4b \
      --host 0.0.0.0 --port 8000 \
      --gpu-memory-utilization 0.91 \
      --enforce-eager \
      --max-model-len 1024 \
      --cpu-offload-gb 2
  4. Observe AssertionError during startup

Expected Behavior

vLLM offloads 2 GiB of model weights to system RAM, freeing ~2 GiB of VRAM for KV cache. The server starts and serves
requests with a larger KV cache.

Actual Behavior

Startup fails with AssertionError. Removing --cpu-offload-gb allows startup but with only ~400 tokens of KV cache, making
the model unreliable for streaming speech-to-text (Voxtral Realtime).

Workaround

Run without --cpu-offload-gb and accept the tight KV cache budget:

vllm serve D:/STT/vllm/models/voxtral-mini-4b \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.91 \
  --enforce-eager \
  --max-model-len 1024 \
  --limit-mm-per-prompt '{"audio": 1}'

This works but limits KV cache to ~0.82 GiB (~400 tokens), which is marginal for real-time audio transcription.

Impact

Users with 12GB GPUs cannot run large multimodal models (4B+ parameters) effectively because:

  • Model weights consume ~8.4 GiB
  • Without CPU offload, only ~3.6 GiB remains for KV cache + overhead
  • With --enforce-eager (required to avoid CUDA graph memory), KV cache is further reduced
  • The model works but is unreliable under sustained use

Working CPU offload would make these models practical on consumer GPUs.
