[Bug]: --cpu-offload-gb broken on Windows (AssertionError in v0.16.0) #42

@ghbaud

Description

Your current environment

Environment

collect_env.py output

Collecting environment information...

  System Info

==============================
OS : Microsoft Windows 11 Pro
GCC version : Could not collect
Clang version : Could not collect
CMake version : Could not collect
Libc version : N/A

==============================
PyTorch Info

PyTorch version : 2.11.0.dev20260216+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A

==============================
Python Environment

Python version : 3.12.10 (tags/v3.12.10:0cc8128, Apr 8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform : Windows-11-10.0.26200-SP0

==============================
CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : 12.6.20
CUDA_MODULE_LOADING set to :
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version : 591.74
cuDNN version : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\cudnn_ops64_9.dll
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

==============================
CPU Info

Architecture=9
CurrentClockSpeed=2100
DeviceID=CPU0
Family=198
L2CacheSize=12288
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2100
Name=12th Gen Intel(R) Core(TM) i7-12700F
ProcessorType=3
Revision=

==============================
Versions of relevant libraries

[pip3] flashinfer-python==0.6.3
[pip3] numpy==2.2.6
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-ml-py==13.590.48
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0.dev20260216+cu126
[pip3] torchaudio==2.11.0.dev20260216+cu126
[pip3] torchvision==0.26.0.dev20260216+cu126
[pip3] transformers==4.57.6
[pip3] triton-windows==3.6.0.post25
[conda] Could not collect

==============================
vLLM Info

ROCM Version : Could not collect
vLLM Version : 0.16.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect

==============================
Environment Variables

CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6
CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=C:\Users\dhite\AppData\Local\Temp\torchinductor_dhite

🐛 Describe the bug

The --cpu-offload-gb flag, which offloads part of the model weights to system RAM to free VRAM for the KV cache, does not
work on Windows: initialization fails with an AssertionError.

This makes it impractical to run tightly-fitting models (like Voxtral Mini 4B, whose weights take ~8.4 GiB) on 12 GB GPUs.
Without CPU offload, the model fills nearly all VRAM, leaving only ~0.82 GiB of KV cache (~400 tokens). With working CPU
offload, moving even 1-2 GiB of weights to RAM would raise the KV cache to ~1120+ tokens and make the model reliable.
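The VRAM arithmetic behind these numbers can be sketched as follows. This is only a back-of-the-envelope check: the 1.7 GiB "overhead" term is an assumption back-solved from the figures reported above, not something vLLM reports directly, and the tokens-per-GiB ratio is likewise derived from the reported ~400 tokens in ~0.82 GiB.

```python
# Rough VRAM/KV-cache budget for the figures in this report (a sketch, not
# vLLM's actual accounting; OVERHEAD_GIB is inferred, not measured).
TOTAL_VRAM_GIB = 12.0   # RTX 3060
GPU_MEM_UTIL = 0.91     # --gpu-memory-utilization
WEIGHTS_GIB = 8.4       # Voxtral Mini 4B weights (as reported)
OVERHEAD_GIB = 1.70     # implied by the ~0.82 GiB KV cache seen without offload

def kv_cache_gib(offload_gib: float) -> float:
    """GiB left for KV cache after on-GPU weights and overhead."""
    budget = TOTAL_VRAM_GIB * GPU_MEM_UTIL
    return budget - (WEIGHTS_GIB - offload_gib) - OVERHEAD_GIB

TOKENS_PER_GIB = 400 / 0.82  # ~488 tokens/GiB, from the reported budget

print(kv_cache_gib(0.0))                   # ~0.82 GiB (~400 tokens, as observed)
print(kv_cache_gib(2.0))                   # ~2.82 GiB with --cpu-offload-gb 2
print(kv_cache_gib(2.0) * TOKENS_PER_GIB)  # ~1375 tokens of headroom
```

Under these assumptions, a 2 GiB offload roughly triples the KV-cache headroom, which is consistent with the ~1120+ token estimate above.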

Steps to Reproduce

  1. Windows 11 with NVIDIA RTX 3060 (12GB VRAM)
  2. Install vllm-windows v0.16.0
  3. Run:
    vllm serve D:/STT/vllm/models/voxtral-mini-4b \
      --host 0.0.0.0 --port 8000 \
      --gpu-memory-utilization 0.91 \
      --enforce-eager \
      --max-model-len 1024 \
      --cpu-offload-gb 2
  4. Observe AssertionError during startup

Expected Behavior

vLLM offloads 2 GiB of model weights to system RAM, freeing ~2 GiB of VRAM for KV cache. The server starts and serves
requests with a larger KV cache.

Actual Behavior

Startup fails with AssertionError. Removing --cpu-offload-gb allows startup but with only ~400 tokens of KV cache, making
the model unreliable for streaming speech-to-text (Voxtral Realtime).

Workaround

Run without --cpu-offload-gb and accept the tight KV cache budget:

vllm serve D:/STT/vllm/models/voxtral-mini-4b \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.91 \
  --enforce-eager \
  --max-model-len 1024 \
  --limit-mm-per-prompt '{"audio": 1}'

This works but limits KV cache to ~0.82 GiB (~400 tokens), which is marginal for real-time audio transcription.

Impact

Users with 12GB GPUs cannot run large multimodal models (4B+ parameters) effectively because:

  • Model weights consume ~8.4 GiB
  • Without CPU offload, only ~3.6 GiB remains for KV cache + overhead
  • With --enforce-eager (required to avoid CUDA graph memory), KV cache is further reduced
  • The model works but is unreliable under sustained use

Working CPU offload would make these models practical on consumer GPUs.
