diff --git a/benchmark/vllm/README.md b/benchmark/vllm/README.md
index 522a104..b4e9782 100644
--- a/benchmark/vllm/README.md
+++ b/benchmark/vllm/README.md
@@ -61,6 +61,12 @@ The following command pulls the Docker image from Docker Hub.
 docker pull vllm/vllm-openai-rocm:v0.17.1
 ```
 
+For Gemma 4, use the Gemma4-tagged image (also referenced by [`docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile`](../../docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile)):
+
+```sh
+docker pull vllm/vllm-openai-rocm:gemma4
+```
+
 ### MAD-integrated benchmarking
 
 Clone the ROCm Model Automation and Dashboarding (MAD) repository to a local directory and install the required packages on the host machine.
@@ -86,7 +92,25 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
 #### Available models
 
 >[!NOTE]
->The MXFP4 models are only supported on the gfx950 architecture i.e. MI350X/MI355X accelerators.
+>The MXFP4 models are only supported on the gfx950 architecture, i.e., MI350X/MI355X accelerators.
+
+>[!NOTE]
+>Gemma 4 models (`pyt_vllm_gemma-4-*`) are built from `vllm/vllm-openai-rocm:gemma4` (see [`docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile`](../../docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile)). Accept Google’s Gemma license on Hugging Face and set `MAD_SECRETS_HFTOKEN` for gated weight downloads.
+
+Serving recipes for Gemma 4 live in [`scripts/vllm/configs/default.yaml`](../../scripts/vllm/configs/default.yaml). Both Gemma 4 entries use **tensor parallel size 1**, **`TRITON_ATTN`**, **`float16` on gfx942** (via `arch_overrides`), **`--max-model-len` 32768**, text-only multimodal limits (`--limit-mm-per-prompt`), and **`VLLM_ROCM_USE_AITER=1`** where supported. A sample MAD invocation is shown after the following table.
+
+| Model | Notes |
+| ----- | ----- |
+| **google/gemma-4-31B-it** | Dense instruct. Full serving sweep: **`max_concurrency` 1, 8, 32, 128** (four cold starts). |
+| **google/gemma-4-26B-A4B-it** | Sparse MoE (“A4B”). **AITER fused MoE is disabled** via **`VLLM_ROCM_USE_AITER_MOE=0`** so MoE runs on the **Triton** path. **Concurrency sweep is narrowed to 1 and 8** to stay within typical MAD Docker memory limits. |
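+
+For reference, a MAD run for one of the Gemma 4 models in the table below looks like the following sketch (the entry-point script and flags may differ in your MAD checkout; adjust before use):
+
+```sh
+# Illustrative only -- run from the MAD repository root.
+export MAD_SECRETS_HFTOKEN=<your_hf_token>   # required for the gated Gemma weights
+python3 tools/run_models.py --tags pyt_vllm_gemma-4-31b-it --live-output --timeout 28800
+```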
 
 | MAD model name | Model repo |
 | -------------------------------------- | -------------------------------------- |
@@ -112,6 +128,8 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
 | pyt_vllm_mixtral-8x22b | [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
 | pyt_vllm_mixtral-8x22b_fp8 | [amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV](https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV) |
 | pyt_vllm_phi-4 | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
+| pyt_vllm_gemma-4-26b-a4b-it | [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) |
+| pyt_vllm_gemma-4-31b-it | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
 | pyt_vllm_qwen3-8b | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
 | pyt_vllm_qwen3-32b | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
 | pyt_vllm_qwen3-30b-a3b | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
@@ -132,6 +150,16 @@ docker pull vllm/vllm-openai-rocm:v0.17.1
 
 docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.17.1
 ```
 
+For Gemma 4 standalone runs, substitute `vllm/vllm-openai-rocm:gemma4` for the image tag in the `docker run` line above, as shown below. For **`google/gemma-4-26B-A4B-it`** only, also set **`VLLM_ROCM_USE_AITER_MOE=0`** (same as the MAD `default.yaml` recipe) so MoE does not use AITER’s fused path.
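+
+For example, to benchmark **`google/gemma-4-26B-A4B-it`** (a sketch assembled from the generic `docker run` line above; the container name is illustrative):
+
+```sh
+docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env VLLM_ROCM_USE_AITER_MOE=0 --env HUGGINGFACE_HUB_CACHE=/workspace --name gemma4_test vllm/vllm-openai-rocm:gemma4
+```
+
+Drop `--env VLLM_ROCM_USE_AITER_MOE=0` when benchmarking `google/gemma-4-31B-it`.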
+
 >[!NOTE]
 >We enable [AITER](https://github.com/ROCm/aiter) during `docker run` via `--env VLLM_ROCM_USE_AITER=1` for best performance
 >on MI3xx (i.e. gfx942 and gfx950) platforms. If you're using this docker image on other AMD GPUs e.g. MI2xx or Radeon,
@@ -345,6 +365,10 @@ owners and are only mentioned for informative purposes.
 
 ----------
 
 This release note summarizes notable changes since the previous docker release.
+MAD `pyt_vllm_gemma-4-*` configs (see [`default.yaml`](../../scripts/vllm/configs/default.yaml)):
+- **gemma-4-26B-A4B-it:** sets `VLLM_ROCM_USE_AITER_MOE=0` (Triton MoE path); the default `max_concurrency` sweep is narrowed to `1 8` to avoid OOM across repeated server restarts.
+- **gemma-4-31B-it:** keeps the full `1 8 32 128` sweep; no `VLLM_ROCM_USE_AITER_MOE` override.
+
 v0.17.1 release:
 - Includes documentation and patches for upstream releases. Please track https://github.com/vllm-project/vllm/releases for all future release notes.
diff --git a/docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile b/docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile
new file mode 100644
index 0000000..c5876ad
--- /dev/null
+++ b/docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile
@@ -0,0 +1,43 @@
+# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+# Gemma 4 requires a vLLM build with Gemma4 support; see the vLLM recipes (Google/Gemma4.md).
+ARG BASE_DOCKER=vllm/vllm-openai-rocm:gemma4
+FROM $BASE_DOCKER
+
+USER root
+ENV WORKSPACE_DIR=/workspace
+RUN mkdir -p $WORKSPACE_DIR
+WORKDIR $WORKSPACE_DIR
+
+# Pin the Transformers version for reproducible builds
+RUN pip3 install --no-cache-dir "transformers==5.5.0"
+
+# Record the installed package set for posterity
+RUN pip3 list
+
+# Clear the upstream entrypoint so benchmark scripts can be invoked directly
+ENTRYPOINT [""]
diff --git a/models.json b/models.json
index b24ac6c..e764c52 100644
--- a/models.json
+++ b/models.json
@@ -487,6 +487,42 @@
         "args": "--model_repo Qwen/Qwen3-8B --config configs/extended.yaml"
     },
+    {
+        "name": "pyt_vllm_gemma-4-26b-a4b-it",
+        "data": "huggingface",
+        "dockerfile": "docker/pyt_vllm_gemma4",
+        "scripts": "scripts/vllm/run.sh",
+        "n_gpus": "-1",
+        "owner": "mad.support@amd.com",
+        "training_precision": "",
+        "multiple_results": "perf_gemma-4-26B-A4B-it.csv",
+        "tags": [
+            "pyt",
+            "vllm",
+            "vllm_extended",
+            "inference"
+        ],
+        "timeout": -1,
+        "args": "--model_repo google/gemma-4-26B-A4B-it --config configs/default.yaml"
+    },
+    {
+        "name": "pyt_vllm_gemma-4-31b-it",
+        "data": "huggingface",
+        "dockerfile": "docker/pyt_vllm_gemma4",
+        "scripts": "scripts/vllm/run.sh",
+        "n_gpus": "-1",
+        "owner": "mad.support@amd.com",
+        "training_precision": "",
+        "multiple_results": "perf_gemma-4-31B-it.csv",
+        "tags": [
+            "pyt",
+            "vllm",
+            "vllm_extended",
+            "inference"
+        ],
+        "timeout": -1,
+        "args": "--model_repo google/gemma-4-31B-it --config configs/default.yaml"
+    },
     {
         "name": "pyt_vllm_qwen3-32b",
         "data": "huggingface",
diff --git a/scripts/vllm/configs/default.yaml b/scripts/vllm/configs/default.yaml
index 038480d..ce09938 100644
--- a/scripts/vllm/configs/default.yaml
+++ b/scripts/vllm/configs/default.yaml
@@ -92,6 +92,49 @@
     VLLM_ROCM_USE_AITER: 1
   extra_args:
     --attention-backend: ROCM_ATTN
+  arch_overrides:
+    gfx942:
+      dtype: float16
+
+## Gemma 4: the vLLM recipe recommends one MI300-class GPU (BF16); tp 1 suffices for the text-only benchmark.
+## Use TRITON_ATTN (the Gemma4 default). For 26B-A4B MoE, set VLLM_ROCM_USE_AITER_MOE=0 and narrow the concurrency sweep to avoid OOM.
+- benchmark: serving
+  model: google/gemma-4-26B-A4B-it
+  tp: 1
+  inp: 1024
+  out: 1024
+  dtype: auto
+  max_concurrency: 1 8
+  env:
+    VLLM_ROCM_USE_AITER: 1
+    VLLM_ROCM_USE_AITER_MOE: 0
+  extra_args:
+    --attention-backend: TRITON_ATTN
+    --max-model-len: 32768
+    --gpu-memory-utilization: 0.90
+    --limit-mm-per-prompt: '{"image":0,"audio":0}'
+    --async-scheduling: True
+  arch_overrides:
+    gfx942:
+      dtype: float16
+
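+## 31B dense entry: same serving recipe, but no VLLM_ROCM_USE_AITER_MOE override is needed
+## and the full max_concurrency sweep (1 8 32 128) fits at tp 1.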
+- benchmark: serving
+  model: google/gemma-4-31B-it
+  tp: 1
+  inp: 1024
+  out: 1024
+  dtype: auto
+  max_concurrency: 1 8 32 128
+  env:
+    VLLM_ROCM_USE_AITER: 1
+  extra_args:
+    --attention-backend: TRITON_ATTN
+    --max-model-len: 32768
+    --gpu-memory-utilization: 0.90
+    --limit-mm-per-prompt: '{"image":0,"audio":0}'
+    --async-scheduling: True
   arch_overrides:
     gfx942:
       dtype: float16
\ No newline at end of file
diff --git a/scripts/vllm/run_vllm.py b/scripts/vllm/run_vllm.py
index b5b5db0..3d20c08 100644
--- a/scripts/vllm/run_vllm.py
+++ b/scripts/vllm/run_vllm.py
@@ -34,6 +34,7 @@
 import signal
 import argparse
 import itertools
+import shlex
 import subprocess
 from typing import List, Dict
 
@@ -490,7 +491,19 @@ def main():
         if isinstance(v, bool):
             extra_args_str += f" {k}"
         else:
-            extra_args_str += f" {k} {v}"
+            # Shell-quote values that carry JSON (e.g. --limit-mm-per-prompt's
+            # '{"image":0,"audio":0}'), start with '{'/'[', or contain whitespace,
+            # so each survives as a single argument on the generated command line.
+            s = str(v)
+            st = s.strip()
+            if (
+                k == "--limit-mm-per-prompt"
+                or (st[:1] in "{[")
+                or any(ch.isspace() for ch in s)
+            ):
+                extra_args_str += f" {k} {shlex.quote(s)}"
+            else:
+                extra_args_str += f" {k} {v}"
 
     config["env"] = env_vars_str
     config["extra_args"] = extra_args_str