From c6531d2fc2a5c7b68c2ad280817731a5c4546c5b Mon Sep 17 00:00:00 2001 From: zRzRzRzRzRzRzR <2448370773@qq.com> Date: Thu, 14 Aug 2025 19:57:37 +0800 Subject: [PATCH 1/8] glm45 blog Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> --- .gitignore | 6 ++ _posts/2025-08-15-glm45-vllm.md | 103 ++++++++++++++++++++++++++++++++ 2 files changed, 109 insertions(+) create mode 100644 _posts/2025-08-15-glm45-vllm.md diff --git a/.gitignore b/.gitignore index d96f072..566b326 100644 --- a/.gitignore +++ b/.gitignore @@ -20,3 +20,9 @@ Gemfile.lock .Trashes ehthumbs.db Thumbs.db + +# IDE and venv +.idea +.vscode +.venv +venv diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md new file mode 100644 index 0000000..ce4cfcd --- /dev/null +++ b/_posts/2025-08-15-glm45-vllm.md @@ -0,0 +1,103 @@ +--- +layout: post +title: "Use vLLM to speed " +author: "Yuxuan Zhang" +image: /assets/logos/vllm-logo-text-light.png +--- +
+# Introduction +
+The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total
parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total
parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities
to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and
tool usage, and non-thinking mode for immediate responses.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional
performance with a score of 63.2, ranking 3rd among all proprietary and open-source models. Notably,
GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. 
+ +![bench_45](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/bench.png) + +GLM-4.5V is based on GLM-4.5-Air. It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance +among models of the same scale on 42 public vision-language benchmarks. + +![bench_45v](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg) + +To get more information about GLM-4.5 and GLM-4.5V, please refer to the [GLM-4.5](https://github.com/zai-org/GLM-4.5) +and [GLM-V](https://github.com/zai-org/GLM-V). + +this blog will guide users on how to use vLLM to accelerate inference for the GLM-4.5V and GLM-4.5 model series on +NVIDIA Blackwell and Hopper GPUs. + +## Installation + +In the latest vLLM main branch, both the GLM-4.5V and GLM-4.5 model series are supported. +You can install the nightly version and manually update transformers to enable model support. + +```shell +pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly +pip install transformers-v4.55.0-GLM-4.5V-preview +``` + +## Usage + +GLM-4.5 and GLM-4.5V both offer FP8 and BF16 precision models. +In vLLM, you can use the same command to run inference for either precision. + +For the GLM-4.5 model, you can start the service with the following command: + +```shell +vllm serve zai-org/GLM-4.5-Air \ + --tensor-parallel-size 4 \ + --tool-call-parser glm45 \ + --reasoning-parser glm45 \ + --enable-auto-tool-choice +``` + +For the GLM-4.5V model, you can start the service with the following command: + +```shell +vllm serve zai-org/GLM-4.5V \ + --tensor-parallel-size 4 \ + --tool-call-parser glm45 \ + --reasoning-parser glm45 \ + --enable-auto-tool-choice \ + --allowed-local-media-path / \ + --media-io-kwargs '{"video": {"num_frames": -1}}' +``` + +## Important Notes + ++ The reasoning part of the model output will be wrapped in `reasoning_content`. `content` will only contain the final + answer. 
To disable reasoning, add the following parameter:
  `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
++ If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, you'll need
  `--cpu-offload-gb 16`.
++ If you encounter `flash infer` issues, use `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary replacement. You can also
  specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use `flash infer`, different GPUs have different TORCH_CUDA_ARCH_LIST
  values, please check accordingly.
++ vllm v0 is not support our model.

### Grounding in GLM-4.5V

GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific object, GLM-4.5V
is able to reason step-by-step and identify the bounding boxes of the target object. The query prompt supports
complex descriptions of the target object as well as specified output formats, for example:
>
> - Help me to locate <expr> in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description.

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$
composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image
width (for x) or height (for y) and scaled by 1000.

In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in
the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates
of the box.

## Acknowledgement

vLLM team members who contributed to this effort are: Simon Mo, Kaichao You. 
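The grounding output described above can be decoded with a few lines of Python. This is a minimal sketch, not part of vLLM or the GLM-4.5V API — the helper name and the number-extraction heuristic (which deliberately tolerates the varying bracket styles) are our own:

```python
import re

# Hypothetical helper: extract the box wrapped in <|begin_of_box|>...<|end_of_box|>
# and map the 0-1000-normalized coordinates back to pixel space.
def denormalize_box(text, image_width, image_height):
    m = re.search(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", text, re.S)
    if m is None:
        return None
    # The bracket style may vary ([], [[]], (), <>), so just keep the numbers.
    x1, y1, x2, y2 = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", m.group(1))][:4]
    # Each value is normalized by width (x) or height (y) and scaled by 1000.
    return (x1 * image_width / 1000, y1 * image_height / 1000,
            x2 * image_width / 1000, y2 * image_height / 1000)

print(denormalize_box("<|begin_of_box|>[[100,250,400,500]]<|end_of_box|>", 800, 600))
# (80.0, 150.0, 320.0, 300.0)
```

For an 800x600 image, the normalized box `[[100,250,400,500]]` maps back to pixel corners (80, 150) and (320, 300).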
From 0a53f41d0c5f641ba19911ff8468bfed6b71fc57 Mon Sep 17 00:00:00 2001 From: zRzRzRzRzRzRzR <2448370773@qq.com> Date: Fri, 15 Aug 2025 15:33:25 +0800 Subject: [PATCH 2/8] only Acknowledgement remain Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> --- _posts/2025-08-15-glm45-vllm.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md index ce4cfcd..4ead5ce 100644 --- a/_posts/2025-08-15-glm45-vllm.md +++ b/_posts/2025-08-15-glm45-vllm.md @@ -1,11 +1,11 @@ --- layout: post -title: "Use vLLM to speed " +title: "Use vLLM to deploy GLM-4.5 and GLM-4.5V model" author: "Yuxuan Zhang" image: /assets/logos/vllm-logo-text-light.png --- -# Introduction +# Model Introduction The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total @@ -98,6 +98,16 @@ In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates of the box. +## Cooperation with vLLM and Z.ai Team + +During the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the Z.ai team, providing +extensive support in addressing issues related to the model launch. +The GLM-4.5 and GLM-4.5V models provided by the Z.ai team were modified in the vLLM implementation PR, including (but +not limited to) resolving [CUDA Core Dump](./2025-08-11-cuda-debugging.md) debugging issues and FP8 model accuracy +alignment problems. +They also ensured that the vLLM `main` branch had full support for the open-source GLM-4.5 series before the models were +released. + ## Acknowledgement -vLLM team members who contributed to this effort are: Simon Mo, Kaichao You. 
+We would like to thank the vLLM team members who contributed to this effort are: Simon Mo, Kaichao You. From fc55b114b4b083adb2e250172f91270dfacc9453 Mon Sep 17 00:00:00 2001 From: zRzRzRzRzRzRzR <2448370773@qq.com> Date: Sat, 16 Aug 2025 18:11:46 +0800 Subject: [PATCH 3/8] rollback Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> --- .gitignore | 6 ------ 1 file changed, 6 deletions(-) diff --git a/.gitignore b/.gitignore index 566b326..d96f072 100644 --- a/.gitignore +++ b/.gitignore @@ -20,9 +20,3 @@ Gemfile.lock .Trashes ehthumbs.db Thumbs.db - -# IDE and venv -.idea -.vscode -.venv -venv From 2922fec26b21722e56f3300a23c6a8b1e37a4117 Mon Sep 17 00:00:00 2001 From: zRzRzRzRzRzRzR <2448370773@qq.com> Date: Sat, 16 Aug 2025 18:16:03 +0800 Subject: [PATCH 4/8] changed # Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com> --- _posts/2025-08-15-glm45-vllm.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md index 4ead5ce..f8e0bc6 100644 --- a/_posts/2025-08-15-glm45-vllm.md +++ b/_posts/2025-08-15-glm45-vllm.md @@ -5,7 +5,9 @@ author: "Yuxuan Zhang" image: /assets/logos/vllm-logo-text-light.png --- -# Model Introduction +# Use vLLM to deploy GLM-4.5 and GLM-4.5V model + +## Model Introduction The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total @@ -69,7 +71,7 @@ vllm serve zai-org/GLM-4.5V \ --media-io-kwargs '{"video": {"num_frames": -1}}' ``` -## Important Notes +### Important Notes + The reasoning part of the model output will be wrapped in `reasoning_content`. `content` will only contain the final answer. To disable reasoning, add the following parameter: @@ -86,7 +88,7 @@ vllm serve zai-org/GLM-4.5V \ GLM-4.5V equips precise grounding capabilities. 
Given a prompt that requests the location of a specific object, GLM-4.5V is able to reason step-by-step and identify the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats, for example: -> + > - Help me to locate <expr> in the image and give me its bounding boxes. > - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. From a3bdea8c738bd67b2ff8c2068680df0e2445e899 Mon Sep 17 00:00:00 2001 From: youkaichao Date: Mon, 18 Aug 2025 16:29:26 +0800 Subject: [PATCH 5/8] polish Signed-off-by: youkaichao --- _posts/2025-08-15-glm45-vllm.md | 38 +++++++++++++++------------------ 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md index f8e0bc6..6b15407 100644 --- a/_posts/2025-08-15-glm45-vllm.md +++ b/_posts/2025-08-15-glm45-vllm.md @@ -1,15 +1,15 @@ --- layout: post -title: "Use vLLM to deploy GLM-4.5 and GLM-4.5V model" +title: "GLM-4.5 Meets vLLM: Built for Intelligent Agents" author: "Yuxuan Zhang" image: /assets/logos/vllm-logo-text-light.png --- -# Use vLLM to deploy GLM-4.5 and GLM-4.5V model +## Introduction -## Model Introduction +[General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26/) is a family of foundation models created by Zhipu.ai (now renamed to [Z.ai](https://z.ai/)). The GLM team has long-term collaboration with vLLM team, dating back to the early days of vLLM and the popular [ChatGLM model series](https://github.com/zai-org/ChatGLM-6B). Recently, the GLM team released the GLM-4.5 and GLM-4.5V model series, which are designed for intelligent agents. They are the top trending models in Huggingface model hub right now. -The GLM-4.5 series models are foundation models designed for intelligent agents. 
GLM-4.5 has 355 billion total
parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total
parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities
to meet the complex demands of intelligent agent applications. @@ -28,10 +28,10 @@ among models of the same scale on 42 public vision-language benchmarks. ![bench_45v](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg) -To get more information about GLM-4.5 and GLM-4.5V, please refer to the [GLM-4.5](https://github.com/zai-org/GLM-4.5) +To get more information about GLM-4.5 and GLM-4.5V, please refer to [GLM-4.5](https://github.com/zai-org/GLM-4.5) and [GLM-V](https://github.com/zai-org/GLM-V). -this blog will guide users on how to use vLLM to accelerate inference for the GLM-4.5V and GLM-4.5 model series on +This blog will guide users on how to use vLLM to accelerate inference for the GLM-4.5V and GLM-4.5 model series on NVIDIA Blackwell and Hopper GPUs. ## Installation @@ -78,19 +78,19 @@ vllm serve zai-org/GLM-4.5V \ `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` + If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, you'll need `--cpu-offload-gb 16`. -+ If you encounter `flash infer` issues, use `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary replacement. You can also - specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use `flash infer`, different GPUs have different TORCH_CUDA_ARCH_LIST ++ If you encounter `FlashInfer` issues, use `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary replacement. You can also + specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use `FlashInfer`; different GPUs have different TORCH_CUDA_ARCH_LIST values, please check accordingly. -+ vllm v0 is not support our model. ++ vLLM v0 does not support our model. 
### Grounding in GLM-4.5V GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific object, GLM-4.5V is able to reason step-by-step and identify the bounding boxes of the target object. The query prompt supports -complex descriptions of the target object as well as specified output formats, for example: +complex descriptions of the target object as well as specified output formats. Example prompts are: -> - Help me to locate <expr> in the image and give me its bounding boxes. -> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. +- Help me to locate `<expr>` in the image and give me its bounding boxes. +- Please pinpoint the bounding box `[[x1,y1,x2,y2], …]` in the image as per the given description. Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$ composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image width (for x) or height (for y) and scaled by 1000. @@ -100,16 +100,12 @@ In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates of the box. -## Cooperation with vLLM and Z.ai Team +## Cooperation with vLLM and GLM Team -During the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the Z.ai team, providing -extensive support in addressing issues related to the model launch. -The GLM-4.5 and GLM-4.5V models provided by the Z.ai team were modified in the vLLM implementation PR, including (but -not limited to) resolving [CUDA Core Dump](./2025-08-11-cuda-debugging.md) debugging issues and FP8 model accuracy -alignment problems. -They also ensured that the vLLM `main` branch had full support for the open-source GLM-4.5 series before the models were -released. 
+Before the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the GLM team, providing +extensive support in addressing issues related to the model launch, ensuring that the vLLM `main` branch had full +support for the open-source GLM-4.5 series before the models were released. ## Acknowledgement -We would like to thank the vLLM team members who contributed to this effort are: Simon Mo, Kaichao You. +We would like to thank the vLLM team members who contributed to this effort, including: Simon Mo, Kaichao You. From bbc8bd7ab9de596a6d6e620bfa8a298df9d31f70 Mon Sep 17 00:00:00 2001 From: youkaichao Date: Mon, 18 Aug 2025 16:40:24 +0800 Subject: [PATCH 6/8] polish Signed-off-by: youkaichao --- _posts/2025-08-15-glm45-vllm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md index 6b15407..2629678 100644 --- a/_posts/2025-08-15-glm45-vllm.md +++ b/_posts/2025-08-15-glm45-vllm.md @@ -108,4 +108,4 @@ support for the open-source GLM-4.5 series before the models were released. ## Acknowledgement -We would like to thank the vLLM team members who contributed to this effort, including: Simon Mo, Kaichao You. +We would like to thank many people from the vLLM side who contributed to this effort, including: Kaichao You, Simon Mo, Zifeng Mo, Lucia Fang, Rui Qiao, Jie Le, Ce Gao, Roger Wang, Lu Fang, Wentao Ye, and Zixi Qi. 
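The `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` note in the post above corresponds to a plain field in the JSON body sent to vLLM's OpenAI-compatible server. A minimal sketch of that request body — the model name comes from the serve command in the post, the endpoint is vLLM's default, and the example message is our own:

```python
import json

# Sketch of the JSON body for a POST to http://localhost:8000/v1/chat/completions.
# With the openai-python client, extra_body is merged into this top-level body;
# chat_template_kwargs is forwarded to the GLM-4.5 chat template.
payload = {
    "model": "zai-org/GLM-4.5-Air",
    "messages": [{"role": "user", "content": "Summarize this repo in one line."}],
    # Switches off thinking mode, so content holds the answer directly.
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
print(body)
```

Everything besides `chat_template_kwargs` follows the standard OpenAI chat-completions schema.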
From da41caf58c4b8acd6dd8d80f599366ed209e8e4a Mon Sep 17 00:00:00 2001 From: youkaichao Date: Mon, 18 Aug 2025 16:44:05 +0800 Subject: [PATCH 7/8] polish Signed-off-by: youkaichao --- _posts/2025-08-15-glm45-vllm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md index 2629678..e5042c8 100644 --- a/_posts/2025-08-15-glm45-vllm.md +++ b/_posts/2025-08-15-glm45-vllm.md @@ -7,7 +7,7 @@ image: /assets/logos/vllm-logo-text-light.png --- ## Introduction -[General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26/) is a family of foundation models created by Zhipu.ai (now renamed to [Z.ai](https://z.ai/)). The GLM team has long-term collaboration with vLLM team, dating back to the early days of vLLM and the popular [ChatGLM model series](https://github.com/zai-org/ChatGLM-6B). Recently, the GLM team released the GLM-4.5 and GLM-4.5V model series, which are designed for intelligent agents. They are the top trending models in Huggingface model hub right now. +[General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26/) is a family of foundation models created by Zhipu.ai (now renamed to [Z.ai](https://z.ai/)). The GLM team has a long-term collaboration with the vLLM team, dating back to the early days of vLLM and the popular [ChatGLM model series](https://github.com/zai-org/ChatGLM-6B). Recently, the GLM team released the [GLM-4.5](https://arxiv.org/abs/2508.06471) and [GLM-4.5V](https://arxiv.org/abs/2507.01006) model series, which are designed for intelligent agents. They are currently the top trending models on the Hugging Face model hub. 
GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total From 511c1f1c7cc6791948e49928a8dc418354878a8a Mon Sep 17 00:00:00 2001 From: youkaichao Date: Tue, 19 Aug 2025 15:48:22 +0800 Subject: [PATCH 8/8] update date Signed-off-by: youkaichao --- _posts/{2025-08-15-glm45-vllm.md => 2025-08-18-glm45-vllm.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _posts/{2025-08-15-glm45-vllm.md => 2025-08-18-glm45-vllm.md} (100%) diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-18-glm45-vllm.md similarity index 100% rename from _posts/2025-08-15-glm45-vllm.md rename to _posts/2025-08-18-glm45-vllm.md
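The `--cpu-offload-gb 16` advice for GLM-4.5 on 8x H100 can be sanity-checked with back-of-envelope arithmetic. The sketch below counts weight bytes only and assumes 80 GB of HBM per H100; KV cache, activations, and runtime overhead are ignored, so the numbers are illustrative rather than a sizing guide:

```python
# Weight memory lower bound: params (billions) * bytes per param gives GB,
# since the 1e9 params and 1e9 bytes/GB factors cancel.
def weight_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param

glm45_bf16 = weight_gb(355, 2)   # 710 GB of BF16 weights for GLM-4.5
glm45_fp8  = weight_gb(355, 1)   # 355 GB at FP8
air_bf16   = weight_gb(106, 2)   # 212 GB for GLM-4.5-Air
hbm_8xh100 = 8 * 80              # 640 GB of HBM across 8x H100 (80 GB each)

# BF16 GLM-4.5 weights alone exceed the 8x H100 pool, which is why the post
# suggests spilling part of the weights to host memory with --cpu-offload-gb.
print(glm45_bf16, hbm_8xh100, glm45_bf16 > hbm_8xh100)
# 710 640 True
```

By the same arithmetic, the FP8 checkpoint (355 GB) and GLM-4.5-Air in BF16 (212 GB) fit comfortably without offloading.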