@@ -215,15 +215,12 @@ This way, only the first GPU will be mapped into the Docker container.
 ### 1.4 Launching the Serving Service
 
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
+  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
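Once the server is up, clients address the model by the name passed to `--served-model-name`. A minimal request might look like the following sketch (the prompt and `max_tokens` value are only illustrative):

```bash
# Illustrative chat-completions request against the served model name
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```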
@@ -248,6 +245,7 @@ you can add the argument `--api-key xxx` for user authentication. Users are supp
 python3 /llm/vllm/benchmarks/benchmark_serving.py \
   --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
   --dataset-name random \
+  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
   --random-input-len=1024 \
   --random-output-len=512 \
   --ignore-eos \
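If the server was launched with `--api-key xxx` as mentioned above, clients are expected to send the same value as a bearer token; a minimal sketch (the key value is just the placeholder from above):

```bash
# List the served models, authenticating with the placeholder API key
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer xxx"
```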
@@ -2021,15 +2019,11 @@ To enable online quantization using `llm-scaler-vllm`, specify the desired quant
 The following example shows how to launch the server with `sym_int4` quantization:
 
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2045,28 +2039,26 @@ python3 -m vllm.entrypoints.openai.api_server \
   -tp=1
 ```
 
-To use fp8 quantization, simply replace `--quantization sym_int4` with:
+To use fp8 online quantization, simply replace `--quantization sym_int4` with:
 
 ```bash
 --quantization fp8
 ```
+
+For models that have already been quantized, such as AWQ-Int4, GPTQ-Int4, or FP8 models, users do not need to specify the `--quantization` option.
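For example, a pre-quantized checkpoint is served with the same command minus the `--quantization` flag; a sketch assuming a hypothetical GPTQ-Int4 model directory under `/llm/models`:

```bash
# /llm/models/DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4 is a hypothetical pre-quantized checkpoint
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4 \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4 \
  --dtype=float16 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0 \
  -tp=1
```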
 ---
 
 ### 2.3 Embedding and Reranker Model Support
 
-#### Start service using V0 engine
+#### Start service with embedding task
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=0 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
-  --model /llm/models/bge-reranker-large \
-  --served-model-name bge-reranker-large \
+  --model /llm/models/bge-m3 \
+  --served-model-name bge-m3 \
   --task embed \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2077,12 +2069,48 @@ python3 -m vllm.entrypoints.openai.api_server \
   --max-num-batched-tokens=2048 \
   --disable-log-requests \
   --max-model-len=2048 \
-  --block-size 16 \
-  --quantization fp8 \
+  --block-size 64 \
   -tp=1
 ```
 
-After starting the vLLM service, you can follow these two links to use it.
+---
+After starting the vLLM service, you can follow the link below to use it.
+#### [Embedding API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#embeddings-api_1)
+
+```bash
+curl http://localhost:8000/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{
+    "input": ["The first text to embed", "This is the second sentence"],
+    "model": "bge-m3",
+    "encoding_format": "float"
+  }'
+```
+
+#### Start service with classify task
+
+```bash
+VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
+VLLM_WORKER_MULTIPROC_METHOD=spawn \
+python3 -m vllm.entrypoints.openai.api_server \
+  --model /llm/models/bge-reranker-base \
+  --served-model-name bge-reranker-base \
+  --task classify \
+  --dtype=float16 \
+  --enforce-eager \
+  --port 8000 \
+  --host 0.0.0.0 \
+  --trust-remote-code \
+  --disable-sliding-window \
+  --gpu-memory-util=0.9 \
+  --no-enable-prefix-caching \
+  --max-num-batched-tokens=2048 \
+  --disable-log-requests \
+  --max-model-len=2048 \
+  --block-size 64 \
+  -tp=1
+```
+After starting the vLLM service, you can follow the link below to use it.
 #### [Rerank API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#re-rank-api)
 
 ```bash
@@ -2091,7 +2119,7 @@ curl -X 'POST' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
-  "model": "bge-reranker-large",
+  "model": "bge-reranker-base",
   "query": "What is the capital of France?",
   "documents": [
     "The capital of Brazil is Brasilia.",
@@ -2102,34 +2130,20 @@ curl -X 'POST' \
 }'
 ```
 
-#### [Embedding api](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#embeddings-api_1)
 
-```bash
-curl http://localhost:8000/v1/embeddings \
-  -H "Content-Type: application/json" \
-  -d '{
-    "input": ["需要嵌入文本1","这是第二个句子"],
-    "model": "bge-m3",
-    "encoding_format": "float"
-  }'
-```
 ---
 
 ### 2.4 Multi-Modal Model Support
 
 #### Start service using V1 engine
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/Qwen2.5-VL-7B-Instruct \
   --served-model-name Qwen2.5-VL-7B-Instruct \
   --allowed-local-media-path /llm/models/test \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2175,7 +2189,7 @@ curl http://localhost:8000/v1/chat/completions \
 ```
 ---
 
-### 2.4.1 Audio Model Support
+### 2.4.1 Audio Model Support [Deprecated]
 
 #### Install audio dependencies
 ```bash
@@ -2194,7 +2208,6 @@ python3 -m vllm.entrypoints.openai.api_server \
   --served-model-name whisper-medium \
   --allowed-local-media-path /llm/models/test \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2230,17 +2243,13 @@ pip install librosa soundfile
 
 #### Start service using V1 engine
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/Qwen2.5-Omni-7B \
   --served-model-name Qwen2.5-Omni-7B \
   --allowed-local-media-path /llm/models/test \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2412,17 +2421,20 @@ cd /llm
 python3 -m vllm.entrypoints.openai.api_server
 ```
 
+### 4.2 Out-of-memory during online quantization
 
-## 5. Performance tuning
+When the model is very large, FP8 online quantization may cause out-of-memory errors.
 
-To maximize performance, configure the following environment variables inside the container:
+To avoid this issue, set the following environment variable before starting the service:
 
 ```bash
-unset TRITON_XPU_PROFILE
-export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=0
+export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
 ```
 
-In addition, you can optimize CPU affinity based on the GPU–NUMA topology.
+
+## 5. Performance tuning
+
+To improve performance, you can optimize CPU affinity based on the GPU–NUMA topology.
 
 For example, if your process uses two GPUs that are both connected to NUMA node 0, you can use `lscpu` to identify the CPU cores associated with that NUMA node:
 
@@ -2442,8 +2454,3 @@ numactl -C 0-17 YOUR_COMMAND
 ```
 
 This ensures that the CPU threads serving your GPUs remain on the optimal NUMA node, reducing memory access latency and improving throughput.
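As a concrete sketch, the pinning can be combined with the serving command from section 1.4 (the CPU range `0-17` and the trimmed set of flags are illustrative; use the cores reported by `lscpu` for your GPUs' NUMA node):

```bash
# Pin the serving process to the CPU cores of the GPUs' NUMA node (range is illustrative)
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
numactl -C 0-17 python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
  --dtype=float16 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0 \
  -tp=1
```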
-
-
-
-
-