
Commit c04b5f5

Change to 0.10.0 (#60)
* fix
* done
* update
1 parent 017469f commit c04b5f5

File tree: 4 files changed (+5607, -9382 lines)


vllm/README.md

Lines changed: 57 additions & 50 deletions
@@ -215,15 +215,12 @@ This way, only the first GPU will be mapped into the Docker container.
 ### 1.4 Launching the Serving Service
 
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
+  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
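Once the service is up, clients address the model by the name given to `--served-model-name` through the standard OpenAI-compatible endpoint. A minimal smoke-test sketch (the prompt and sampling parameters below are illustrative, not part of this commit):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Give a one-sentence summary of tensor parallelism."}],
    "max_tokens": 128,
    "temperature": 0.6
  }'
```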
@@ -248,6 +245,7 @@ you can add the argument `--api-key xxx` for user authentication. Users are supp
 python3 /llm/vllm/benchmarks/benchmark_serving.py \
   --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
   --dataset-name random \
+  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
   --random-input-len=1024 \
   --random-output-len=512 \
   --ignore-eos \
@@ -2021,15 +2019,11 @@ To enable online quantization using `llm-scaler-vllm`, specify the desired quant
 The following example shows how to launch the server with `sym_int4` quantization:
 
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2045,28 +2039,26 @@ python3 -m vllm.entrypoints.openai.api_server \
   -tp=1
 ```
 
-To use fp8 quantization, simply replace `--quantization sym_int4` with:
+To use fp8 online quantization, simply replace `--quantization sym_int4` with:
 
 ```bash
 --quantization fp8
 ```
+
+For models that have already been quantized, such as AWQ-Int4/GPTQ-Int4/FP8 checkpoints, users do not need to specify the `--quantization` option.
 ---
 
 ### 2.3 Embedding and Reranker Model Support
 
-#### Start service using V0 engine
+#### Start service with embedding task
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=0 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
-  --model /llm/models/bge-reranker-large \
-  --served-model-name bge-reranker-large \
+  --model /llm/models/bge-m3 \
+  --served-model-name bge-m3 \
   --task embed \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
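Following the note in the hunk above: for a checkpoint that already ships quantized weights (AWQ-Int4, GPTQ-Int4, FP8), vLLM reads the quantization method from the model's own config, so the launch command simply drops `--quantization`. A minimal sketch, assuming a hypothetical local AWQ checkpoint at `/llm/models/Qwen2.5-7B-Instruct-AWQ`; all other flags are reused from commands already in this README:

```bash
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-7B-Instruct-AWQ \
  --served-model-name Qwen2.5-7B-Instruct-AWQ \
  --dtype=float16 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0 \
  -tp=1
```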
@@ -2077,12 +2069,48 @@ python3 -m vllm.entrypoints.openai.api_server \
   --max-num-batched-tokens=2048 \
   --disable-log-requests \
   --max-model-len=2048 \
-  --block-size 16 \
-  --quantization fp8 \
+  --block-size 64 \
   -tp=1
 ```
 
-After starting the vLLM service, you can follow these two links to use it.
+---
+After starting the vLLM service, you can follow this link to use it.
+#### [Embedding api](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#embeddings-api_1)
+
+```bash
+curl http://localhost:8000/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{
+    "input": ["Text to embed 1", "This is the second sentence"],
+    "model": "bge-m3",
+    "encoding_format": "float"
+  }'
+```
+
+#### Start service with classify task
+
+```bash
+VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
+VLLM_WORKER_MULTIPROC_METHOD=spawn \
+python3 -m vllm.entrypoints.openai.api_server \
+  --model /llm/models/bge-reranker-base \
+  --served-model-name bge-reranker-base \
+  --task classify \
+  --dtype=float16 \
+  --enforce-eager \
+  --port 8000 \
+  --host 0.0.0.0 \
+  --trust-remote-code \
+  --disable-sliding-window \
+  --gpu-memory-util=0.9 \
+  --no-enable-prefix-caching \
+  --max-num-batched-tokens=2048 \
+  --disable-log-requests \
+  --max-model-len=2048 \
+  --block-size 64 \
+  -tp=1
+```
+After starting the vLLM service, you can follow this link to use it.
 #### [Rerank api](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#re-rank-api)
 
 ```bash
@@ -2091,7 +2119,7 @@ curl -X 'POST' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "bge-reranker-large",
+    "model": "bge-reranker-base",
     "query": "What is the capital of France?",
     "documents": [
       "The capital of Brazil is Brasilia.",
@@ -2102,34 +2130,20 @@ curl -X 'POST' \
 }'
 ```
 
-#### [Embedding api](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#embeddings-api_1)
 
-```bash
-curl http://localhost:8000/v1/embeddings \
-  -H "Content-Type: application/json" \
-  -d '{
-    "input": ["Text to embed 1", "This is the second sentence"],
-    "model": "bge-m3",
-    "encoding_format": "float"
-  }'
-```
 ---
 
 ### 2.4 Multi-Modal Model Support
 
 #### Start service using V1 engine
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/Qwen2.5-VL-7B-Instruct \
   --served-model-name Qwen2.5-VL-7B-Instruct \
   --allowed-local-media-path /llm/models/test \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
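With the Qwen2.5-VL service above, image inputs go through the same chat completions endpoint using vLLM's OpenAI-compatible multimodal message format; `--allowed-local-media-path` is what permits `file://` URLs under that directory. A hedged sketch (the image file name is hypothetical):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "file:///llm/models/test/example.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'
```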
@@ -2175,7 +2189,7 @@ curl http://localhost:8000/v1/chat/completions \
 ```
 ---
 
-### 2.4.1 Audio Model Support
+### 2.4.1 Audio Model Support [Deprecated]
 
 #### Install audio dependencies
 ```bash
@@ -2194,7 +2208,6 @@ python3 -m vllm.entrypoints.openai.api_server \
   --served-model-name whisper-medium \
   --allowed-local-media-path /llm/models/test \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
@@ -2230,17 +2243,13 @@ pip install librosa soundfile
 
 #### Start service using V1 engine
 ```bash
-TORCH_LLM_ALLREDUCE=1 \
-VLLM_USE_V1=1 \
-CCL_ZE_IPC_EXCHANGE=pidfd \
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 python3 -m vllm.entrypoints.openai.api_server \
   --model /llm/models/Qwen2.5-Omni-7B \
   --served-model-name Qwen2.5-Omni-7B \
   --allowed-local-media-path /llm/models/test \
   --dtype=float16 \
-  --device=xpu \
   --enforce-eager \
   --port 8000 \
   --host 0.0.0.0 \
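Audio inputs for the Qwen2.5-Omni service can follow the same multimodal chat format, using an `audio_url` content part instead of `image_url`. This is a sketch based on vLLM's general multimodal API rather than on anything in this commit, and the sample file path is hypothetical:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-Omni-7B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Transcribe and summarize this clip."},
        {"type": "audio_url", "audio_url": {"url": "file:///llm/models/test/sample.wav"}}
      ]
    }],
    "max_tokens": 128
  }'
```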
@@ -2412,17 +2421,20 @@ cd /llm
 python3 -m vllm.entrypoints.openai.api_server
 ```
 
+### 4.2 Out-of-memory during online quantization
 
-## 5. Performance tuning
+When the model size is very large, running FP8 online quantization may cause out-of-memory errors.
 
-To maximize performance, configure the following environment variables inside the container:
+To avoid this issue, set the following environment variable before starting the service:
 
 ```bash
-unset TRITON_XPU_PROFILE
-export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=0
+export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
 ```
 
-In addition, you can optimize CPU affinity based on the GPU–NUMA topology.
+
+## 5. Performance tuning
+
+To improve performance, you can optimize CPU affinity based on the GPU–NUMA topology.
 
 
 For example, if your process uses two GPUs that are both connected to NUMA node 0, you can use lscpu to identify the CPU cores associated with that NUMA node:
@@ -2442,8 +2454,3 @@ numactl -C 0-17 YOUR_COMMAND
 ```
 
 This ensures that the CPU threads serving your GPUs remain on the optimal NUMA node, reducing memory access latency and improving throughput.
-
-
-
-
-
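Tying the last two hunks together: `YOUR_COMMAND` in the `numactl` line is just the usual launch command, and the out-of-memory workaround variable can be set in the same invocation, since environment assignments made before `numactl` are inherited by the Python process it starts. A sketch under those assumptions, reusing only flags that already appear in this README:

```bash
# Pin the server to the CPU cores of NUMA node 0 (0-17 in the lscpu example above)
# and offload weights before FP8 online quantization to avoid OOM on large models.
VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
numactl -C 0-17 python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
  --quantization fp8 \
  --dtype=float16 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0 \
  -tp=1
```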
