64 changes: 34 additions & 30 deletions vllm/README.md
@@ -23,6 +23,7 @@ llm-scaler-vllm is an extended and optimized version of vLLM, specifically adapt
2.7 [Finding maximum Context Length](#27-finding-maximum-context-length)
2.8 [Multi-Modal Webui](#28-multi-modal-webui)
2.9 [Multi-node Distributed Deployment (PP/TP)](#29-multi-node-distributed-deployment-pptp)
2.10 [BPE-Qwen Tokenizer](#210-bpe-qwen-tokenizer)
3. [Supported Models](#3-supported-models)
4. [Troubleshooting](#4-troubleshooting)
5. [Performance tuning](#5-performance-tuning)
@@ -2036,6 +2037,7 @@ To use fp8 online quantization, simply replace `--quantization sym_int4` with:
```

For models that are already quantized, such as AWQ-Int4/GPTQ-Int4/FP8 models, users do not need to specify the `--quantization` option.
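For example, a checkpoint that already contains AWQ-Int4 weights can be served directly; the model path and served name below are illustrative:

```bash
# Illustrative example: a pre-quantized AWQ-Int4 checkpoint is served without
# any --quantization flag; the quantization config shipped with the weights is used.
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-7B-Instruct-AWQ \
  --served-model-name Qwen2.5-7B-Instruct-AWQ \
  --port 8000
```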

---

### 2.3 Embedding and Reranker Model Support
Expand Down Expand Up @@ -2065,6 +2067,7 @@ python3 -m vllm.entrypoints.openai.api_server \

---
After starting the vLLM service, you can follow the links below to use it.

#### [Embedding api](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#embeddings-api_1)

```bash
Expand Down Expand Up @@ -2179,6 +2182,7 @@ curl http://localhost:8000/v1/chat/completions \
```

If you want to process an image stored locally on the server, you can use `"url": "file:/llm/models/test/1.jpg"` to test.
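For example, a minimal request that points at a server-local image could look like the following (assumes the service is reachable on `localhost:8000`; replace `MODEL_NAME` with the value passed to `--served-model-name`):

```bash
# Illustrative chat completion that references an image stored on the server itself.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL_NAME",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "file:/llm/models/test/1.jpg"}}
        ]
      }
    ]
  }'
```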

---

### 2.4.1 Audio Model Support [Deprecated]
Expand Down Expand Up @@ -2228,52 +2232,39 @@ curl http://localhost:8000/v1/audio/transcriptions \

### 2.4.2 dots.ocr Support

To launch `dots.ocr`, follow the instructions in [1.4 Launching the Serving Service](#14-launching-the-serving-service), setting the model path to `/llm/models/dots.ocr`, the served-model-name to `model`, and the port to `8000`.
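A minimal sketch of what that launch might look like under those settings (the complete command, including the required environment variables, is given in section 1.4):

```bash
# Sketch only -- see section 1.4 for the full launch command and environment variables.
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/dots.ocr \
  --served-model-name model \
  --trust-remote-code \
  --port 8000
```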

Once the service is running, you can use the method provided in the `dots.ocr` repository to launch Gradio for testing.

---

#### Clone the repository

```bash
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
```

#### Modify dependencies

Comment out the following two lines in `requirements.txt`:

```
flash-attn==2.8.0.post2
transformers # These two dependencies can cause conflicts
```

Then install the dependencies:

```bash
pip install -e .
```

#### Launch Gradio for testing

```bash
python demo/demo_gradio.py
```

---
Expand Down Expand Up @@ -2678,11 +2669,24 @@ python3 -m vllm.entrypoints.openai.api_server \
--distributed-executor-backend ray
```

---

At this point, multi-node distributed inference with **PP + TP** is running, coordinated by **Ray** across Node-1 and Node-2.
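Before sending requests, you can optionally confirm that both nodes have joined the Ray cluster by running `ray status` on the head node (Node-1); the resources of both machines should appear in the summary:

```bash
# Run on the head node; both nodes' CPUs/GPUs should be listed in the resource summary.
ray status
```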

---


### 2.10 BPE-Qwen Tokenizer

We have integrated the **bpe-qwen tokenizer** to accelerate tokenization for Qwen models.

To enable it when launching the API server, add:

```bash
--tokenizer-mode bpe_qwen
```
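For example, a full launch command with the flag appended might look like this (the model path and served name are illustrative):

```bash
# Illustrative launch of a Qwen model with the bpe-qwen tokenizer enabled.
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-7B-Instruct \
  --served-model-name Qwen2.5-7B-Instruct \
  --tokenizer-mode bpe_qwen \
  --port 8000
```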

---

## 3. Supported Models

| Model Name | Category | Notes |