diff --git a/vllm/README.md b/vllm/README.md
index f81ab0c..1087638 100644
--- a/vllm/README.md
+++ b/vllm/README.md
@@ -23,6 +23,7 @@ llm-scaler-vllm is an extended and optimized version of vLLM, specifically adapt
  2.7 [Finding maximum Context Length](#27-finding-maximum-context-length)
  2.8 [Multi-Modal Webui](#28-multi-modal-webui)
  2.9 [Multi-node Distributed Deployment (PP/TP)](#29-multi-node-distributed-deployment-pptp)
+ 2.10 [BPE-Qwen Tokenizer](#210-bpe-qwen-tokenizer)
 4. [Supported Models](#3-supported-models)
 5. [Troubleshooting](#4-troubleshooting)
 6. [Performance tuning](#5-performance-tuning)
@@ -2036,6 +2037,7 @@ To use fp8 online quantization, simply replace `--quantization sym_int4` with:
 ```
 
 For those models that have been quantized before, such as AWQ-Int4/GPTQ-Int4/FP8 models, users do not need to specify the `--quantization` option.
+
 ---
 
 ### 2.3 Embedding and Reranker Model Support
@@ -2065,6 +2067,7 @@ python3 -m vllm.entrypoints.openai.api_server \
 ---
 
 After starting the vLLM service, you can follow this link to use it.
+
 #### [Embedding api](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#embeddings-api_1)
 
 ```bash
@@ -2179,6 +2182,7 @@ curl http://localhost:8000/v1/chat/completions \
 ```
 
 If you want to process an image stored locally on the server, you can use `"url": "file:/llm/models/test/1.jpg"` to test.
+
 ---
 
 ### 2.4.1 Audio Model Support [Deprecated]
@@ -2228,52 +2232,39 @@ curl http://localhost:8000/v1/audio/transcriptions \
 ### 2.4.2 dots.ocr Support
 
-Git clone the repo:
+To launch `dots.ocr`, follow the instructions in [1.4 Launching the Serving Service](#14-launching-the-serving-service), setting the model path to `/llm/models/dots.ocr`, the served-model-name to `model`, and the port to 8000.
 
-```bash
-https://github.com/rednote-hilab/dots.ocr.git
-cd dots.ocr
-```
-Then, we should comment out the following two items in `requirements.txt`:
+Once the service is running, you can use the method provided in the `dots.ocr` repository to launch Gradio for testing.
 
-> flash-attn==2.8.0.post2 and accelerate # because these two dependencies will require cuda
+---
 
-After commenting out these two elements, we can install the dependencies:
+#### Clone the repository
 
 ```bash
-# Assuming you have installed torch/ipex etc.
-pip install --no-deps accelerate
-pip install -e .
+git clone https://github.com/rednote-hilab/dots.ocr.git
+cd dots.ocr
 ```
 
-To download model weights in `dots.ocr`:
-```bash
-# In dots.ocr
-python3 tools/download_model.py
+#### Modify dependencies
 
-# with modelscope
-python3 tools/download_model.py --type modelscope
+Comment out the following two lines in `requirements.txt`:
+
+```
+flash-attn==2.8.0.post2
+transformers # These two dependencies can cause conflicts
 ```
 
-In order to run dots.ocr, we will need to change codes in `./weights/DotsOCR`:
+Then install the dependencies:
 
 ```bash
-cd ./weights/DotsOCR
-patch -p1 < YOUR_PATH/dots_ocr.patch
+pip install -e .
 ```
 
-Then, you're ready to start:
+#### Launch Gradio for testing
 
 ```bash
-export hf_model_path=./weights/DotsOCR # Path to your downloaded model weights, Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
-export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
-sed -i '/^from vllm\.version import __version__ as VLLM_VERSION$/a\
-from DotsOCR import modeling_dots_ocr_vllm' /usr/local/lib/python3.12/dist-packages/vllm-0.10.1.dev0+g6d8d0a24c.d20250825.xpu-py3.12-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py
-# If you downloaded model weights by yourself, please replace `DotsOCR` by your model saved directory name, and remember to use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`)
-
-# Start the service:
-TORCH_LLM_ALLREDUCE=1 VLLM_USE_V1=1 CCL_ZE_IPC_EXCHANGE=pidfd VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server --model YOUR_DOTSOCR_PATH --enforce-eager --host 0.0.0.0 --trust-remote-code --disable-sliding-window --gpu-memory-util=0.8 --no-enable-prefix-caching --max-num-batched-tokens=8192 --disable-log-requests --max-model-len=40000 --block-size 64 -tp=1 --port 8000 --served-model-name DotsOCR --chat-template-content-format string --dtype bfloat16
+python demo/demo_gradio.py
 ```
 
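+As an alternative to the Gradio demo, you can also send a request directly to the OpenAI-compatible endpoint started above. The sketch below assumes the service was launched as described (served-model-name `model` on port 8000) and reuses the local-file image URL shown earlier in this README; the prompt text and token limit are only illustrative, so adjust them to your setup.
+
+```bash
+# Hypothetical OCR request against the dots.ocr service launched as described above
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "model",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {"type": "image_url", "image_url": {"url": "file:/llm/models/test/1.jpg"}},
+          {"type": "text", "text": "Extract all text from this image."}
+        ]
+      }
+    ],
+    "max_tokens": 512
+  }'
+```
+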
 ---
@@ -2678,11 +2669,24 @@ python3 -m vllm.entrypoints.openai.api_server \
   --distributed-executor-backend ray
 ```
 
----
+
 At this point, multi-node distributed inference with **PP + TP** is running, coordinated by **Ray** across Node-1 and Node-2.
 
 ---
+
+### 2.10 BPE-Qwen Tokenizer
+
+We have integrated the **bpe-qwen tokenizer** to accelerate tokenization for Qwen models.
+
+To enable it when launching the API server, add:
+
+```bash
+--tokenizer-mode bpe_qwen
+```
+
+---
+
 ## 3. Supported Models
 
 | Model Name | Category | Notes |