
Releases: intel/llm-scaler

llm-scaler-vllm beta release 0.2.0-b2

25 Jul 07:05
5a58f3f


Pre-release

Highlights

Resources

What’s new

  • llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
    • Int4/FP8 online quantization
    • Support for embedding and rerank models
    • Enhanced multi-modal model support (see the sketch after this list)
    • Performance improvements
    • Automatic detection of the maximum supported context length
    • Data parallelism support
    • Fixed a performance degradation issue
    • Fixed a multi-modal OOM issue
    • Fixed incorrect output from MiniCPM
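
As a quick illustration of the multi-modal support, the sketch below sends an image-plus-text request to the server's OpenAI-compatible chat completions endpoint. The server address, port, image URL, and model name are assumptions here; substitute the model you actually deployed (for example Qwen2.5-VL).

```python
# Minimal sketch of a multi-modal request against the OpenAI-compatible
# /v1/chat/completions endpoint exposed by the vLLM server.
# The host/port, image URL, and model name are assumptions; adjust to your deployment.
import requests

BASE_URL = "http://localhost:8000/v1"          # assumed server address
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"          # assumed deployed model

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```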

llm-scaler-vllm beta release 0.2.0-b1

11 Jul 02:31
fa68e67


Pre-release

Highlights

Resources

What’s new

  • llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:

    • Support for encoder models such as BGE-M3
    • Added Embedding and Rerank interfaces for enhanced downstream capabilities (see the sketch after this list)
    • Integrated Qwen2.5-VL with FP8/FP16 support for multi-modal generation
    • Automatic detection of maximum supported sequence length when a large max-context-length is specified
    • Added support for Qwen3 series models, including our fix for Qwen3's RMSNorm
    • Broader multi-modal model support
    • Data parallelism with verified near-linear scaling
    • Symmetric int4 online quantization
    • FP8 online quantization on CPU
    • Communication support for both SHM (shared memory) and P2P (peer-to-peer) modes
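
To show how the new Embedding and Rerank interfaces might be exercised, the sketch below issues an embeddings request and a rerank request against a BGE-M3 deployment. The server address, endpoint paths, and payload fields follow upstream vLLM's OpenAI- and Jina-compatible conventions and are assumptions here; check the documentation shipped with this release for the authoritative interface.

```python
# Minimal sketch of the embedding and rerank interfaces, assuming an
# OpenAI-compatible vLLM server running BGE-M3 at localhost:8000.
# Endpoint paths and payload fields follow upstream vLLM conventions and
# are assumptions; verify against this release's documentation.
import requests

BASE_URL = "http://localhost:8000"   # assumed server address
MODEL = "BAAI/bge-m3"                # assumed deployed encoder model

# Embeddings: OpenAI-style /v1/embeddings request.
emb = requests.post(
    f"{BASE_URL}/v1/embeddings",
    json={"model": MODEL, "input": ["What is BMG?", "Intel discrete GPU"]},
    timeout=60,
)
emb.raise_for_status()
vectors = [item["embedding"] for item in emb.json()["data"]]
print(f"got {len(vectors)} embeddings of dim {len(vectors[0])}")

# Rerank: score candidate documents against a query (Jina-style /v1/rerank).
rr = requests.post(
    f"{BASE_URL}/v1/rerank",
    json={
        "model": MODEL,
        "query": "What is BMG?",
        "documents": ["BMG is an Intel GPU.", "BGE-M3 is an embedding model."],
    },
    timeout=60,
)
rr.raise_for_status()
for result in rr.json()["results"]:
    print(result["index"], result["relevance_score"])
```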

Verified Features

  • Encoder and multi-modal models verified, including BGE-M3, Qwen2.5-VL, and Qwen3
  • Data parallelism tested with near-linear scaling across multiple GPUs
  • Verified FP8 and sym-int4 online quantization, including FP8 on CPU
  • Validated Qwen3 RMSNorm fix in both encoder and decoder paths
  • SHM and P2P support verified independently; automatic detection of SHM or P2P mode also confirmed

llm-scaler-vllm pre-production release 0.2.0

04 Jul 02:25


Highlights

Resources

What’s new

  • oneCCL reduces its buffer size; the official oneCCL release is now published on GitHub.
  • GQA kernel brings up to 30% performance improvement for models.
  • Bug fixes for OOM issues exposed by stress testing (more tests are ongoing).
  • Support for 70B FP8 with TP4 in offline mode (see the sketch after this list).
  • DeepSeek-V2-Lite accuracy fix.
  • Other bug fixes.
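
The offline 70B FP8 TP4 configuration can be sketched with vLLM's offline Python API as below. The model path, the use of `quantization="fp8"` for online FP8 quantization, and the assumption of four available GPUs are illustrative; confirm the exact options supported by this llm-scaler-vllm build.

```python
# Minimal sketch of offline inference with FP8 online quantization and
# tensor parallelism across 4 GPUs, using vLLM's offline LLM API.
# The model path and the use of quantization="fp8" for online quantization
# are assumptions; check the options supported by this llm-scaler-vllm build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # assumed 70B model path
    quantization="fp8",                                  # online FP8 quantization
    tensor_parallel_size=4,                              # TP4 across four GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```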

Verified Features

  • Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems; all KPI models now meet their goals. Added FP8 performance of the DS-Distilled-LLaMA 70B model measured on 4x BMG with TP4 in offline mode.
  • FP8 functionality tested at 32K/8K (ISL/OSL) with the DS-Distilled-Qwen-32B model on 4x BMG with TP4.
  • Verified the model list for FP8 functionality.