
Releases: intel/llm-scaler

llm-scaler-vllm beta release 0.2.0-b2

25 Jul 07:05
5a58f3f


Pre-release

Highlights

Resources

What’s new

  • llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
    • Int4/FP8 online quantization
    • Support for embedding and rerank models
    • Enhanced multi-modal model support (see the sketch after this list)
    • Performance improvements
    • Automatic detection of the maximum supported context length
    • Data parallelism support
    • Fixed a performance degradation issue
    • Fixed a multi-modal OOM issue
    • Fixed incorrect output from MiniCPM
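
As a quick illustration of the multi-modal support, the sketch below sends an image-plus-text request to the server's OpenAI-compatible chat completions endpoint. The server address, port, image URL, and model name are assumptions here; substitute the model you actually deployed (for example Qwen2.5-VL).

```python
# Minimal sketch of a multi-modal request against the OpenAI-compatible
# /v1/chat/completions endpoint exposed by the vLLM server.
# The host/port, image URL, and model name are assumptions; adjust to your deployment.
import requests

BASE_URL = "http://localhost:8000/v1"          # assumed server address
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"          # assumed deployed model

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```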

llm-scaler-vllm beta release 0.2.0-b1

11 Jul 02:31
fa68e67


Pre-release

Highlights

Resources

What’s new

  • llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:

    • Support for encoder models such as BGE-M3
    • Added Embedding and Rerank interfaces for enhanced downstream capabilities (see the sketch after this list)
    • Integrated Qwen2.5-VL with FP8/FP16 support for multi-modal generation
    • Automatic detection of maximum supported sequence length when a large max-context-length is specified
    • Added support for Qwen3 series models, including our fix for Qwen3's RMSNorm
    • Broader multi-modal model support
    • Data parallelism with verified near-linear scaling
    • Symmetric int4 online quantization
    • FP8 online quantization on CPU
    • Communication support for both SHM (shared memory) and P2P (peer-to-peer) modes
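
To show how the new Embedding and Rerank interfaces might be exercised, the sketch below issues an embeddings request and a rerank request against a BGE-M3 deployment. The server address, endpoint paths, and payload fields follow upstream vLLM's OpenAI- and Jina-compatible conventions and are assumptions here; check the documentation shipped with this release for the authoritative interface.

```python
# Minimal sketch of the embedding and rerank interfaces, assuming an
# OpenAI-compatible vLLM server running BGE-M3 at localhost:8000.
# Endpoint paths and payload fields follow upstream vLLM conventions and
# are assumptions; verify against this release's documentation.
import requests

BASE_URL = "http://localhost:8000"   # assumed server address
MODEL = "BAAI/bge-m3"                # assumed deployed encoder model

# Embeddings: OpenAI-style /v1/embeddings request.
emb = requests.post(
    f"{BASE_URL}/v1/embeddings",
    json={"model": MODEL, "input": ["What is BMG?", "Intel discrete GPU"]},
    timeout=60,
)
emb.raise_for_status()
vectors = [item["embedding"] for item in emb.json()["data"]]
print(f"got {len(vectors)} embeddings of dim {len(vectors[0])}")

# Rerank: score candidate documents against a query (Jina-style /v1/rerank).
rr = requests.post(
    f"{BASE_URL}/v1/rerank",
    json={
        "model": MODEL,
        "query": "What is BMG?",
        "documents": ["BMG is an Intel GPU.", "BGE-M3 is an embedding model."],
    },
    timeout=60,
)
rr.raise_for_status()
for result in rr.json()["results"]:
    print(result["index"], result["relevance_score"])
```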

Verified Features

  • Encoder and multi-modal models verified, including BGE-M3, Qwen2.5-VL, and Qwen3
  • Data parallelism tested with near-linear scaling across multiple GPUs
  • Verified FP8 and sym-int4 online quantization, including FP8 on CPU
  • Validated Qwen3 RMSNorm fix in both encoder and decoder paths
  • SHM and P2P support verified independently; automatic detection of SHM or P2P mode also confirmed

llm-scaler-vllm pre-production release 0.2.0

04 Jul 02:25


Highlights

Resources

What’s new

  • oneCCL reduces its buffer size; the official oneCCL release is now published on GitHub.
  • GQA kernel brings up to 30% performance improvement for models.
  • Bug fixes for OOM issues exposed by stress testing (more tests are ongoing).
  • Support for 70B FP8 with TP4 in offline mode (see the sketch after this list).
  • DeepSeek-V2-Lite accuracy fix.
  • Other bug fixes.
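
The offline 70B FP8 TP4 configuration can be sketched with vLLM's offline Python API as below. The model path, the use of `quantization="fp8"` for online FP8 quantization, and the assumption of four available GPUs are illustrative; confirm the exact options supported by this llm-scaler-vllm build.

```python
# Minimal sketch of offline inference with FP8 online quantization and
# tensor parallelism across 4 GPUs, using vLLM's offline LLM API.
# The model path and the use of quantization="fp8" for online quantization
# are assumptions; check the options supported by this llm-scaler-vllm build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # assumed 70B model path
    quantization="fp8",                                  # online FP8 quantization
    tensor_parallel_size=4,                              # TP4 across four GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```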

Verified Features

  • Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems; all KPI models now meet their goals. Added FP8 performance of the DS-Distilled-LLaMA 70B model measured on 4x BMG with TP4 in offline mode.
  • FP8 functionality tested at 32K/8K (ISL/OSL) with the DS-Distilled-Qwen-32B model on 4x BMG with TP4.
  • Verified the model list for FP8 functionality.