
Conversation

sfc-gh-goliaro commented Nov 28, 2025

Purpose

Enable automated FlashInfer trace collection and optimized kernel deployment via `flashinfer-bench apply`. This allows users to:

  1. Run vLLM inference with full FlashInfer backend
  2. Capture operator traces using FlashInfer-Bench
  3. Deploy custom optimized kernels back to vLLM

Changes

1. Full FlashInfer Backend (VLLM_USE_FLASHINFER=1)

Extended FlashInfer integration beyond attention to all supported operators:

| Operator | Status |
| --- | --- |
| Attention | ✅ Working |
| Top-k/Top-p Sampling | ✅ Working |
| RMSNorm | ✅ Working |
| Activations (SiLU, GELU) | ✅ Working |
| MoE FP16/BF16 | ✅ Working |
| All2All | ✅ Working |
| AllReduce Fusion | ✅ Working (with patch) |
| MoE FP8 | ⚠️ Requires CUDA 12.7+ |
| MoE MXFP4 | ⚠️ SM100 only |
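
As a quick sanity check that the full backend is active, a minimal offline-inference sketch along the following lines exercises the attention, sampling, RMSNorm, and activation paths (the model, prompt, and TP size below are illustrative and not part of this PR):

```python
import os

# Set the flag before importing vLLM so the FlashInfer operators are picked up
# when the engine is constructed.
os.environ["VLLM_USE_FLASHINFER"] = "1"

from vllm import LLM, SamplingParams

# Illustrative choices: one of the tested models, and TP >= 2 so the AllReduce
# fusion path is also exercised (requires at least 2 GPUs).
llm = LLM(model="Qwen/Qwen3-30B-A3B", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, top_p=0.95, top_k=20, max_tokens=64)

outputs = llm.generate(["Explain FlashInfer in one sentence."], params)
print(outputs[0].outputs[0].text)
```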

2. Test & Trace Scripts

3. Tested Models

  • Qwen3-30B-A3B MoE (BF16)
  • Llama-3.1-70B (BF16)
  • GPT-OSS-120B (MXFP4)

4. AllReduce Fusion Fix

Fixed the `std::optional` → `cuda::std::optional` bug in FlashInfer's `trtllm_allreduce_fusion.cuh`. The fix is documented in `docs/source/design/flashinfer_integration_issues.md`, together with a patch that can be applied to the installed FlashInfer package.
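
Until the fix lands upstream, a small hypothetical helper like the one below (not part of this PR) can locate the shipped header inside the installed `flashinfer` package so the documented patch can be applied to it:

```python
# Hypothetical helper (not part of this PR): find the installed copy of
# trtllm_allreduce_fusion.cuh so the patch described in
# docs/source/design/flashinfer_integration_issues.md can be applied to it.
import importlib.util
import pathlib

spec = importlib.util.find_spec("flashinfer")
pkg_root = pathlib.Path(spec.origin).parent
for header in sorted(pkg_root.rglob("trtllm_allreduce_fusion.cuh")):
    print(header)
```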

Usage

```bash
# Run with full FlashInfer backend
VLLM_USE_FLASHINFER=1 python -m vllm.entrypoints.openai.api_server --model <model>

# Test FlashInfer integration (VLLM_USE_FLASHINFER=1 set automatically)
python tests/kernels/run_flashinfer_test.py --model all

# Test with tensor parallelism (AllReduce auto-enabled for TP >= 2)
python tests/kernels/run_flashinfer_test.py --model qwen --tp 2

# Generate traces for FlashInfer-Bench optimization
pip install flashinfer-bench --no-deps
python tests/kernels/generate_flashinfer_traces.py --model qwen
flashinfer-bench run --local ~/.cache/flashinfer_bench/vllm_traces
```
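
Once the OpenAI-compatible server from the first command is up, any standard client can drive traffic through the FlashInfer paths; for example (default host/port assumed, and `<model>` must match the server's `--model` argument):

```python
# Minimal client sketch against the server started above; assumes the default
# localhost:8000 endpoint. Substitute the served model name for <model>.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="<model>",
    prompt="Summarize the benefits of paged attention.",
    max_tokens=64,
    temperature=0.7,
)
print(resp.choices[0].text)
```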

TODOs

See docs/source/design/flashinfer_integration_issues.md for detailed issue analysis.

FlashInfer Team

  • Fix `std::optional` → `cuda::std::optional` in `trtllm_allreduce_fusion.cuh` (patch available)
  • Implement MXFP4 MoE kernels for SM90 (Hopper)
  • Rebuild wheels with CUDA 12.7+ for FP8 MoE

vLLM Integration

  • Add LoRA kernels (Punica BGMV/SGMV) when available
  • Add embedding lookup kernels when available
  • Add KV cache block copy/swap kernels when available

Test Plan

```bash
# Basic FlashInfer test
python tests/kernels/run_flashinfer_test.py --model qwen

# Test with tensor parallelism and AllReduce fusion
python tests/kernels/run_flashinfer_test.py --model qwen --tp 2

# Test all models
python tests/kernels/run_flashinfer_test.py --model all
```

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


mergify bot commented Nov 28, 2025

Documentation preview: https://vllm--29695.org.readthedocs.build/en/29695/

mergify bot added the documentation and nvidia labels Nov 28, 2025
sfc-gh-goliaro changed the title from "Flashinfer bench integration" to "FlashInfer-Bench Integration for vLLM" Nov 28, 2025
mergify bot added the v1 label Nov 29, 2025