
Conversation

sfc-gh-goliaro commented Nov 28, 2025

Purpose

Enable automated FlashInfer trace collection and optimized kernel deployment via `flashinfer-bench apply`. This allows users to:

  1. Run vLLM inference with full FlashInfer backend
  2. Capture operator traces using FlashInfer-Bench
  3. Deploy custom optimized kernels back to vLLM

Changes

1. Full FlashInfer Backend (VLLM_USE_FLASHINFER=1)

Extended FlashInfer integration beyond attention to all supported operators:

| Operator | Status |
| --- | --- |
| Attention | ✅ Working |
| Top-k/Top-p Sampling | ✅ Working |
| RMSNorm | ✅ Working |
| Activations (SiLU, GELU) | ✅ Working |
| MoE FP16/BF16 | ✅ Working |
| All2All | ✅ Working |
| AllReduce Fusion | ✅ Working (with patch) |
| MoE FP8 | ⚠️ Requires CUDA 12.7+ |
| MoE MXFP4 | ⚠️ SM100 only |
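
As a quick sanity check that the full backend is active, a minimal offline-inference sketch along the following lines exercises the attention, sampling, RMSNorm, and activation paths (the model, prompt, and TP size below are illustrative and not part of this PR):

```python
import os

# Set the flag before importing vLLM so the FlashInfer operators are picked up
# when the engine is constructed.
os.environ["VLLM_USE_FLASHINFER"] = "1"

from vllm import LLM, SamplingParams

# Illustrative choices: one of the tested models, and TP >= 2 so the AllReduce
# fusion path is also exercised (requires at least 2 GPUs).
llm = LLM(model="Qwen/Qwen3-30B-A3B", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, top_p=0.95, top_k=20, max_tokens=64)

outputs = llm.generate(["Explain FlashInfer in one sentence."], params)
print(outputs[0].outputs[0].text)
```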

2. Test & Trace Scripts

3. Tested Models

  • Qwen3-30B-A3B MoE (BF16)
  • Llama-3.1-70B (BF16)
  • GPT-OSS-120B (MXFP4)

4. AllReduce Fusion Fix

Fixed the `std::optional` → `cuda::std::optional` bug in FlashInfer's `trtllm_allreduce_fusion.cuh`. The fix is documented in `docs/source/design/flashinfer_integration_issues.md`, together with a patch that can be applied to the installed FlashInfer package.
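
Until the fix lands upstream, a small hypothetical helper like the one below (not part of this PR) can locate the shipped header inside the installed `flashinfer` package so the documented patch can be applied to it:

```python
# Hypothetical helper (not part of this PR): find the installed copy of
# trtllm_allreduce_fusion.cuh so the patch described in
# docs/source/design/flashinfer_integration_issues.md can be applied to it.
import importlib.util
import pathlib

spec = importlib.util.find_spec("flashinfer")
pkg_root = pathlib.Path(spec.origin).parent
for header in sorted(pkg_root.rglob("trtllm_allreduce_fusion.cuh")):
    print(header)
```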

Usage

```bash
# Run with full FlashInfer backend
VLLM_USE_FLASHINFER=1 python -m vllm.entrypoints.openai.api_server --model <model>

# Test FlashInfer integration (VLLM_USE_FLASHINFER=1 set automatically)
python tests/kernels/run_flashinfer_test.py --model all

# Test with tensor parallelism (AllReduce auto-enabled for TP >= 2)
python tests/kernels/run_flashinfer_test.py --model qwen --tp 2

# Generate traces for FlashInfer-Bench optimization
pip install flashinfer-bench --no-deps
python tests/kernels/generate_flashinfer_traces.py --model qwen
flashinfer-bench run --local ~/.cache/flashinfer_bench/vllm_traces
```
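
Once the OpenAI-compatible server from the first command is up, any standard client can drive traffic through the FlashInfer paths; for example (default host/port assumed, and `<model>` must match the server's `--model` argument):

```python
# Minimal client sketch against the server started above; assumes the default
# localhost:8000 endpoint. Substitute the served model name for <model>.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="<model>",
    prompt="Summarize the benefits of paged attention.",
    max_tokens=64,
    temperature=0.7,
)
print(resp.choices[0].text)
```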

TODOs

See docs/source/design/flashinfer_integration_issues.md for detailed issue analysis.

FlashInfer Team

  • Fix `std::optional` → `cuda::std::optional` in `trtllm_allreduce_fusion.cuh` (patch available)
  • Implement MXFP4 MoE kernels for SM90 (Hopper)
  • Rebuild wheels with CUDA 12.7+ for FP8 MoE

vLLM Integration

  • Add LoRA kernels (Punica BGMV/SGMV) when available
  • Add embedding lookup kernels when available
  • Add KV cache block copy/swap kernels when available

Test Plan

```bash
# Basic FlashInfer test
python tests/kernels/run_flashinfer_test.py --model qwen

# Test with tensor parallelism and AllReduce fusion
python tests/kernels/run_flashinfer_test.py --model qwen --tp 2

# Test all models
python tests/kernels/run_flashinfer_test.py --model all
```

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


mergify bot commented Nov 28, 2025

Documentation preview: https://vllm--29695.org.readthedocs.build/en/29695/

mergify bot added the documentation and nvidia labels Nov 28, 2025
sfc-gh-goliaro changed the title from "Flashinfer bench integration" to "FlashInfer-Bench Integration for vLLM" Nov 28, 2025
mergify bot added the v1 label Nov 29, 2025