
Conversation

@staugust (Contributor) commented Nov 17, 2025

FlashInfer attention uses 2 as the base of its LSE (log-sum-exp) instead of e; see https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/attention/mla.cuh#L400.

Purpose

Correct the attention output with the proper scaling factor when using context parallelism.
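
For reference, a minimal sketch of the factor involved (plain PyTorch, not the PR's Triton kernel; the function name is illustrative): since 2^x = e^(x·ln 2), a base-2 LSE converts to base e by a single multiplication.

```python
import math

import torch

# Minimal sketch, not the PR's kernel: FlashInfer returns its
# log-sum-exp in base 2, while merge code written with exp/log
# expects base e. Because 2**x == e**(x * ln 2), rescaling the
# LSE by ln(2) converts it to natural log.
def lse_base2_to_base_e(lse_base2: torch.Tensor) -> torch.Tensor:
    return lse_base2 * math.log(2)
```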

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses the issue of FlashInfer using base 2 for its log-sum-exp calculations by introducing a new Triton kernel and a parameter to switch between base e and base 2 computations. The changes are logically sound and correctly applied where FlashInfer is used. My main feedback is on the implementation detail of adding a new Triton kernel, which introduces significant code duplication that could be avoided for better maintainability.
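
For illustration, a minimal sketch of that pattern (hypothetical kernel and parameter names, echoing the PR's is_lse_base_on_e flag; not the PR's actual kernel): the base switch can be a tl.constexpr, so the base-e and base-2 paths compile to separate specializations with no runtime branch.

```python
import triton
import triton.language as tl

# Hypothetical sketch of switching the exponent base at compile time.
@triton.jit
def rescale_lse_kernel(lse_ptr, out_ptr, n_elements,
                       IS_LSE_BASE_ON_E: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    lse = tl.load(lse_ptr + offs, mask=mask)
    if IS_LSE_BASE_ON_E:
        out = lse  # already natural log; pass through unchanged
    else:
        out = lse * 0.6931471805599453  # ln(2): convert base-2 LSE to base e
    tl.store(out_ptr + offs, out, mask=mask)
```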

@chatgpt-codex-connector commented

💡 Codex Review

```python
output_context, lse_context = cp_lse_ag_out_rs(
    output_context_tmp,
    lse_context_tmp,
    get_dcp_group(),
    return_lse=True,
    is_lse_base_on_e=False,
)
lse_context = lse_context.transpose(0, 1).contiguous()
output_query, lse_query = self._new_tokens.run(
    prefill_query,
    key,
    value,
    return_lse=True,
)
lse_query = lse_query.transpose(0, 1).contiguous()
merge_attn_states(
    out,
    output_context,
    lse_context,
    output_query,
    lse_query,
)
```
P1: Convert FlashInfer LSEs to natural log before merging

When cp_lse_ag_out_rs is called with is_lse_base_on_e=False the new kernel returns log-sum-exp values in base‑2. Immediately after, these LSEs are passed to merge_attn_states, whose implementation assumes natural logarithms (tl.exp/tl.log). Without converting the base‑2 LSEs (e.g., multiply by math.log(2) for both lse_context and the lse_query returned from _new_tokens.run), the scaling factors inside merge_attn_states are computed against the wrong exponent base, producing incorrect weighting when combining context and query attention states under context parallelism.
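
A minimal sketch of the suggested correction, reusing the names from the snippet above (illustrative only, not a verified patch):

```python
import math

# Rescale both base-2 LSEs to natural log so that the exp/log
# arithmetic inside merge_attn_states sees the base it assumes.
lse_context = lse_context * math.log(2)
lse_query = lse_query * math.log(2)
```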


@staugust (Contributor, Author) commented Nov 19, 2025

@pavanimajety Would you like to take a look at this issue? I'm wondering which repo this should be fixed in, flashinfer or vllm.

mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @staugust.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 19, 2025
@heheda12345 (Collaborator) commented

CC @pavanimajety

@LucasWilkinson (Collaborator) left a comment

LGTM (@pavanimajety should take a look too, though, since I'm not as familiar with when FlashInfer uses base 2)

@staugust (Contributor, Author) commented

@LucasWilkinson @heheda12345 @pavanimajety From state.cuh:45, we can see that FlashInfer uses 2 as the base for all LSE computation.
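
For context, a sketch of that merge recurrence parameterized on the base (assumed shapes; not FlashInfer's code): the merged output is identical for any base b, as long as both input LSEs and the merge arithmetic agree on it, which is exactly what breaks when a base-2 LSE meets a base-e merge kernel.

```python
import math

import torch

# Merge two partial attention states whose LSEs share base b,
# where log_base = math.log(b): math.log(2) for FlashInfer,
# 1.0 for base-e backends. Shapes (assumed): o* is
# [tokens, heads, head_dim], lse* is [tokens, heads].
def merge_states(o1, lse1, o2, lse2, log_base: float):
    m = torch.maximum(lse1, lse2)
    w1 = torch.exp((lse1 - m) * log_base)  # b ** (lse1 - m)
    w2 = torch.exp((lse2 - m) * log_base)
    out = (w1.unsqueeze(-1) * o1 + w2.unsqueeze(-1) * o2) / (w1 + w2).unsqueeze(-1)
    lse = m + torch.log(w1 + w2) / log_base  # merged LSE, still in base b
    return out, lse
```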

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
LucasWilkinson added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 28, 2025
DarkLight1337 merged commit 9726e64 into vllm-project:main Nov 28, 2025
50 checks passed
github-project-automation bot moved this from In review to Done in NVIDIA Nov 28, 2025
@hl475 (Contributor) commented Nov 29, 2025

@staugust can you please take a look at the failures in https://buildkite.com/vllm/ci/builds/41171/steps/table?jid=019ace6a-57d7-4bc7-a3aa-c99174395dbd and https://buildkite.com/vllm/ci/builds/41171/steps/table?jid=019ace6a-57db-4e0b-9528-e04a0af07b6a, e.g.


```
(Worker_TP0_DCP0 pid=3384) ERROR 11-29 00:52:08 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 2064, in forward
(Worker_TP1_DCP1 pid=3385) ERROR 11-29 00:52:08 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_DCP0 pid=3384) ERROR 11-29 00:52:08 [multiproc_executor.py:822]     is_lse_base_on_e=not self._use_fi_prefill,
(Worker_TP1_DCP1 pid=3385) ERROR 11-29 00:52:08 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(Worker_TP0_DCP0 pid=3384) ERROR 11-29 00:52:08 [multiproc_executor.py:822]                          ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1_DCP1 pid=3385) ERROR 11-29 00:52:08 [multiproc_executor.py:822]     raise e
(Worker_TP0_DCP0 pid=3384) ERROR 11-29 00:52:08 [multiproc_executor.py:822] AttributeError: 'CutlassMLAImpl' object has no attribute '_use_fi_prefill'
```


Do you think it is relevant? Thanks!

@hl475 (Contributor) commented Nov 29, 2025

Trying to do a forward fix in #29734

@staugust (Contributor, Author) commented

@hl475 It's relevant. Thank you very much for fixing the attribute error.

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>