bugfix: correct attn output with base 2 or e #28840
Conversation
Code Review
This pull request correctly addresses the issue of FlashInfer using base 2 for its log-sum-exp calculations by introducing a new Triton kernel and a parameter to switch between base e and base 2 computations. The changes are logically sound and correctly applied where FlashInfer is used. My main feedback is on the implementation detail of adding a new Triton kernel, which introduces significant code duplication that could be avoided for better maintainability.
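For readers outside the diff, here is a minimal reference of the merge math the review is describing, written in plain PyTorch rather than Triton; `merge_attn_states_ref` and the `lse_base2` flag are illustrative names and are not the PR's actual kernel or its signature.

```python
import math

import torch

LOG2_E = math.log2(math.e)  # log2(e), used to convert between base-2 and base-e LSE


def merge_attn_states_ref(
    out_a: torch.Tensor,  # [num_tokens, num_heads, head_dim]
    lse_a: torch.Tensor,  # [num_tokens, num_heads]
    out_b: torch.Tensor,
    lse_b: torch.Tensor,
    lse_base2: bool = False,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Merge two partial attention states into one.

    If lse_base2 is True, the incoming LSEs are log2(sum(exp(scores)))
    (FlashInfer's convention); otherwise they use the natural log.
    """
    if lse_base2:
        # Convert base-2 LSE to base e so a single merge path handles both cases.
        lse_a = lse_a / LOG2_E
        lse_b = lse_b / LOG2_E
    lse = torch.logaddexp(lse_a, lse_b)         # merged base-e LSE
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)  # renormalization weight for out_a
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)  # renormalization weight for out_b
    out = w_a * out_a + w_b * out_b
    if lse_base2:
        lse = lse * LOG2_E                      # hand the LSE back in the caller's base
    return out, lse
```

Folding the base into a single parameterized path like this is one way to avoid the kernel duplication the review calls out.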
💡 Codex Review: vllm/vllm/v1/attention/backends/flashinfer.py, lines 252 to 270 in c4a2b74
@pavanimajety Would you like to take a look at this issue? I'm wondering which repo this should be fixed in, flashinfer or vllm.
This pull request has merge conflicts that must be resolved before it can be merged.
LucasWilkinson
left a comment
LGTM (@pavanimajety should take a look too, though, since I'm not as familiar with when FlashInfer is base 2)
@LucasWilkinson @heheda12345 @pavanimajety From state.cuh:45, we can see that FlashInfer uses 2 as the base for all LSE computation.
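For anyone who wants to convince themselves of the relationship, here is a quick sanity check in plain PyTorch (not FlashInfer code): a base-2 LSE differs from the base-e LSE only by a constant factor of ln(2).

```python
import math

import torch

# If a kernel returns lse2 = log2(sum(exp(scores))), the natural-base LSE is
# simply lse2 * ln(2); mixing the two bases skews the merge weights.
scores = torch.randn(8, 128)
lse_e = torch.logsumexp(scores, dim=-1)        # base-e LSE
lse_2 = torch.log2(torch.exp(scores).sum(-1))  # base-2 LSE (FlashInfer-style)
assert torch.allclose(lse_e, lse_2 * math.log(2.0), atol=1e-4)
```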
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
@staugust Can you please take a look at the failures in https://buildkite.com/vllm/ci/builds/41171/steps/table?jid=019ace6a-57d7-4bc7-a3aa-c99174395dbd and https://buildkite.com/vllm/ci/builds/41171/steps/table?jid=019ace6a-57db-4e0b-9528-e04a0af07b6a? Do you think they are relevant? Thanks!
Trying to do a forward fix in #29734
@hl475 It's relevant. Thank you very much for fixing the 'no attribute' error.
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
FlashInfer attention uses 2 as the base of the LSE instead of e; see https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/attention/mla.cuh#L400
Purpose
Correct the attention output with the proper rescaling factor when using context parallel, since FlashInfer returns the LSE in base 2 rather than base e (see the sketch below).
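As a sketch of what "proper factor" means here (the helper name and assumed shapes are illustrative, not the PR's actual code): when context-parallel ranks each produce a partial output plus a base-2 LSE, the merge weights have to be computed with exp2, not exp.

```python
import torch


def merge_cp_partials_base2(outs: list[torch.Tensor],
                            lses: list[torch.Tensor]) -> torch.Tensor:
    """Merge per-rank partial attention outputs whose LSEs are base 2.

    outs: per-rank [num_tokens, num_heads, head_dim] partial outputs
    lses: per-rank [num_tokens, num_heads] base-2 log-sum-exps
    """
    out = torch.stack(outs)  # [cp_size, tokens, heads, dim]
    lse = torch.stack(lses)  # [cp_size, tokens, heads]
    lse_max = lse.max(dim=0, keepdim=True).values
    # Combined base-2 LSE: log2(sum_i 2**lse_i), computed stably.
    lse_merged = lse_max + torch.log2(torch.exp2(lse - lse_max).sum(0, keepdim=True))
    # The rescaling factor must be 2**(lse_i - lse_merged); applying exp() as if
    # the LSE were base e is the kind of mismatch this PR addresses.
    weights = torch.exp2(lse - lse_merged).unsqueeze(-1)
    return (weights * out).sum(dim=0)
```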
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.