enable sm103 moe dsl backend #2149
base: main
Conversation
Walkthrough

This change extends SM architecture support for BlockScaledPersistentDenseGemmKernel to include sm_103 with a temporary shared memory capacity workaround, updates the CUTLASS DSL dependency version constraint, and refactors device capability checks in tests to use a supported-versions list approach.
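For context, the supported-versions gating in the tests looks roughly like the sketch below (helper and constant names are illustrative assumptions, not necessarily those used in the PR):

```python
# Sketch of the supported-versions list approach described above;
# names here are assumptions, not taken from the PR.
import pytest
import torch

SUPPORTED_DEVICE_VERS = [(10, 0), (10, 3)]  # sm_100 and sm_103

def skip_if_unsupported_device() -> None:
    device_ver = torch.cuda.get_device_capability()
    if device_ver not in SUPPORTED_DEVICE_VERS:
        pytest.skip(f"Unsupported compute capability: {device_ver}")
```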
Summary of Changes

Hello @aleozlx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the FlashInfer library by enabling support for the SM103 GPU architecture within its MoE DSL backend. This change broadens the range of hardware compatible with the library's optimized operations, ensuring that users with newer NVIDIA GPUs can leverage the performance benefits. The update involves adjusting internal version checks, updating a key dependency, and modifying test configurations to validate the new support.
/bot run
Code Review
This pull request enables the sm103 moe dsl backend and updates the nvidia-cutlass-dsl dependency. The changes look good, especially the updates to the supported SM versions in the kernel and the tests. I have one suggestion regarding a temporary hack to improve its robustness and maintainability.
```python
# HACK "sm_103" doesn't work yet for the query
# https://github.com/NVIDIA/cutlass/blob/5016493cc0d8650d5b2f6d2c2751cf49bc217e86/python/CuTeDSL/cutlass/utils/smem_allocator.py#L19
# self.smem_capacity = utils.get_smem_capacity_in_bytes(sm_version)
self.smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
```
While this hack works for the currently supported SM versions, it's a bit fragile. Hardcoding "sm_100" will be incorrect if another SM version is added in the future that is supported by get_smem_capacity_in_bytes. A more robust approach would be to only apply the fallback for sm_103. This also makes the intent clearer and the hack easier to remove. I'd also recommend adding a TODO to track this technical debt.
```diff
- # HACK "sm_103" doesn't work yet for the query
- # https://github.com/NVIDIA/cutlass/blob/5016493cc0d8650d5b2f6d2c2751cf49bc217e86/python/CuTeDSL/cutlass/utils/smem_allocator.py#L19
- # self.smem_capacity = utils.get_smem_capacity_in_bytes(sm_version)
- self.smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
+ # TODO: Remove this workaround once nvidia-cutlass-dsl supports sm_103 for smem capacity queries.
+ # HACK: "sm_103" is not yet supported by get_smem_capacity_in_bytes. Using "sm_100" as a fallback.
+ # See: https://github.com/NVIDIA/cutlass/blob/5016493cc0d8650d5b2f6d2c2751cf49bc217e86/python/CuTeDSL/cutlass/utils/smem_allocator.py#L19
+ smem_query_version = "sm_100" if sm_version == "sm_103" else sm_version
+ self.smem_capacity = utils.get_smem_capacity_in_bytes(smem_query_version)
```
This get_smem_capacity_in_bytes issue has been reported to the CUTLASS team internally; if there's a quick turnaround I'll just patch it later. Also, since this is just one kernel: if it works, it works. This is not wrong.
I have no problem with this at the moment, considering sm_100 and sm_103 should have the same shared memory size.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/gemm/test_cute_dsl_blockscaled_gemm.py (1)
83-88: Device capability gating correctly adds sm_103; consider de-duplicating the source of truth

The new device_ver gating to [(10, 0), (10, 3)] matches the backend's sm_100/sm_103 support and will correctly skip tests elsewhere. The only minor concern is that the list of supported SMs is now duplicated here and in Sm100BlockScaledPersistentDenseGemmKernel; if you expect to add more SMs (e.g., future Blackwell variants), consider exposing a small shared constant or helper so tests and the kernel derive their supported set from the same place to avoid drift.

flashinfer/cute_dsl/blockscaled_gemm.py (2)
532-535: sm_103 addition to supported SM list is consistent with grouped_gemm_nt_masked

Allowing both "sm_100" and "sm_103" here lines up with sm_version = f"sm_{major}{minor}" in grouped_gemm_nt_masked, so SM103 now passes the constructor guard as intended. If you foresee adding more SM variants, you might want to promote supported_sm_versions to a class- or module-level constant and import/reuse it from tests, so the supported set stays in sync across code paths; a possible shape for that is sketched below.
566-570: SMEM capacity hack for sm_103 works but could be more future-proof

Hard-wiring self.smem_capacity to get_smem_capacity_in_bytes("sm_100") is a reasonable short-term workaround to get sm_103 online, but it bakes in two assumptions:

- That sm_103's effective SMEM capacity is ≥ sm_100 (otherwise _compute_stages might choose too many stages).
- That a future CUTLASS version will eventually support "sm_103" but this code will still be pointing at "sm_100" unless someone remembers to remove the hack.

To make this more robust, consider trying get_smem_capacity_in_bytes(sm_version) first and only falling back to "sm_100" when the query actually fails (and only for "sm_103"), with a clear TODO to remove once CUTLASS exposes the right value. Also worth double-checking on target hardware that the sm_100 capacity is indeed a safe bound for sm_103.
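A minimal sketch of that try-first fallback (the exact exception type raised for an unsupported SM string is an assumption):

```python
# Sketch only: query the real sm_version first and fall back to "sm_100"
# solely for the known-broken "sm_103" case.
try:
    self.smem_capacity = utils.get_smem_capacity_in_bytes(sm_version)
except Exception:  # exception type for unsupported SMs is an assumption
    if sm_version != "sm_103":
        raise
    # TODO: remove once CUTLASS supports sm_103 in get_smem_capacity_in_bytes.
    self.smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
```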
📒 Files selected for processing (3)
- flashinfer/cute_dsl/blockscaled_gemm.py (2 hunks)
- requirements.txt (1 hunk)
- tests/gemm/test_cute_dsl_blockscaled_gemm.py (1 hunk)
🧰 Additional context used
🧠 Learnings (1)

Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files: flashinfer/cute_dsl/blockscaled_gemm.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: build (cu130, amd64)
- GitHub Check: build (cu126, amd64)
- GitHub Check: build (cu129, arm64)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
requirements.txt (1)
7-7: CUTLASS DSL version bump aligns with enabling the SM103 backend

Raising the minimum nvidia-cutlass-dsl to >=4.3.1 is consistent with adding SM103 cute-dsl support. Please just confirm this version is available in your target environments/CI images and matches the minimum required by the new SM103 path.
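As one hypothetical way to verify this in CI (this check is not part of the PR):

```python
# Hypothetical sanity check that the installed nvidia-cutlass-dsl satisfies
# the new >=4.3.1 constraint before exercising the sm_103 path.
from importlib.metadata import version
from packaging.version import Version

assert Version(version("nvidia-cutlass-dsl")) >= Version("4.3.1"), (
    "nvidia-cutlass-dsl >= 4.3.1 is required for sm_103 support"
)
```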
UT GB300: pytest tests/gemm/test_cute_dsl_blockscaled_gemm.py (256 passed, 128 xfailed, 768 warnings)

/bot run

[FAILED] Pipeline #39324091: 13/20 passed
yzh119 left a comment
LGTM
📌 Description
Enable the sm103 MoE DSL backend.
Bumped dep version to nvidia-cutlass-dsl>=4.3.1.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running pip install pre-commit (or used my preferred method).
- I have installed the hooks with pre-commit install.
- I have run pre-commit run --all-files and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed, and all tests are passing (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features

- Extended block-scaled GEMM support to the SM103 GPU architecture (in addition to SM100), with a temporary shared-memory capacity workaround.

Chores

- Raised the minimum nvidia-cutlass-dsl dependency requirement to >=4.3.1.