enable sm103 moe dsl backend #2149
base: main
Conversation
Walkthrough

This change extends SM architecture support for BlockScaledPersistentDenseGemmKernel to include sm_103 with a temporary shared memory capacity workaround, updates the CUTLASS DSL dependency version constraint, and refactors device capability checks in tests to use a supported-versions list approach.
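For context, the supported-versions gating in the tests looks roughly like the sketch below (helper and constant names are illustrative assumptions, not necessarily those used in the PR):

```python
# Sketch of the supported-versions list approach described above;
# names here are assumptions, not taken from the PR.
import pytest
import torch

SUPPORTED_DEVICE_VERS = [(10, 0), (10, 3)]  # sm_100 and sm_103

def skip_if_unsupported_device() -> None:
    device_ver = torch.cuda.get_device_capability()
    if device_ver not in SUPPORTED_DEVICE_VERS:
        pytest.skip(f"Unsupported compute capability: {device_ver}")
```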
Summary of Changes

Hello @aleozlx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the FlashInfer library by enabling support for the SM103 GPU architecture within its MoE DSL backend. This change broadens the range of hardware compatible with the library's optimized operations, ensuring that users with newer NVIDIA GPUs can leverage the performance benefits. The update involves adjusting internal version checks, updating a key dependency, and modifying test configurations to validate the new support.
/bot run
Code Review
This pull request enables the sm103 moe dsl backend and updates the nvidia-cutlass-dsl dependency. The changes look good, especially the updates to the supported SM versions in the kernel and the tests. I have one suggestion regarding a temporary hack to improve its robustness and maintainability.
```python
# HACK "sm_103" doesn't work yet for the query
# https://github.com/NVIDIA/cutlass/blob/5016493cc0d8650d5b2f6d2c2751cf49bc217e86/python/CuTeDSL/cutlass/utils/smem_allocator.py#L19
# self.smem_capacity = utils.get_smem_capacity_in_bytes(sm_version)
self.smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
```
While this hack works for the currently supported SM versions, it's a bit fragile. Hardcoding "sm_100" will be incorrect if another SM version is added in the future that is supported by get_smem_capacity_in_bytes. A more robust approach would be to only apply the fallback for sm_103. This also makes the intent clearer and the hack easier to remove. I'd also recommend adding a TODO to track this technical debt.
```diff
- # HACK "sm_103" doesn't work yet for the query
- # https://github.com/NVIDIA/cutlass/blob/5016493cc0d8650d5b2f6d2c2751cf49bc217e86/python/CuTeDSL/cutlass/utils/smem_allocator.py#L19
- # self.smem_capacity = utils.get_smem_capacity_in_bytes(sm_version)
- self.smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
+ # TODO: Remove this workaround once nvidia-cutlass-dsl supports sm_103 for smem capacity queries.
+ # HACK: "sm_103" is not yet supported by get_smem_capacity_in_bytes. Using "sm_100" as a fallback.
+ # See: https://github.com/NVIDIA/cutlass/blob/5016493cc0d8650d5b2f6d2c2751cf49bc217e86/python/CuTeDSL/cutlass/utils/smem_allocator.py#L19
+ smem_query_version = "sm_100" if sm_version == "sm_103" else sm_version
+ self.smem_capacity = utils.get_smem_capacity_in_bytes(smem_query_version)
```
This get_smem_capacity_in_bytes issue has been reported to the CUTLASS team internally; if there's a quick turnaround I'll just patch it later. Also, since this is just one kernel: if it works, it works. This is not wrong.
I have no problem with this at the moment, considering sm_100 and sm_103 should have the same shared memory size.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/gemm/test_cute_dsl_blockscaled_gemm.py (1)
83-88: Device capability gating correctly adds sm_103; consider de-duplicating the source of truth

The new device_ver gating to [(10, 0), (10, 3)] matches the backend's sm_100/sm_103 support and will correctly skip tests elsewhere. The only minor concern is that the list of supported SMs is now duplicated here and in Sm100BlockScaledPersistentDenseGemmKernel; if you expect to add more SMs (e.g., future Blackwell variants), consider exposing a small shared constant or helper so tests and the kernel derive their supported set from the same place to avoid drift.

flashinfer/cute_dsl/blockscaled_gemm.py (2)
532-535: sm_103 addition to supported SM list is consistent with grouped_gemm_nt_masked

Allowing both "sm_100" and "sm_103" here lines up with sm_version = f"sm_{major}{minor}" in grouped_gemm_nt_masked, so SM103 now passes the constructor guard as intended. If you foresee adding more SM variants, you might want to promote supported_sm_versions to a class- or module-level constant and import/reuse it from tests, so the supported set stays in sync across code paths; a possible shape for that is sketched below.
566-570: SMEM capacity hack for sm_103 works but could be more future-proof

Hard-wiring self.smem_capacity to get_smem_capacity_in_bytes("sm_100") is a reasonable short-term workaround to get sm_103 online, but it bakes in two assumptions:

- That sm_103's effective SMEM capacity is ≥ sm_100 (otherwise _compute_stages might choose too many stages).
- That a future CUTLASS version will eventually support "sm_103" but this code will still be pointing at "sm_100" unless someone remembers to remove the hack.

To make this more robust, consider trying get_smem_capacity_in_bytes(sm_version) first and only falling back to "sm_100" when the query actually fails (and only for "sm_103"), with a clear TODO to remove once CUTLASS exposes the right value. Also worth double-checking on target hardware that the sm_100 capacity is indeed a safe bound for sm_103.
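A minimal sketch of that try-first fallback (the exact exception type raised for an unsupported SM string is an assumption):

```python
# Sketch only: query the real sm_version first and fall back to "sm_100"
# solely for the known-broken "sm_103" case.
try:
    self.smem_capacity = utils.get_smem_capacity_in_bytes(sm_version)
except Exception:  # exception type for unsupported SMs is an assumption
    if sm_version != "sm_103":
        raise
    # TODO: remove once CUTLASS supports sm_103 in get_smem_capacity_in_bytes.
    self.smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
```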
📒 Files selected for processing (3)
- flashinfer/cute_dsl/blockscaled_gemm.py (2 hunks)
- requirements.txt (1 hunk)
- tests/gemm/test_cute_dsl_blockscaled_gemm.py (1 hunk)
🧰 Additional context used
🧠 Learnings (1)

Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files: flashinfer/cute_dsl/blockscaled_gemm.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: build (cu130, amd64)
- GitHub Check: build (cu126, amd64)
- GitHub Check: build (cu129, arm64)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
requirements.txt (1)
7-7: CUTLASS DSL version bump aligns with enabling the SM103 backend

Raising the minimum nvidia-cutlass-dsl to >=4.3.1 is consistent with adding SM103 cute-dsl support. Please just confirm this version is available in your target environments/CI images and matches the minimum required by the new SM103 path.
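As one hypothetical way to verify this in CI (this check is not part of the PR):

```python
# Hypothetical sanity check that the installed nvidia-cutlass-dsl satisfies
# the new >=4.3.1 constraint before exercising the sm_103 path.
from importlib.metadata import version
from packaging.version import Version

assert Version(version("nvidia-cutlass-dsl")) >= Version("4.3.1"), (
    "nvidia-cutlass-dsl >= 4.3.1 is required for sm_103 support"
)
```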
UT GB300: pytest tests/gemm/test_cute_dsl_blockscaled_gemm.py (256 passed, 128 xfailed, 768 warnings)

/bot run

[FAILED] Pipeline #39324091: 13/20 passed
yzh119 left a comment
LGTM
📌 Description
Enable the sm103 MoE DSL backend.
Bumped dep version to nvidia-cutlass-dsl>=4.3.1.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running pip install pre-commit (or used my preferred method).
- I have installed the hooks with pre-commit install.
- I have run pre-commit run --all-files and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed, and all tests are passing (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features

- Extended block-scaled GEMM support to the SM103 GPU architecture (in addition to SM100), with a temporary shared-memory capacity workaround.

Chores

- Raised the minimum nvidia-cutlass-dsl dependency requirement to >=4.3.1.