Replace hardcoded CU/SM values with dynamic detection using HIP runtime #198

Copilot · 2025-10-06T00:21:22Z

Problem

The codebase had hardcoded values for the number of Compute Units (CUs) / Streaming Multiprocessors (SMs) throughout examples and benchmark scripts:

Total SMs: hardcoded to 304 (MI300X) or 104 (MI250)
GEMM SMs: hardcoded to 288, 256, or 304 depending on algorithm
Communication SMs: hardcoded to 48 or similar values

These hardcoded values prevented the code from working optimally on different GPU architectures without manual modification.

Solution

Replaced hardcoded values with dynamic SM allocation logic inlined at each call site. The logic calls get_cu_count(), which queries the device's CU count through the HIP runtime, and then applies algorithm-specific calculations:

1. all_scatter algorithm: Uses total CU count

Example: 304 CUs → 304 SMs for GEMM

2. wg_specialized algorithm: Uses the largest power of 2 that does not exceed the CU count

Example: 304 CUs → 256 SMs for GEMM

3. all_reduce algorithm: Reserves one third (rounded down) of the CUs left over after the power-of-2 allocation for communication

Example: 304 CUs → 288 SMs for GEMM (304 - (304-256)÷3 = 288)
This leaves 16 SMs for communication operations
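Written out with C as the detected CU count, the three rules above reduce to the following. This restates the formulas in this description; the floor brackets stand for the integer truncation (int(...) and //) that the inlined code performs.

```latex
\begin{aligned}
P &= 2^{\lfloor \log_2 C \rfloor} && \text{(largest power of 2 not exceeding } C\text{)} \\
\text{gemm\_sms}_{\text{all\_scatter}} &= C \\
\text{gemm\_sms}_{\text{wg\_specialized}} &= P \\
\text{gemm\_sms}_{\text{all\_reduce}} &= C - \left\lfloor \frac{C - P}{3} \right\rfloor
\end{aligned}
```

For C = 304 this gives P = 256 and 304 - floor(48 / 3) = 288; for C = 104 it gives P = 64 and 104 - floor(40 / 3) = 91.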

Changes

Updated 11 files across the repository:

GEMM examples (07, 08, 09, 10, 11, 12): Changed default argument values from hardcoded numbers to None, with inlined runtime detection logic in worker functions
Benchmark scripts: Added auto-detection with fallback to partition-based heuristics
Utility scripts: Updated to use dynamic detection where possible

All command-line arguments still support manual override for tuning and testing purposes.
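The "default to None, resolve at runtime" pattern described above, with manual override preserved, can be sketched with argparse. This is a minimal illustration, not code from the PR: the flag name --gemm_sms, the function parse_args_with_defaults, and the constant DETECTED_CU_COUNT (a stand-in for get_cu_count() so the sketch runs without a GPU) are all assumptions.

```python
import argparse

# Hypothetical stand-in for get_cu_count(); a fixed value keeps the sketch
# runnable on a machine without a GPU.
DETECTED_CU_COUNT = 304


def parse_args_with_defaults(argv):
    """Hypothetical parser: a None default marks 'not set by the user'."""
    parser = argparse.ArgumentParser()
    # Flag name is illustrative; the examples' real flags may differ.
    parser.add_argument("--gemm_sms", type=int, default=None)
    args = vars(parser.parse_args(argv))  # dict access, like args["gemm_sms"]
    if args["gemm_sms"] is None:
        # No manual override: fill in from the detected CU count (all_scatter rule).
        args["gemm_sms"] = DETECTED_CU_COUNT
    return args


print(parse_args_with_defaults([])["gemm_sms"])                    # 304
print(parse_args_with_defaults(["--gemm_sms", "128"])["gemm_sms"])  # 128
```

Using None rather than a numeric default lets the script tell "the user asked for a specific value" apart from "nothing was specified", so detection only runs when no override is given.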

Implementation

The SM calculation logic is inlined at each call site rather than using a helper function, making the allocation strategy explicit and localized:

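import math

# cu_count: CU count returned by get_cu_count() (HIP runtime query)

# For all_scatter: use every CU
args["gemm_sms"] = cu_count

# For wg_specialized: largest power of 2 <= cu_count
args["gemm_sms"] = 2 ** int(math.log2(cu_count)) if cu_count > 0 else 1

# For all_reduce: give GEMM all CUs except one third of the leftover
next_pow2 = 2 ** int(math.log2(cu_count)) if cu_count > 0 else 1
leftover = cu_count - next_pow2
args["gemm_sms"] = cu_count - leftover // 3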
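The snippets above find the power of 2 through a floating-point log2. An integer-only alternative, which is not what the PR uses, is int.bit_length(): the highest set bit of n is the largest power of 2 not exceeding n. The sketch below compares the two; both function names are hypothetical and exist only for this illustration. Note that when the CU count is already a power of 2 (e.g. 256), both return the count itself rather than the next smaller power.

```python
import math


def pow2_floor_log2(cu_count):
    # Expression used in this PR: float log2, truncated to an integer exponent.
    return 2 ** int(math.log2(cu_count)) if cu_count > 0 else 1


def pow2_floor_bits(cu_count):
    # Integer-only alternative: keep only the highest set bit of cu_count.
    return 1 << (cu_count.bit_length() - 1) if cu_count > 0 else 1


# The two agree for realistic CU counts, including exact powers of 2.
for n in (1, 64, 104, 110, 256, 304):
    assert pow2_floor_log2(n) == pow2_floor_bits(n)
    print(n, pow2_floor_bits(n))
```

For CU counts of this size the float version is exact; the bit_length form simply avoids relying on floating-point rounding, which can in principle be off by one just below very large powers of 2.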

Testing

Verified that the formulas produce the expected results:

MI300X (304 CUs):

all_scatter: 304 ✓
wg_specialized: 256 ✓
all_reduce: 288 ✓
comm_sms: 48 (304 - 256) ✓

MI250 (104 CUs):

all_scatter: 104 ✓
wg_specialized: 64 ✓
all_reduce: 91 ✓
comm_sms: 40 (104 - 64) ✓
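The values in the table above can be reproduced with a short script that applies the same inline expressions to both CU counts. It is a sketch for checking the arithmetic, not a test from the PR; in particular, computing comm_sms as the CUs left after the power-of-2 GEMM allocation is an assumption based on the issue's example (256 GEMM SMs and 48 comm SMs on 304 CUs).

```python
import math

# CU counts listed in the Testing section above.
GPUS = {"MI300X": 304, "MI250": 104}

table = {}
for name, cu_count in GPUS.items():
    # Same inline expressions as the Implementation section.
    next_pow2 = 2 ** int(math.log2(cu_count)) if cu_count > 0 else 1
    leftover = cu_count - next_pow2
    table[name] = {
        "all_scatter": cu_count,
        "wg_specialized": next_pow2,
        "all_reduce": cu_count - leftover // 3,
        # Assumption: comm_sms = CUs left after the power-of-2 GEMM allocation.
        "comm_sms": leftover,
    }

for name, row in table.items():
    print(name, row)
```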

All files pass ruff linting and Python syntax validation.

Benefits

Portability: Code now works across different AMD GPU architectures without modification
Maintainability: No need to update hardcoded values when new GPUs are released
Accuracy: Defaults always reflect the CU count the device actually reports
Flexibility: Manual override still available via CLI arguments
Clarity: Inlined logic makes SM allocation strategy explicit at each usage site

Fixes #197

Original prompt

Issue title: Do not use hardcoded number of CUs

Issue description: The code base has hardcoded values for the number of CUs (or SMs), such as 304, and other places where we define gemm_sms to be the next smaller power of 2 (e.g., 256 when the number of CUs is 304, in which case comm_sms is 48).

All these numbers need to be set programmatically (even for default argument values). Use the hip module to get the number of CUs.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

mawad-amd · 2025-10-06T00:44:00Z

iris/hip.py

    return cu_count.value


+def get_default_gemm_sms(device_id=None, algorithm="all_scatter"):


@copilot do not add new functions. Inline the logic at call site.

Removed the get_default_gemm_sms() function and inlined the SM calculation logic at each call site. Changes in commit 3b2e557.

Initial plan

a3249a4

Copilot AI assigned Copilot and mawad-amd Oct 6, 2025

Copilot started work on behalf of mawad-amd October 6, 2025 00:22

Add dynamic CU count detection and helper functions

af9d39c

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Do not use hardcoded number of CUs" to "Replace hardcoded CU/SM values with dynamic detection using HIP runtime" Oct 6, 2025

Copilot AI requested a review from mawad-amd October 6, 2025 00:38

Copilot finished work on behalf of mawad-amd October 6, 2025 00:38

mawad-amd reviewed Oct 6, 2025

Copilot started work on behalf of mawad-amd October 6, 2025 00:44

Inline SM calculation logic at call sites instead of using helper function

9381677

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot AI requested a review from mawad-amd October 6, 2025 00:50

Copilot finished work on behalf of mawad-amd October 6, 2025 00:50
