RDNA backend support and architecture-aware tests#109

Draft
mawad-amd wants to merge 13 commits into main from muhaawad/rdna

Member

mawad-amd commented Apr 1, 2026

Summary

  • Add RDNA2 (gfx1030) and RDNA3 (gfx1100) backends with counter block limits and YAML metric definitions
  • Add GFX12 counter block limits to the existing RDNA4 (gfx1201) backend
  • Make tests and the profiling API architecture-aware: profiles and presets filter out unavailable metrics (with a warning) instead of crashing, and explicit requests for an unavailable metric raise a clear error listing the available alternatives
  • Document counter block limit provenance (ROCm/rocm-systems aqlprofile headers)
  • Fix device_info.py so it works when the home directory is not writable, and cache the compiled gpu_query binary in-process

Hardware validation

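| Arch    | GPU                 | Status |
|---------|---------------------|--------|
| gfx942  | MI300X              | done   |
| gfx942  | MI325X              | done   |
| gfx950  | MI350X              | done   |
| gfx950  | MI355X              | done   |
| gfx1030 | Radeon PRO V620     | done   |
| gfx1103 | Radeon 780M         | done   |
| gfx1151 | Radeon (RDNA3)      |        |
| gfx1201 | Radeon AI PRO R9700 | done   |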

Test plan

  • Unit tests pass on gfx942, gfx1103, gfx1030
  • Integration tests pass (minus timeouts on slow iGPUs)
  • example.py runs end-to-end on each validated arch
  • Test on gfx1151

🤖 Generated with Claude Code

mawad-amd and others added 13 commits March 31, 2026 23:16
Add per-hardware-block counter limits for RDNA2/3/4 backends so
the profiler can bin-pack counters optimally instead of falling
back to naive 6-per-pass chunking. Also document the provenance
of all block limit values (ROCm/rocm-systems aqlprofile headers).

- gfx1201: add _get_counter_groups() + _get_counter_block_limits()
  with 22 GFX12 blocks (SQG, SQC, GL2C, CHA, CHC, UTCL1, etc.)
- gfx1151: inherits GFX12 limits from gfx1201 automatically
- gfx1030 (new): RDNA2 backend with 23 GFX10 block limits
- gfx1100 (new): RDNA3 backend with 23 GFX11 block limits
- gfx90a, gfx942: add source provenance to block limits docstrings
- __init__.py: register gfx1030-1032 and gfx1100-1103 aliases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
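
To illustrate what per-block limits buy over fixed 6-per-pass chunking, here is a minimal sketch of greedy first-fit packing. It is not the profiler's actual packer: `pack_counters`, the prefix-based block lookup, and the limit values in `EXAMPLE_LIMITS` are all assumptions made up for the example (the real limits come from the aqlprofile headers), and first-fit is a heuristic rather than a guaranteed-optimal packing.

```python
from collections import defaultdict

# Hypothetical limits for illustration only; NOT the values from aqlprofile.
EXAMPLE_LIMITS = {"GRBM": 2, "GL2C": 1, "SQC": 2}


def pack_counters(counters, block_limits, default_limit=1):
    """Greedy first-fit: put each counter in the first pass whose block has room."""
    passes = []  # each entry: {"counters": [...], "used": {block: count}}
    for name in counters:
        # Assumption: the hardware block is the counter-name prefix before "_".
        block = name.split("_", 1)[0]
        limit = block_limits.get(block, default_limit)
        for p in passes:
            if p["used"][block] < limit:
                p["counters"].append(name)
                p["used"][block] += 1
                break
        else:
            # No existing pass has room for this block, so open a new pass.
            used = defaultdict(int)
            used[block] = 1
            passes.append({"counters": [name], "used": used})
    return [p["counters"] for p in passes]


passes = pack_counters(
    [
        "GRBM_GUI_ACTIVE", "GRBM_COUNT",
        "GL2C_HIT", "GL2C_MISS",
        "SQC_LDS_BANK_CONFLICT", "SQC_LDS_IDX_ACTIVE",
    ],
    EXAMPLE_LIMITS,
)
for i, names in enumerate(passes, start=1):
    print(f"pass {i}: {names}")
```

With these made-up limits, only `GL2C_MISS` spills into a second pass, because the example allows a single GL2C counter per pass.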
Fall back to /tmp/metrix for the gpu_query cache when
$HOME/.cache is not writable (e.g. shared cluster nodes
where $HOME is on a read-only NFS mount).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
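
The fallback described in this commit could look roughly like the sketch below. `pick_cache_dir` is a hypothetical helper, not the function in device_info.py, and the demo simulates an unusable preferred location inside a temporary directory instead of touching `$HOME/.cache` or `/tmp/metrix`.

```python
import os
import tempfile
from pathlib import Path


def pick_cache_dir(preferred, fallback):
    """Return the first candidate directory that can be created and written to."""
    for candidate in (Path(preferred), Path(fallback)):
        try:
            candidate.mkdir(parents=True, exist_ok=True)
        except OSError:
            # Covers read-only mounts, permission errors, and bad parent paths.
            continue
        if os.access(candidate, os.W_OK):
            return candidate
    raise PermissionError(f"no writable cache dir among {preferred!s}, {fallback!s}")


# Demo: a regular file stands in for an unusable parent directory, so creating
# the preferred cache dir fails and the fallback is chosen.
with tempfile.TemporaryDirectory() as tmp:
    blocker = Path(tmp) / "not_a_directory"
    blocker.write_text("")
    fallback_dir = Path(tmp) / "fallback"
    chosen = pick_cache_dir(blocker / "cache", fallback_dir)
    used_fallback = chosen == fallback_dir
    print("used fallback:", used_fallback)
```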
Compile gpu_query.hip to a temp file instead of caching in
$HOME/.cache/metrix. The cache caused PermissionError on cluster
nodes where $HOME is on a read-only NFS mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoid recompiling gpu_query.hip on every get_backend() call
by caching the compiled binary path in a module-level variable.
One hipcc invocation per Python process instead of per backend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
query_device_specs() used to raise on arch mismatch (e.g.
requesting gfx1100 on a gfx1103 device). But get_backend()
already maps aliases to the right backend class, so the strict
check just breaks legitimate mappings like gfx1103 -> GFX1100Backend
and gfx950 -> GFX942Backend.

Now uses the arch reported by the hardware directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
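
The alias behaviour can be sketched as follows. Only the gfx1103 -> GFX1100Backend and gfx950 -> GFX942Backend mappings are stated in the commit message; the registry layout, the `backend_for` helper, the `GFX1030Backend` class name, and the constructor signature are assumptions for illustration.

```python
class Backend:
    def __init__(self, arch):
        # Keep the arch the hardware reported, not the backend's canonical arch.
        self.arch = arch


class GFX942Backend(Backend):
    pass


class GFX1030Backend(Backend):  # hypothetical class name
    pass


class GFX1100Backend(Backend):
    pass


# Hypothetical registry: several reported archs share one backend class.
_BACKENDS = {}
for _arch in ("gfx942", "gfx950"):
    _BACKENDS[_arch] = GFX942Backend
for _arch in ("gfx1030", "gfx1031", "gfx1032"):
    _BACKENDS[_arch] = GFX1030Backend
for _arch in ("gfx1100", "gfx1101", "gfx1102", "gfx1103"):
    _BACKENDS[_arch] = GFX1100Backend


def backend_for(reported_arch):
    """Pick the backend class by alias; no strict arch-equality check."""
    try:
        cls = _BACKENDS[reported_arch]
    except KeyError:
        raise ValueError(f"unsupported architecture: {reported_arch}") from None
    return cls(reported_arch)


backend = backend_for("gfx1103")
print(type(backend).__name__, backend.arch)
```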
Add gfx1030/gfx1100/gfx1103 definitions for metrics that have
available hardware counters on RDNA:

- compute.gpu_utilization (GRBM_GUI_ACTIVE / GRBM_COUNT)
- memory.l2_hit_rate (GL2C_HIT / GL2C_MISS)
- memory.l2_bandwidth (GL2C hits+misses * 128B cacheline)
- memory.bytes_transferred_l2 (GL2C total * 128B)
- memory.hbm_read_bandwidth (GL2C_EA_RDREQ 32B/64B/96B/128B)
- memory.hbm_write_bandwidth (GL2C_MC_WRREQ / GL2C_EA_WRREQ_64B)
- memory.hbm_bandwidth_utilization (read+write vs peak)
- memory.bytes_transferred_hbm (total read+write bytes)
- memory.lds_bank_conflicts (SQC_LDS_BANK_CONFLICT / SQC_LDS_IDX_ACTIVE)

Metrics NOT ported (no counters on RDNA2/3):
- compute.total_flops — no per-dtype VALU counters
- compute.*_arithmetic_intensity — needs FLOPS
- memory.l1_hit_rate — no TCP counters exposed
- memory.coalescing_efficiency — needs TCP counters

Counter names verified via rocprof --list-basic on real gfx1103 hw.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
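
As a rough illustration of how two of the GL2C-based metrics above could be derived from raw counter values: the 128-byte cacheline factor comes from the list above, while the hit-rate formula (hits over hits plus misses, as a percentage), the function names, and the sample counter values are assumptions. The authoritative expressions live in the YAML metric definitions.

```python
CACHELINE_BYTES = 128  # per the metric list: GL2C total * 128B


def bytes_transferred_l2(counters):
    """Total L2 traffic in bytes: (hits + misses) * cacheline size."""
    return (counters["GL2C_HIT"] + counters["GL2C_MISS"]) * CACHELINE_BYTES


def l2_hit_rate(counters):
    """Assumed formula: hits / (hits + misses), expressed as a percentage."""
    total = counters["GL2C_HIT"] + counters["GL2C_MISS"]
    return 100.0 * counters["GL2C_HIT"] / total if total else 0.0


# Made-up counter values, not measurements from any GPU.
sample = {"GL2C_HIT": 750, "GL2C_MISS": 250}
print(bytes_transferred_l2(sample), l2_hit_rate(sample))
```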
- API now filters unavailable metrics from profiles/presets with
  warnings instead of crashing. Explicit metric requests get a
  clear error listing available alternatives. Falls back to
  time-only mode when no metrics are available.
- Add requires_metric() test decorator that skips based on actual
  backend metric availability (no hardcoded arch families).
- Fix test_init_default hardcoded arch allowlist.
- Integration tests skip gracefully when CDNA-only metrics
  (coalescing, FLOPS, arithmetic intensity, L1 hit rate) are
  unavailable on the current GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
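
The filtering rule in the first bullet can be sketched as below. `resolve_metrics` and its parameters are hypothetical and do not reflect the actual API surface; the metric names are taken from the commit messages above.

```python
import warnings


def resolve_metrics(requested, available, explicit):
    """Filter profile metrics with warnings; reject explicit unavailable ones."""
    available = set(available)
    missing = [m for m in requested if m not in available]
    if missing and explicit:
        raise ValueError(
            f"metrics not available on this GPU: {missing}; "
            f"available: {sorted(available)}"
        )
    for m in missing:
        warnings.warn(f"skipping unavailable metric {m!r}")
    # An empty result means the caller falls back to time-only mode.
    return [m for m in requested if m in available]


rdna_metrics = {"compute.gpu_utilization", "memory.l2_hit_rate"}
profile = ["memory.l2_hit_rate", "memory.l1_hit_rate", "compute.total_flops"]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = resolve_metrics(profile, rdna_metrics, explicit=False)

print(kept, len(caught))
```

Splitting the behaviour on `explicit` keeps presets portable across architectures while still failing loudly when a user asks for a specific metric by name.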
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alog

The catalog is display metadata only. Actual computation lives in
counter_defs.yaml and the backend implementations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>