RDNA backend support and architecture-aware tests #109
Draft
Add per-hardware-block counter limits for RDNA2/3/4 backends so the profiler can bin-pack counters optimally instead of falling back to naive 6-per-pass chunking. Also document the provenance of all block limit values (ROCm/rocm-systems aqlprofile headers).

- gfx1201: add _get_counter_groups() + _get_counter_block_limits() with 22 GFX12 blocks (SQG, SQC, GL2C, CHA, CHC, UTCL1, etc.)
- gfx1151: inherits GFX12 limits from gfx1201 automatically
- gfx1030 (new): RDNA2 backend with 23 GFX10 block limits
- gfx1100 (new): RDNA3 backend with 23 GFX11 block limits
- gfx90a, gfx942: add source provenance to block limits docstrings
- __init__.py: register gfx1030-1032 and gfx1100-1103 aliases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
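The packing idea in this commit can be sketched roughly as below. This is a first-fit illustration, not the profiler's actual algorithm: `pack_counters`, `block_of`, the prefix-based block lookup, and the limit values in the usage note are all assumptions made for the example. The real limits come from each backend's _get_counter_block_limits().

```python
from collections import defaultdict


def block_of(counter: str) -> str:
    """Assumption: the hardware block is the prefix before the first underscore."""
    return counter.split("_", 1)[0]


def pack_counters(counters, block_limits, default_limit=1):
    """First-fit packing: place each counter in the first pass whose
    per-block budget still has room; open a new pass otherwise."""
    passes = []  # list of passes, each a list of counter names
    usage = []   # parallel list: counters used per block in each pass
    for counter in counters:
        block = block_of(counter)
        limit = block_limits.get(block, default_limit)
        for i, used in enumerate(usage):
            if used[block] < limit:
                passes[i].append(counter)
                used[block] += 1
                break
        else:
            # No existing pass has room for this block: start a new pass.
            new_usage = defaultdict(int)
            new_usage[block] = 1
            passes.append([counter])
            usage.append(new_usage)
    return passes
```

With hypothetical limits such as `{"GL2C": 4, "GRBM": 2}`, five GL2C counters and two GRBM counters fit in two passes, and the GRBM counters share the first pass with the first four GL2C counters. Chunking purely by count would ignore the per-block budget.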
Fall back to /tmp/metrix for the gpu_query cache when $HOME/.cache is not writable (e.g. shared cluster nodes where $HOME is on a read-only NFS mount).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
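A minimal sketch of this kind of fallback, assuming a hypothetical helper `pick_cache_dir` (not a name from the PR). It probes writability by actually creating a file, since permission bits alone can be misleading on network mounts.

```python
import tempfile
from pathlib import Path


def pick_cache_dir(preferred: Path, fallback: Path) -> Path:
    """Return the first candidate directory that can be created and written to."""
    for candidate in (preferred, fallback):
        try:
            candidate.mkdir(parents=True, exist_ok=True)
            # Probe with a real file; it is deleted when the block exits.
            with tempfile.TemporaryFile(dir=candidate):
                pass
            return candidate
        except OSError:
            # Covers PermissionError, read-only filesystems, and similar failures.
            continue
    raise PermissionError(f"no writable cache directory among {preferred}, {fallback}")
```

A call along the lines of `pick_cache_dir(Path.home() / ".cache" / "metrix", Path("/tmp/metrix"))` would match the behaviour described in this commit.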
Compile gpu_query.hip to a temp file instead of caching in $HOME/.cache/metrix. The cache caused PermissionError on cluster nodes where $HOME is on a read-only NFS mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoid recompiling gpu_query.hip on every get_backend() call by caching the compiled binary path in a module-level variable. This gives one hipcc invocation per Python process instead of one per backend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
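The module-level caching pattern looks roughly like this. The names `_binary_path` and `get_query_binary` are illustrative, and the compile step is passed in as a callable so the sketch runs without hipcc. The actual code presumably invokes hipcc directly.

```python
import threading

# Process-wide cache: populated on first use, reused afterwards.
_binary_path = None
_binary_lock = threading.Lock()


def get_query_binary(compile_fn):
    """Return the cached binary path, calling compile_fn() only the first time."""
    global _binary_path
    with _binary_lock:
        if _binary_path is None:
            _binary_path = compile_fn()
        return _binary_path
```

The lock is an optional extra in case backends are created from several threads. It is not something the commit message mentions.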
query_device_specs() used to raise on arch mismatch (e.g. requesting gfx1100 on a gfx1103 device). But get_backend() already maps aliases to the right backend class, so the strict check only breaks legitimate mappings like gfx1103 -> GFX1100Backend and gfx950 -> GFX942Backend. query_device_specs() now uses the arch reported by the hardware directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
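To illustrate why the strict check was redundant, here is a small sketch. `BACKEND_ALIASES`, `resolve_backend_name`, and `device_arch_for_specs` are hypothetical stand-ins, not the repository's functions. Only the two mappings named in the commit message are taken from the source.

```python
# Alias table, limited to mappings mentioned in this PR (plus the identity entries).
BACKEND_ALIASES = {
    "gfx1100": "GFX1100Backend",
    "gfx1103": "GFX1100Backend",
    "gfx942": "GFX942Backend",
    "gfx950": "GFX942Backend",
}


def resolve_backend_name(arch: str) -> str:
    """Map an arch string (possibly an alias) to a backend class name."""
    try:
        return BACKEND_ALIASES[arch]
    except KeyError:
        raise ValueError(f"unsupported arch: {arch}") from None


def device_arch_for_specs(requested_arch: str, detected_arch: str) -> str:
    """Use the hardware-reported arch; the requested arch only selects the backend."""
    return detected_arch
```

Under this split, requesting gfx1100 on a gfx1103 device still selects GFX1100Backend, while device specs describe the gfx1103 hardware that is actually present.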
Add gfx1030/gfx1100/gfx1103 definitions for metrics that have available hardware counters on RDNA:

- compute.gpu_utilization (GRBM_GUI_ACTIVE / GRBM_COUNT)
- memory.l2_hit_rate (GL2C_HIT / GL2C_MISS)
- memory.l2_bandwidth (GL2C hits+misses * 128B cacheline)
- memory.bytes_transferred_l2 (GL2C total * 128B)
- memory.hbm_read_bandwidth (GL2C_EA_RDREQ 32B/64B/96B/128B)
- memory.hbm_write_bandwidth (GL2C_MC_WRREQ / GL2C_EA_WRREQ_64B)
- memory.hbm_bandwidth_utilization (read+write vs peak)
- memory.bytes_transferred_hbm (total read+write bytes)
- memory.lds_bank_conflicts (SQC_LDS_BANK_CONFLICT / SQC_LDS_IDX_ACTIVE)

Metrics NOT ported (no counters on RDNA2/3):

- compute.total_flops: no per-dtype VALU counters
- compute.*_arithmetic_intensity: needs FLOPS
- memory.l1_hit_rate: no TCP counters exposed
- memory.coalescing_efficiency: needs TCP counters

Counter names verified via rocprof --list-basic on real gfx1103 hw.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
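As a rough illustration of how the L2 metrics above could be derived from raw counter values: the 128-byte cacheline factor comes from the commit message, while the hit-rate formula (hits over hits plus misses, as a percentage) and the function names are assumptions. The authoritative definitions live in the backend implementations.

```python
CACHELINE_BYTES = 128  # per the commit message: GL2C requests are counted per 128B line


def l2_hit_rate(gl2c_hit: int, gl2c_miss: int) -> float:
    """Assumed formula: percentage of GL2C requests that hit."""
    total = gl2c_hit + gl2c_miss
    return 100.0 * gl2c_hit / total if total else 0.0


def bytes_transferred_l2(gl2c_hit: int, gl2c_miss: int) -> int:
    """Total L2 traffic: every hit or miss moves one cacheline."""
    return (gl2c_hit + gl2c_miss) * CACHELINE_BYTES


def l2_bandwidth_gbps(gl2c_hit: int, gl2c_miss: int, duration_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second)."""
    return bytes_transferred_l2(gl2c_hit, gl2c_miss) / duration_ns
```

For example, 75 hits and 25 misses over 100 ns would correspond to 12,800 bytes of L2 traffic.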
- API now filters unavailable metrics from profiles/presets with warnings instead of crashing. Explicit metric requests get a clear error listing available alternatives. Falls back to time-only mode when no metrics are available.
- Add requires_metric() test decorator that skips based on actual backend metric availability (no hardcoded arch families).
- Fix test_init_default hardcoded arch allowlist.
- Integration tests skip gracefully when CDNA-only metrics (coalescing, FLOPS, arithmetic intensity, L1 hit rate) are unavailable on the current GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
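The filtering behaviour described in the first bullet might look something like the sketch below. `select_metrics` and its `explicit` flag are hypothetical names for illustration. An empty result stands in for the time-only fallback.

```python
import warnings


def select_metrics(requested, available, explicit=False):
    """Keep only metrics the backend supports.

    explicit=True  -> the user named the metrics directly: raise on any gap.
    explicit=False -> metrics came from a profile/preset: warn and drop them.
    """
    missing = [m for m in requested if m not in available]
    if explicit and missing:
        raise ValueError(
            f"unavailable metrics: {missing}; available: {sorted(available)}"
        )
    for metric in missing:
        warnings.warn(f"skipping metric not supported on this GPU: {metric}")
    # An empty list signals time-only profiling to the caller.
    return [m for m in requested if m in available]
```

A requires_metric()-style test decorator could use the same availability set to decide whether to skip, rather than matching on arch names.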
…alog

The catalog is display metadata only. Actual computation lives in counter_defs.yaml and the backend implementations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Hardware validation
Test plan
example.py runs end-to-end on each validated arch

🤖 Generated with Claude Code