WIP: MI300X kernarg debug — CDNA fixes, SNAP, AMD pool (partial) by Zaneham · Pull Request #60 · Zaneham/BarraCUDA

Zaneham · 2026-03-11T08:11:38Z

Graph coloring regalloc merge (PR#53) + 3 CDNA fixes (rw_ops ordering, VGPR floor, rw_ops kind mapping). Kernel dispatches without fault/hang on MI300X but all 26 kernarg params read as zero from GPU side.

SNAP instrumentation added (--snap flag). kernarg_size clipping fixed. AMD memory pool API types added but not yet wired into dispatch — leading theory is hsa_memory_allocate gives memory SMEM can't read on MI300X, need hsa_amd_memory_pool_allocate instead.

Debug AQL packet dump still present in bc_dispatch. Clean up after the pool alloc theory is tested.

CI will fail; this is a draft :-)


Standard hsa_memory_allocate from HSA kernarg region gives memory the scalar unit cannot read on MI300X — all params seen as zero. HIP uses hsa_amd_memory_pool_allocate internally. Do the same. Adds pool discovery callback, 4 AMD dlsym symbols, switches bc_dispatch to amd_palloc/amd_pfree. Debug AQL dump removed.
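The pool discovery step described above can be sketched without ROCm headers. This is a minimal illustration, not the code from this PR: `mock_pool_t`, the `SEG_*` and `POOL_*` constants, `find_kernarg_pool` and `iterate_pools` are all hypothetical stand-ins, and the selection criteria (global segment, allocation allowed, kernarg flag set) are an assumption about what a kernarg pool search would check. In the real runtime the iteration would go through `hsa_amd_agent_iterate_memory_pools` and the attributes would be queried with `hsa_amd_memory_pool_get_info`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in constants for this sketch only; the real values and names
 * come from the ROCm header hsa_ext_amd.h. */
enum { SEG_GLOBAL = 0, SEG_OTHER = 1 };
enum { POOL_KERNARG = 1u << 0, POOL_FINE = 1u << 1, POOL_COARSE = 1u << 2 };

/* Hypothetical pool descriptor: the attributes a discovery callback
 * would normally query from the runtime, stored as plain fields. */
typedef struct {
    int      segment;        /* which memory segment the pool lives in */
    uint32_t global_flags;   /* POOL_* bits */
    int      alloc_allowed;  /* nonzero if the runtime permits allocation */
} mock_pool_t;

/* Search state threaded through the callback's void* argument. */
typedef struct {
    const mock_pool_t *kernarg_pool;  /* first match, or NULL */
} pool_search_t;

/* Callback: records the first allocatable global pool that carries the
 * kernarg flag. Always returns 0 so iteration continues. */
static int find_kernarg_pool(const mock_pool_t *pool, void *data)
{
    pool_search_t *s = data;
    assert(pool != NULL && s != NULL);

    if (s->kernarg_pool != NULL)             return 0;  /* already found */
    if (pool->segment != SEG_GLOBAL)         return 0;
    if (!pool->alloc_allowed)                return 0;
    if (!(pool->global_flags & POOL_KERNARG)) return 0;

    s->kernarg_pool = pool;
    return 0;
}

/* Mock iterator with the same shape as a runtime "iterate pools" call:
 * invokes cb on every pool and stops early on a nonzero return. */
static int iterate_pools(const mock_pool_t *pools, size_t n,
                         int (*cb)(const mock_pool_t *, void *), void *data)
{
    for (size_t i = 0; i < n; i++) {
        int rc = cb(&pools[i], data);
        if (rc != 0) return rc;
    }
    return 0;
}
```

A caller would zero-initialise a `pool_search_t`, run `iterate_pools` with `find_kernarg_pool`, and allocate the kernarg buffer from `kernarg_pool` if it is non-NULL, falling back to the old path otherwise.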

Eliminates 100% of VGPR spills on a 654-line Monte Carlo transport kernel (186 → 0). Total scratch traffic drops 78%. Instruction count drops 28% (9,448 → 6,761).

The key insight from Sampaio et al. (2013, §6): on Wave64, spilling a divergent VGPR costs 64 dwords of scratch (one per lane). Spilling a uniform VGPR costs 1 dword via v_readfirstlane. The old allocator treated all spills equally; the new one exploits the 64:1 cost ratio.

Algorithm: Cooper et al. (2001) dominator tree, Braun & Hack (2009) loop-depth weighting, SSA liveness with PHI-aware dataflow, divergence-weighted spill cost, greedy SSA coloring with divergence-aware victim selection, 4-path spill codegen (remat / uniform VGPR / divergent VGPR / SGPR), post-RA phi elimination with free coalescing.

All static memory (~30 MB), no malloc. ~1,300 lines of C99. Operates on SSA form before phi elimination.

Enabled via: barracuda --ssa-ra

90/91 tests pass (1 skipped, unchanged from baseline).
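To make the 64:1 ratio concrete, here is a minimal sketch of a divergence-weighted spill cost in C99. It is not the allocator from this PR: `val_info_t`, `spill_cost`, `pick_victim` and the 10^depth loop weight are assumptions chosen for illustration. The only facts taken from the text above are that a divergent Wave64 VGPR spill moves 64 dwords while a uniform one moves 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WAVE_LANES = 64 };  /* Wave64: one dword per lane */

/* Hypothetical per-value summary; field names are illustrative only. */
typedef struct {
    uint32_t n_uses;      /* number of use sites */
    uint32_t n_defs;      /* number of definition sites */
    uint32_t loop_depth;  /* deepest loop nest containing an access */
    int      divergent;   /* nonzero if lanes may hold different values */
} val_info_t;

/* Scratch dwords moved by one spill store or reload of this value. */
static uint32_t spill_dwords(const val_info_t *v)
{
    return v->divergent ? WAVE_LANES : 1u;
}

/* Assumed loop weight of 10^depth, with depth capped at 6 so the
 * product below stays far from 64-bit overflow. */
static uint64_t loop_weight(uint32_t depth)
{
    uint64_t w = 1;
    if (depth > 6) depth = 6;
    while (depth--) w *= 10;
    return w;
}

/* Estimated scratch traffic if this value is spilled:
 * accesses x loop weight x dwords per access. */
static uint64_t spill_cost(const val_info_t *v)
{
    assert(v != NULL);
    uint64_t accesses = (uint64_t)v->n_defs + v->n_uses;
    return accesses * loop_weight(v->loop_depth) * spill_dwords(v);
}

/* Returns the index of the cheapest value to spill, or -1 if n == 0. */
static int pick_victim(const val_info_t *vals, int n)
{
    int best = -1;
    uint64_t best_cost = 0;
    for (int i = 0; i < n; i++) {
        uint64_t c = spill_cost(&vals[i]);
        if (best < 0 || c < best_cost) {
            best = i;
            best_cost = c;
        }
    }
    return best;
}
```

With identical use counts and loop depth, a uniform value is 64 times cheaper to evict than a divergent one under this model, so `pick_victim` prefers the uniform candidate whenever one is available.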

Zaneham marked this pull request as draft March 11, 2026 08:13

Zaneham added the enhancement New feature or request label Mar 11, 2026

Zaneham marked this pull request as ready for review March 13, 2026 14:00

Zaneham merged commit 86d89d1 into master Mar 13, 2026
3 checks passed

Zaneham merged 3 commits into master from mi300x-kernarg-debug
