WIP: MI300X kernarg debug — CDNA fixes, SNAP, AMD pool (partial)#60
Merged
Graph coloring regalloc merge (PR#53) + 3 CDNA fixes (rw_ops ordering, VGPR floor, rw_ops kind mapping). Kernel dispatches without fault/hang on MI300X but all 26 kernarg params read as zero from GPU side. SNAP instrumentation added (--snap flag). kernarg_size clipping fixed. AMD memory pool API types added but not yet wired into dispatch — leading theory is hsa_memory_allocate gives memory SMEM can't read on MI300X, need hsa_amd_memory_pool_allocate instead. Debug AQL packet dump still present in bc_dispatch. Clean up after the pool alloc theory is tested.
Standard hsa_memory_allocate from the HSA kernarg region returns memory that the scalar unit cannot read on MI300X, so all params are seen as zero. HIP uses hsa_amd_memory_pool_allocate internally, so do the same. Adds a pool discovery callback and 4 AMD dlsym symbols, and switches bc_dispatch to amd_palloc/amd_pfree. Debug AQL dump removed.
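The selection logic inside a pool discovery callback can be sketched without the HSA runtime. This is only an illustration, not the code in this PR: `pool_info_t`, `pool_search_t`, `pool_visit`, and `find_kernarg_pool` are hypothetical names, and the segment and flag values are assumed to mirror `hsa_ext_amd.h` (check the header before relying on them). In the real callback the three attributes would come from `hsa_amd_memory_pool_get_info`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror hsa_ext_amd.h; verify against the installed header. */
enum { SEG_GLOBAL = 0 };
enum {
    POOL_FLAG_KERNARG_INIT   = 1u,
    POOL_FLAG_FINE_GRAINED   = 2u,
    POOL_FLAG_COARSE_GRAINED = 4u
};

/* Hypothetical snapshot of the attributes a discovery callback would query. */
typedef struct {
    uint64_t handle;        /* opaque pool handle */
    uint32_t segment;       /* memory segment the pool belongs to */
    uint32_t global_flags;  /* bitmask of POOL_FLAG_* */
    int      alloc_allowed; /* nonzero if runtime allocation is permitted */
} pool_info_t;

typedef struct {
    uint64_t kernarg_pool;  /* handle of the chosen pool */
    int      found;
} pool_search_t;

/* Visit one pool. Returns 1 to stop iterating, 0 to keep going. */
static int pool_visit(const pool_info_t *p, pool_search_t *out)
{
    assert(p != NULL && out != NULL);
    if (p->segment != SEG_GLOBAL || !p->alloc_allowed)
        return 0;
    if (!(p->global_flags & POOL_FLAG_KERNARG_INIT))
        return 0;
    out->kernarg_pool = p->handle;
    out->found = 1;
    return 1;
}

/* Stand-in for the runtime's pool iterator: walk an array of pools. */
static int find_kernarg_pool(const pool_info_t *pools, size_t n,
                             pool_search_t *out)
{
    out->kernarg_pool = 0;
    out->found = 0;
    for (size_t i = 0; i < n; i++)
        if (pool_visit(&pools[i], out))
            break;
    return out->found;
}
```

With the real runtime, the chosen handle would then be passed to hsa_amd_memory_pool_allocate in place of the hsa_memory_allocate call.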
Eliminates 100% of VGPR spills on a 654-line Monte Carlo transport kernel (186 → 0). Total scratch traffic drops 78%. Instruction count drops 28% (9,448 → 6,761).

The key insight from Sampaio et al. (2013, §6): on Wave64, spilling a divergent VGPR costs 64 dwords of scratch (one per lane), while spilling a uniform VGPR costs 1 dword via v_readfirstlane. The old allocator treated all spills equally; the new one exploits the 64:1 cost ratio.

Algorithm: Cooper et al. (2001) dominator tree, Braun & Hack (2009) loop-depth weighting, SSA liveness with PHI-aware dataflow, divergence-weighted spill cost, greedy SSA coloring with divergence-aware victim selection, 4-path spill codegen (remat / uniform VGPR / divergent VGPR / SGPR), post-RA phi elimination with free coalescing.

All static memory (~30 MB), no malloc. ~1,300 lines of C99. Operates on SSA form before phi elimination.

Enabled via: barracuda --ssa-ra

90/91 tests pass (1 skipped, unchanged from baseline).
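A divergence-weighted spill cost of the kind described above might look like the following sketch. It is not the allocator's actual code: `vreg_info_t`, `loop_weight`, `spill_cost`, and `pick_victim` are hypothetical names, the 10^depth loop weight is an assumed heuristic, and treating rematerializable values as free to spill is a simplification. Only the 64:1 divergent-to-uniform ratio comes from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAVE_SIZE 64u  /* lanes per wavefront on Wave64 */

/* Hypothetical per-virtual-register summary used for spill decisions. */
typedef struct {
    uint32_t num_uses;         /* static count of defs and uses */
    uint32_t loop_depth;       /* deepest loop nesting of any use */
    int      divergent;        /* nonzero if the value may differ per lane */
    int      rematerializable; /* nonzero if it can be recomputed instead */
} vreg_info_t;

/* Assumed heuristic: each loop level multiplies the weight by 10 (capped). */
static uint64_t loop_weight(uint32_t depth)
{
    uint64_t w = 1;
    if (depth > 6)
        depth = 6;
    while (depth--)
        w *= 10;
    return w;
}

/* Estimated scratch dwords moved if this register is spilled. */
static uint64_t spill_cost(const vreg_info_t *v)
{
    assert(v != NULL);
    if (v->rematerializable)
        return 0; /* simplification: recomputing touches no scratch */
    uint64_t dwords = v->divergent ? WAVE_SIZE : 1u;
    return (uint64_t)v->num_uses * loop_weight(v->loop_depth) * dwords;
}

/* Return the index of the cheapest register to spill, or -1 if n == 0. */
static int pick_victim(const vreg_info_t *regs, size_t n)
{
    int best = -1;
    uint64_t best_cost = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t c = spill_cost(&regs[i]);
        if (best < 0 || c < best_cost) {
            best = (int)i;
            best_cost = c;
        }
    }
    return best;
}
```

Under this model a uniform value with the same uses and loop depth as a divergent one is always the preferred victim, which is the behavior the 64:1 ratio is meant to produce.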
CI will fail; this is a draft :-)