WIP: MI300X kernarg debug — CDNA fixes, SNAP, AMD pool (partial)#60

Merged
Zaneham merged 3 commits into master from mi300x-kernarg-debug
Mar 13, 2026
Zaneham commented Mar 11, 2026

Graph coloring regalloc merge (PR#53) + 3 CDNA fixes (rw_ops ordering, VGPR floor, rw_ops kind mapping). Kernel dispatches without fault/hang on MI300X but all 26 kernarg params read as zero from GPU side.

SNAP instrumentation added (--snap flag). kernarg_size clipping fixed. AMD memory pool API types added but not yet wired into dispatch — leading theory is hsa_memory_allocate gives memory SMEM can't read on MI300X, need hsa_amd_memory_pool_allocate instead.

Debug AQL packet dump still present in bc_dispatch. Clean up after the pool alloc theory is tested.

CI will fail; this is a draft :-)

Zaneham marked this pull request as draft March 11, 2026 08:13

Standard hsa_memory_allocate from HSA kernarg region gives memory
the scalar unit cannot read on MI300X — all params seen as zero.
HIP uses hsa_amd_memory_pool_allocate internally. Do the same.

Adds pool discovery callback, 4 AMD dlsym symbols, switches
bc_dispatch to amd_palloc/amd_pfree. Debug AQL dump removed.
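
A minimal sketch of what a pool discovery callback of this kind might look like. It is not the code from this commit. The types are hand-declared mirrors of the HSA AMD extension types (as one would do when resolving symbols with dlsym), all names are hypothetical, and the numeric constants are assumptions to be checked against hsa_ext_amd.h. The callback has the shape expected by hsa_amd_agent_iterate_memory_pools and stops at the first global-segment pool that advertises kernarg support. `fake_get_info` is a test double standing in for hsa_amd_memory_pool_get_info so the selection logic can run without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-declared mirrors of HSA AMD extension types. Names are hypothetical
 * and the values are assumptions; verify them against hsa_ext_amd.h. */
typedef struct { uint64_t handle; } pool_t;   /* hsa_amd_memory_pool_t */
typedef int status_t;                         /* hsa_status_t */

enum { ST_SUCCESS = 0, ST_INFO_BREAK = 1, ST_ERROR = 0x1000 };
enum { INFO_SEGMENT = 0, INFO_GLOBAL_FLAGS = 1 };
enum { SEGMENT_GLOBAL = 0, SEGMENT_GROUP = 3 };
enum { FLAG_KERNARG_INIT = 1, FLAG_FINE_GRAINED = 2, FLAG_COARSE_GRAINED = 4 };

/* Shape of hsa_amd_memory_pool_get_info once resolved at runtime. */
typedef status_t (*pool_get_info_fn)(pool_t pool, int attribute, void *value);

typedef struct {
    pool_get_info_fn get_info;  /* dlsym'd entry point, or a test double */
    pool_t           kernarg_pool;
    int              found;
} pool_search_t;

/* Callback for the pool iterator: keep the first global pool whose flags
 * include KERNARG_INIT, then ask the iterator to stop. */
status_t find_kernarg_pool(pool_t pool, void *data)
{
    pool_search_t *s = data;
    uint32_t segment = 0, flags = 0;

    assert(s != NULL && s->get_info != NULL);
    if (s->get_info(pool, INFO_SEGMENT, &segment) != ST_SUCCESS)
        return ST_SUCCESS;                    /* skip pools we cannot query */
    if (segment != SEGMENT_GLOBAL)
        return ST_SUCCESS;
    if (s->get_info(pool, INFO_GLOBAL_FLAGS, &flags) != ST_SUCCESS)
        return ST_SUCCESS;
    if (!(flags & FLAG_KERNARG_INIT))
        return ST_SUCCESS;

    s->kernarg_pool = pool;
    s->found = 1;
    return ST_INFO_BREAK;
}

/* Test double: pool 1 is coarse-grained global, pool 2 is group (LDS),
 * pool 3 is fine-grained global with kernarg support. */
static status_t fake_get_info(pool_t pool, int attribute, void *value)
{
    static const uint32_t segment[3] = { SEGMENT_GLOBAL, SEGMENT_GROUP, SEGMENT_GLOBAL };
    static const uint32_t flags[3]   = { FLAG_COARSE_GRAINED, 0,
                                         FLAG_KERNARG_INIT | FLAG_FINE_GRAINED };

    if (pool.handle < 1 || pool.handle > 3) return ST_ERROR;
    if (attribute == INFO_SEGMENT)
        *(uint32_t *)value = segment[pool.handle - 1];
    else if (attribute == INFO_GLOBAL_FLAGS)
        *(uint32_t *)value = flags[pool.handle - 1];
    else
        return ST_ERROR;
    return ST_SUCCESS;
}

/* Walks the fake pools the way the runtime iterator would, stopping when
 * the callback returns anything other than success. Returns the handle of
 * the chosen pool, or 0 if none qualified. */
uint64_t pick_kernarg_pool_demo(void)
{
    pool_search_t s = { fake_get_info, { 0 }, 0 };
    for (uint64_t h = 1; h <= 3; h++) {
        pool_t p = { h };
        if (find_kernarg_pool(p, &s) != ST_SUCCESS) break;
    }
    return s.found ? s.kernarg_pool.handle : 0;
}
```

With the fake pools above, `pick_kernarg_pool_demo()` selects pool 3. In a real dispatch path the chosen pool would then be passed to hsa_amd_memory_pool_allocate, and the buffer released with hsa_amd_memory_pool_free.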

Zaneham added the enhancement label Mar 11, 2026

Eliminates 100% of VGPR spills on 654-line Monte Carlo transport
kernel (186 → 0). Total scratch traffic drops 78%. Instruction
count drops 28% (9,448 → 6,761).

The key insight from Sampaio et al. (2013, §6): on Wave64, spilling
a divergent VGPR costs 64 dwords of scratch (one per lane). Spilling a
uniform VGPR costs 1 dword via v_readfirstlane. The old allocator
treated all spills equally; the new one exploits the 64:1 cost ratio.
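
A minimal sketch of how a divergence-weighted spill cost could be expressed, assuming a per-register record with a use count, a loop depth, and a divergence bit. The struct, the function names, and the 10^depth loop weight (a common heuristic, capped here to avoid overflow) are illustrative assumptions, not the allocator's actual code. Only the 64:1 slot ratio comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAVE_SIZE 64u   /* lanes per wavefront on Wave64 */

typedef struct {
    uint32_t uses;        /* defs + uses that would need a store or reload */
    uint32_t loop_depth;  /* loop nesting depth of the hottest access */
    int      divergent;   /* nonzero if the value may differ across lanes */
} vreg_info_t;

/* Scratch dwords one spill slot occupies: a divergent value needs one
 * dword per lane, a uniform value needs a single dword. */
uint32_t spill_slot_dwords(int divergent)
{
    return divergent ? WAVE_SIZE : 1u;
}

/* Estimated cost of spilling v: accesses, weighted by loop depth, times
 * the scratch traffic per access. The 10^depth weight is an assumption. */
uint64_t spill_cost(const vreg_info_t *v)
{
    assert(v != NULL);
    uint32_t depth = v->loop_depth > 6 ? 6 : v->loop_depth;
    uint64_t weight = 1;
    for (uint32_t i = 0; i < depth; i++)
        weight *= 10;
    return (uint64_t)v->uses * weight * spill_slot_dwords(v->divergent);
}

/* Victim selection: index of the cheapest candidate to spill. */
size_t pick_victim(const vreg_info_t *cands, size_t n)
{
    assert(cands != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (spill_cost(&cands[i]) < spill_cost(&cands[best]))
            best = i;
    return best;
}
```

Under this model two registers with identical use counts and loop depth differ in cost by exactly a factor of 64 when one is divergent and the other uniform, so the selector prefers uniform victims unless they are used far more heavily.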

Algorithm: Cooper et al. (2001) dominator tree, Braun & Hack (2009)
loop-depth weighting, SSA liveness with PHI-aware dataflow,
divergence-weighted spill cost, greedy SSA coloring with
divergence-aware victim selection, 4-path spill codegen
(remat / uniform VGPR / divergent VGPR / SGPR), post-RA
phi elimination with free coalescing.
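
As an illustration of the first stage in that pipeline, here is a minimal sketch of the iterative dominator computation from Cooper et al. (2001). It assumes blocks are already numbered in reverse postorder with block 0 as the entry, and it uses fixed-size static arrays in keeping with the no-malloc constraint. The `cfg_t` layout and the limits are hypothetical, not taken from the allocator.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64
#define MAX_PREDS  8
#define NO_IDOM    (-1)

/* CFG as predecessor lists. Assumption: blocks are numbered in reverse
 * postorder and block 0 is the entry. */
typedef struct {
    int n;                              /* number of blocks */
    int npred[MAX_BLOCKS];              /* predecessor count per block */
    int pred[MAX_BLOCKS][MAX_PREDS];    /* predecessor ids per block */
} cfg_t;

/* Walk two "fingers" up the partial dominator tree until they meet.
 * With reverse postorder numbering, a block's idom has a smaller number,
 * so the finger with the larger number is the one that moves. */
static int intersect(const int *idom, int a, int b)
{
    while (a != b) {
        while (a > b) a = idom[a];
        while (b > a) b = idom[b];
    }
    return a;
}

/* Fill idom[0..n-1] with immediate dominators; idom[0] is 0 by convention.
 * Blocks unreachable from the entry keep NO_IDOM. */
void compute_idom(const cfg_t *g, int *idom)
{
    assert(g != NULL && idom != NULL && g->n > 0 && g->n <= MAX_BLOCKS);

    for (int b = 0; b < g->n; b++)
        idom[b] = NO_IDOM;
    idom[0] = 0;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 1; b < g->n; b++) {        /* reverse postorder */
            int new_idom = NO_IDOM;
            for (int i = 0; i < g->npred[b]; i++) {
                int p = g->pred[b][i];
                if (idom[p] == NO_IDOM)
                    continue;                   /* predecessor not processed yet */
                new_idom = (new_idom == NO_IDOM)
                         ? p
                         : intersect(idom, p, new_idom);
            }
            if (new_idom != NO_IDOM && idom[b] != new_idom) {
                idom[b] = new_idom;
                changed = 1;
            }
        }
    }
}
```

For a diamond inside a loop (0→1, 1→2, 1→3, 2→4, 3→4, 4→1), this converges with block 1 as the immediate dominator of blocks 2, 3, and 4.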

All static memory (~30 MB), no malloc. ~1,300 lines of C99.
Operates on SSA form before phi elimination.
Enabled via: barracuda --ssa-ra

90/91 tests pass (1 skipped, unchanged from baseline).

Zaneham marked this pull request as ready for review March 13, 2026 14:00
Zaneham merged commit 86d89d1 into master Mar 13, 2026
3 checks passed