Add samplex: GPU PC sampling tool by mawad-amd · Pull Request #90 · AMDResearch/intellikit

mawad-amd · 2026-03-25T10:35:43Z

Summary

- Adds samplex, a new IntelliKit tool that wraps rocprofv3 PC sampling to find instruction-level GPU hotspots and stall reasons
- Supports two sampling methods: stochastic (hardware-based, cycle-accurate, MI300+) and host_trap (software-based, time-based, MI200+)
- Provides a CLI (samplex), a Python API (from samplex import Samplex), and an MCP server (samplex-mcp)

What it does

Samplex answers: "Where is my kernel stuck and why?"

Given a GPU application, it runs rocprofv3 PC sampling and reports:

- Per-kernel instruction hotspots (which opcodes the GPU spends the most time on)
- Stall reasons (WAITCNT, BARRIER_WAIT, ALU_DEPENDENCY, etc.), stochastic only
- Issued vs stalled breakdown per instruction, stochastic only
- Exec mask divergence and wave counts
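
The report described above amounts to an aggregation over PC samples. The following is a minimal sketch of that idea, not samplex's actual implementation: the `Sample` record, its fields (`kernel`, `opcode`, `issued`, `stall_reason`), the `summarize` helper, and the sample data are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical PC sample record; the field names are assumptions."""
    kernel: str
    opcode: str
    issued: bool                  # True if the wave issued an instruction
    stall_reason: Optional[str]   # e.g. "WAITCNT"; None when issued


def summarize(samples):
    """Group samples by (kernel, opcode) and count issued vs stalled."""
    stats = defaultdict(lambda: {"total": 0, "issued": 0, "stalls": Counter()})
    for s in samples:
        entry = stats[(s.kernel, s.opcode)]
        entry["total"] += 1
        if s.issued:
            entry["issued"] += 1
        elif s.stall_reason is not None:
            entry["stalls"][s.stall_reason] += 1
    # Hottest instructions first: more samples means more time spent there.
    return sorted(stats.items(), key=lambda kv: kv[1]["total"], reverse=True)


# Illustrative data only, not real profiler output.
samples = [
    Sample("gemm", "s_waitcnt", False, "WAITCNT"),
    Sample("gemm", "s_waitcnt", False, "WAITCNT"),
    Sample("gemm", "s_waitcnt", True, None),
    Sample("gemm", "v_mfma_f32", True, None),
]

for (kernel, opcode), e in summarize(samples):
    stalled = e["total"] - e["issued"]
    print(f"{kernel} {opcode}: samples={e['total']} issued={e['issued']} stalled={stalled}")
```

Because PC sampling is statistical, the sample count per instruction serves as a proxy for time spent there, which is why the sketch sorts by total samples.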

Files

| Path | Description |
| --- | --- |
| `samplex/pyproject.toml` | Package config, entry points |
| `samplex/src/samplex/api.py` | High-level API (`Samplex.sample()`) |
| `samplex/src/samplex/profiler/rocprof_wrapper.py` | rocprofv3 wrapper, CSV parsing |
| `samplex/src/samplex/cli/main.py` | CLI with `profile` and `list-configs` commands |
| `samplex/src/samplex/mcp/server.py` | FastMCP server with `pc_sample` tool |
| `samplex/src/samplex/logger.py` | Logger following IntelliKit patterns |
| `samplex/skill/SKILL.md` | Claude Code skill definition |
| `samplex/tests/unit/test_api.py` | 10 unit tests for analysis logic |
| `samplex/tests/unit/test_rocprof_wrapper.py` | 6 unit tests for CSV parsing |

Test plan

- [x] 16 unit tests pass (no GPU required)
- [x] E2E tested with stochastic method on MI355X (gfx950): 16K+ samples
- [x] E2E tested with host_trap method on MI355X: 416 samples at interval=1000ns
- [x] Both methods produce correct output for torch.mm GEMM kernel
- [ ] Test on MI300X (gfx942)
- [ ] Test host_trap on MI200 series

🤖 Generated with Claude Code

Samplex provides statistical instruction-level profiling via rocprofv3 PC sampling. It supports both the host_trap (time-based, MI200+) and stochastic (cycle-based with stall reasons, MI300+) methods, and includes a CLI, Python API, MCP server, and unit tests following the existing IntelliKit tool patterns (metrix, linex, etc.).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Simplify samplex to use only stochastic (hardware-based) sampling. Stochastic sampling provides stall reasons, instruction types, wave counts, and zero sampling skid. It requires MI300+, which is our target anyway. Removes the --method and --unit CLI flags. All fields (wave_issued, stall_reason, instruction_type, wave_count) are now always present.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Low intervals (256-1024) cause rocprofv3 to silently drop all samples due to overhead. An interval of 65536 gives ~76K samples on typical workloads while keeping overhead reasonable. Users can lower it to 4096 for more samples.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Support both PC sampling methods:
- stochastic (default): cycle-accurate, MI300+, provides stall reasons
- host_trap: time-based, MI200+, broader GPU support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…EADME
- Example script now uses format_text_output() instead of manually iterating over API results and printing custom output
- Moved the full example output from the main README into the example README where it belongs
- Main README now links to the example directory
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Match the pattern used by linex and metrix examples: iterate over results objects and print fields directly. The API data objects have clear names (kernel.name, kernel.issued_pct, hotspot.opcode, etc.) that are readable by both humans and LLMs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The expected output section was still showing the old format_text_output format. Updated to match what the script actually prints: opcodes with [issued=, stalled=] tags, no global breakdown section.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
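
As a rough illustration of printing fields directly off result objects, here is a minimal sketch. The `Kernel` and `Hotspot` dataclasses are stand-ins, not the samplex classes: only `kernel.name`, `kernel.issued_pct`, and `hotspot.opcode` come from the commit messages, while the `issued` and `stalled` fields, the `hotspots` attribute, the `render` helper, and the exact line layout are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hotspot:
    opcode: str
    issued: int    # hypothetical field: samples where the wave issued
    stalled: int   # hypothetical field: samples where the wave was stalled


@dataclass
class Kernel:
    name: str
    issued_pct: float
    hotspots: List[Hotspot] = field(default_factory=list)  # assumed attribute


def render(kernels):
    """Build report lines by reading fields directly off the result objects."""
    lines = []
    for kernel in kernels:
        lines.append(f"{kernel.name} (issued {kernel.issued_pct:.1f}%)")
        for h in kernel.hotspots:
            lines.append(f"  {h.opcode} [issued={h.issued}, stalled={h.stalled}]")
    return lines


# Illustrative data only, not real profiler output.
kernels = [Kernel("gemm_kernel", 25.0, [Hotspot("s_waitcnt", 1, 3)])]
print("\n".join(render(kernels)))
```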

Stall reasons are already available on each InstructionHotspot via its stall_reasons dict. The kernel-level aggregation was redundant processing; consumers can iterate and aggregate if needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
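
For consumers that still want a kernel-level view, the aggregation is a few lines. This is a minimal sketch, assuming `stall_reasons` maps a stall-reason name to a sample count; the `InstructionHotspot` dataclass below is a stand-in that models only the fields used here, and `kernel_stall_totals` is a hypothetical helper, not part of samplex.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstructionHotspot:
    """Stand-in for the API object; only the fields used here are modeled."""
    opcode: str
    stall_reasons: Dict[str, int] = field(default_factory=dict)


def kernel_stall_totals(hotspots: List[InstructionHotspot]) -> Counter:
    """Sum per-instruction stall counts into one kernel-level tally."""
    totals = Counter()
    for h in hotspots:
        totals.update(h.stall_reasons)  # Counter.update adds counts per key
    return totals


# Illustrative data only, not real profiler output.
hotspots = [
    InstructionHotspot("s_waitcnt", {"WAITCNT": 40, "ALU_DEPENDENCY": 2}),
    InstructionHotspot("s_barrier", {"BARRIER_WAIT": 15}),
    InstructionHotspot("v_add_f32", {"ALU_DEPENDENCY": 5}),
]

for reason, count in kernel_stall_totals(hotspots).most_common():
    print(f"{reason}: {count}")
```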

- Example now shows empty_instruction_count (holes) when > 0
- MCP server now returns instruction (full text), stall_reasons, instruction_types, and empty_instruction_count per kernel
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- README.md: add samplex to tools table, MCP config, install examples
- AGENTS.md: add samplex to tool descriptions, build commands, testing, package layout, MCP servers, skills, CI sections
- install/tools/install.sh: add samplex to ALL_TOOLS
- install/skills/install.sh: add samplex to TOOLS
- intellikit-ci-test.yml: add samplex to change detection and test matrix
- intellikit-pytest.yml: add samplex to change detection and test matrix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rocprofv3's --kernel-include-regex only affects counter-collection and thread-trace data, not PC sampling. Move filtering to the API layer where it's applied as a regex match against kernel names after samples are grouped by dispatch ID.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
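
The post-hoc filtering described here can be sketched as follows. This is an illustration under stated assumptions rather than the code in `api.py`: the sample dictionaries, their keys (`dispatch_id`, `kernel_name`, `pc`), and the `group_and_filter` function are hypothetical, and the sketch uses `re.search` (match anywhere in the name), which may differ from the matching semantics samplex actually applies.

```python
import re
from collections import defaultdict


def group_and_filter(samples, kernel_filter=None):
    """Group samples by dispatch ID, then keep dispatches whose kernel name matches."""
    by_dispatch = defaultdict(list)
    for s in samples:
        by_dispatch[s["dispatch_id"]].append(s)

    if kernel_filter is None:
        return dict(by_dispatch)

    pattern = re.compile(kernel_filter)
    return {
        dispatch_id: group
        for dispatch_id, group in by_dispatch.items()
        # Assumption: every sample in a dispatch belongs to the same kernel.
        if pattern.search(group[0]["kernel_name"])
    }


# Illustrative data only, not real profiler output.
samples = [
    {"dispatch_id": 1, "kernel_name": "gemm_kernel", "pc": 0x100},
    {"dispatch_id": 1, "kernel_name": "gemm_kernel", "pc": 0x104},
    {"dispatch_id": 2, "kernel_name": "elementwise_add", "pc": 0x200},
]

kept = group_and_filter(samples, kernel_filter=r"gemm")
print(sorted(kept))
```

Filtering after grouping means all samples are still collected and parsed; the filter only narrows what is reported.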

mawad-amd and others added 16 commits March 25, 2026 01:57

Add host_trap sampling method alongside stochastic

1757a41

Support both PC sampling methods:
- stochastic (default): cycle-accurate, MI300+, provides stall reasons
- host_trap: time-based, MI200+, broader GPU support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add MIT license headers to all samplex files

3525504

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add example and fix lint issues

303cc64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add README with example output and example README

c7d29b0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Apply ruff formatting to pass CI

bcaaf38

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update example README with verified output from Vultr

ceb3af8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mawad-amd wants to merge 16 commits into main from muhaawad/samplex
