Add samplex: GPU PC sampling tool #90

Draft
mawad-amd wants to merge 16 commits into main from muhaawad/samplex

@mawad-amd
Member

Summary

  • Adds samplex, a new IntelliKit tool that wraps rocprofv3 PC sampling to find instruction-level GPU hotspots and stall reasons
  • Supports two sampling methods: stochastic (hardware-based, cycle-accurate, MI300+) and host_trap (software-based, time-based, MI200+)
  • Provides a CLI (samplex), a Python API (from samplex import Samplex), and an MCP server (samplex-mcp)

What it does

Samplex answers: "Where is my kernel stuck and why?"

Given a GPU application, it runs rocprofv3 PC sampling and reports:

  • Per-kernel instruction hotspots (which opcodes the GPU spends the most time on)
  • Stall reasons (WAITCNT, BARRIER_WAIT, ALU_DEPENDENCY, etc.) — stochastic only
  • Issued vs stalled breakdown per instruction — stochastic only
  • Exec mask divergence and wave counts
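
The report items above can be pictured as a plain aggregation over raw PC samples. The following is a minimal sketch under stated assumptions, not samplex's actual implementation: the `Sample` record and its field names (`kernel`, `opcode`, `issued`, `stall_reason`, `exec_mask`) are hypothetical stand-ins for whatever rocprofv3 emits, and a 64-lane wavefront is assumed for the exec mask.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical PC sample record; field names are illustrative only."""
    kernel: str
    opcode: str
    issued: bool                  # True if the wave issued when it was sampled
    stall_reason: Optional[str]   # e.g. "WAITCNT"; None when issued
    exec_mask: int                # bitmask of active lanes (assumes wave64)


def summarize(samples):
    """Group samples by (kernel, opcode) and compute per-instruction stats."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.kernel, s.opcode)].append(s)

    total = len(samples)
    rows = []
    for (kernel, opcode), group in groups.items():
        n = len(group)
        issued = sum(1 for s in group if s.issued)
        stalls = Counter(s.stall_reason for s in group if not s.issued)
        # Average number of set bits in the exec mask = average active lanes.
        lanes = sum(bin(s.exec_mask).count("1") for s in group) / n
        rows.append({
            "kernel": kernel,
            "opcode": opcode,
            "samples": n,
            "share_pct": 100.0 * n / total,
            "issued_pct": 100.0 * issued / n,
            "stalled_pct": 100.0 * (n - issued) / n,
            "stall_reasons": dict(stalls),
            "avg_active_lanes": lanes,
        })
    # Hottest instructions first.
    return sorted(rows, key=lambda r: r["samples"], reverse=True)


FULL_MASK = (1 << 64) - 1
demo = [
    Sample("gemm", "s_waitcnt", False, "WAITCNT", FULL_MASK),
    Sample("gemm", "s_waitcnt", False, "WAITCNT", FULL_MASK),
    Sample("gemm", "s_waitcnt", True, None, FULL_MASK),
    Sample("gemm", "v_fma_f32", True, None, FULL_MASK >> 32),
]
hot = summarize(demo)
```

In this toy input, `s_waitcnt` collects three of the four samples and so sorts first, while the `v_fma_f32` sample runs with only half of its lanes active, which is the kind of divergence the exec mask exposes.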

Files

| Path | Description |
| --- | --- |
| samplex/pyproject.toml | Package config, entry points |
| samplex/src/samplex/api.py | High-level API (Samplex.sample()) |
| samplex/src/samplex/profiler/rocprof_wrapper.py | rocprofv3 wrapper, CSV parsing |
| samplex/src/samplex/cli/main.py | CLI with profile and list-configs commands |
| samplex/src/samplex/mcp/server.py | FastMCP server with pc_sample tool |
| samplex/src/samplex/logger.py | Logger following IntelliKit patterns |
| samplex/skill/SKILL.md | Claude Code skill definition |
| samplex/tests/unit/test_api.py | 10 unit tests for analysis logic |
| samplex/tests/unit/test_rocprof_wrapper.py | 6 unit tests for CSV parsing |

Test plan

  • 16 unit tests pass (no GPU required)
  • E2E tested with stochastic method on MI355X (gfx950) — 16K+ samples
  • E2E tested with host_trap method on MI355X — 416 samples at interval=1000ns
  • Both methods produce correct output for torch.mm GEMM kernel
  • Test on MI300X (gfx942)
  • Test host_trap on MI200 series

🤖 Generated with Claude Code

mawad-amd and others added 16 commits March 25, 2026 01:57
Samplex provides statistical instruction-level profiling via rocprofv3
PC sampling. Supports both host_trap (time-based, MI200+) and stochastic
(cycle-based with stall reasons, MI300+) methods.

Includes CLI, Python API, MCP server, and unit tests following the
existing IntelliKit tool patterns (metrix/linex/etc).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Simplify samplex to only use stochastic (hardware-based) sampling.
Stochastic provides stall reasons, instruction types, wave counts,
and zero sampling skid. Requires MI300+ which is our target anyway.

Removes --method and --unit CLI flags. All fields (wave_issued,
stall_reason, instruction_type, wave_count) are now always present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Low intervals (256-1024) cause rocprofv3 to silently drop all samples
due to overhead. 65536 gives ~76K samples on typical workloads while
keeping overhead reasonable. Users can lower to 4096 for more samples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
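
The interval trade-off in the commit message above can be put into rough numbers. This is only back-of-the-envelope arithmetic: it assumes the sample count scales inversely with the interval, which the commit message does not state, and it takes the ~76K figure at 65536 as the single reference point. The function and constant names are hypothetical.

```python
DEFAULT_INTERVAL = 65536     # default chosen in the commit message
BASELINE_SAMPLES = 76_000    # "~76K samples on typical workloads" at the default
RISKY_MAX_INTERVAL = 1024    # intervals of 256-1024 were reported to drop all samples


def estimate_samples(interval: int) -> int:
    """Estimate the sample count, assuming it scales as 1/interval (an assumption)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return round(BASELINE_SAMPLES * DEFAULT_INTERVAL / interval)


def is_risky(interval: int) -> bool:
    """Flag intervals in the range where samples were reported to be dropped."""
    return interval <= RISKY_MAX_INTERVAL


estimate_at_4096 = estimate_samples(4096)
```

Under that linear assumption, dropping from 65536 to 4096 is a 16x reduction in interval and would yield roughly 16x the samples; real workloads may deviate once profiling overhead grows.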
Support both PC sampling methods:
- stochastic (default): cycle-accurate, MI300+, provides stall reasons
- host_trap: time-based, MI200+, broader GPU support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
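
A caller that wants a sensible default could pick the method from the GPU architecture string. The sketch below is an illustration, not part of samplex: `pick_method` is a hypothetical helper, the gfx942 and gfx950 entries come from the test plan in this PR, and mapping the MI200 series to gfx90a is an assumption added here.

```python
# Architectures where stochastic sampling is expected to work (MI300+).
STOCHASTIC_ARCHS = {"gfx942", "gfx950"}   # MI300X and MI355X, per the test plan
# Architectures assumed to support only host_trap (MI200 series).
HOST_TRAP_ONLY_ARCHS = {"gfx90a"}         # assumption: MI200 series


def pick_method(gfx_arch: str) -> str:
    """Return "stochastic" or "host_trap" for a gfx architecture string."""
    # Strip feature suffixes such as ":sramecc+:xnack-".
    base = gfx_arch.split(":")[0].strip().lower()
    if base in STOCHASTIC_ARCHS:
        return "stochastic"
    if base in HOST_TRAP_ONLY_ARCHS:
        return "host_trap"
    raise ValueError(f"architecture {base!r} is not covered by this sketch")


method_mi300x = pick_method("gfx942")
method_mi200 = pick_method("gfx90a:sramecc+:xnack-")
```

Raising on unknown architectures, rather than guessing, keeps the helper from silently selecting a method that the hardware may not support.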
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…EADME

- Example script now uses format_text_output() instead of manually
  iterating over API results and printing custom output
- Moved the full example output from the main README into the example
  README where it belongs
- Main README now links to the example directory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the pattern used by linex and metrix examples: iterate over
results objects and print fields directly. The API data objects have
clear names (kernel.name, kernel.issued_pct, hotspot.opcode, etc.)
that are readable by both humans and LLMs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
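
The iterate-and-print pattern described in the commit above might look like the following. This is a sketch with stand-in dataclasses rather than the real samplex result types: `kernel.name`, `kernel.issued_pct`, and `hotspot.opcode` are named in the commit message, while `kernel.hotspots`, `hotspot.issued_pct`, `hotspot.stalled_pct`, and the exact line layout are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hotspot:
    """Stand-in for a per-instruction result (fields partly hypothetical)."""
    opcode: str
    issued_pct: float
    stalled_pct: float


@dataclass
class Kernel:
    """Stand-in for a per-kernel result (the hotspots field is hypothetical)."""
    name: str
    issued_pct: float
    hotspots: list = field(default_factory=list)


def render(kernels, top=3):
    """Build printable lines by reading fields directly off the result objects."""
    lines = []
    for kernel in kernels:
        lines.append(f"{kernel.name}: issued {kernel.issued_pct:.1f}%")
        for hotspot in kernel.hotspots[:top]:
            lines.append(
                f"  {hotspot.opcode} "
                f"[issued={hotspot.issued_pct:.1f}%, stalled={hotspot.stalled_pct:.1f}%]"
            )
    return lines


demo_kernels = [
    Kernel("gemm_kernel", 40.0, [Hotspot("s_waitcnt", 10.0, 90.0)]),
]
report_lines = render(demo_kernels)
print("\n".join(report_lines))
```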
The expected output section was still showing the old format_text_output
format. Updated to match what the script actually prints: opcodes with
[issued=, stalled=] tags, no global breakdown section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stall reasons are already available on each InstructionHotspot via
stall_reasons dict. The kernel-level aggregation was redundant
processing — let consumers iterate and aggregate if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
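
With the kernel-level aggregation removed, a consumer who still wants per-kernel stall totals can sum the per-hotspot dicts. A minimal sketch, assuming `stall_reasons` maps a reason name to a sample count (the commit message does not say whether the values are counts or percentages); the `InstructionHotspot` class below is a reduced stand-in, not the real one.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InstructionHotspot:
    """Reduced stand-in carrying only the fields this sketch needs."""
    opcode: str
    stall_reasons: dict = field(default_factory=dict)  # assumed: reason -> count


def kernel_stall_totals(hotspots) -> Counter:
    """Sum stall-reason counts across all hotspots of one kernel."""
    totals = Counter()
    for hotspot in hotspots:
        totals.update(hotspot.stall_reasons)  # Counter.update adds counts
    return totals


demo_hotspots = [
    InstructionHotspot("s_waitcnt", {"WAITCNT": 5}),
    InstructionHotspot("s_barrier", {"BARRIER_WAIT": 2, "WAITCNT": 2}),
    InstructionHotspot("v_fma_f32"),  # never observed stalled
]
stall_totals = kernel_stall_totals(demo_hotspots)
```

`stall_totals.most_common(1)` then gives the dominant stall reason for the kernel.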
- Example now shows empty_instruction_count (holes) when > 0
- MCP server now returns instruction (full text), stall_reasons,
  instruction_types, and empty_instruction_count per kernel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- README.md: add samplex to tools table, MCP config, install examples
- AGENTS.md: add samplex to tool descriptions, build commands, testing,
  package layout, MCP servers, skills, CI sections
- install/tools/install.sh: add samplex to ALL_TOOLS
- install/skills/install.sh: add samplex to TOOLS
- intellikit-ci-test.yml: add samplex to change detection and test matrix
- intellikit-pytest.yml: add samplex to change detection and test matrix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rocprofv3's --kernel-include-regex only affects counter-collection and
thread-trace data, not PC sampling. Move filtering to the API layer
where it's applied as a regex match against kernel names after samples
are grouped by dispatch ID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
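
The post-hoc filtering described in the commit above can be sketched as two steps: group samples by dispatch ID, then keep the groups whose kernel name matches the regex. The function names, the `(dispatch_id, pc)` tuple layout, and the use of `re.search` (rather than a full match) are assumptions for illustration, not the actual samplex code.

```python
import re
from collections import defaultdict


def group_by_dispatch(samples):
    """Group (dispatch_id, pc) sample tuples by dispatch ID."""
    groups = defaultdict(list)
    for dispatch_id, pc in samples:
        groups[dispatch_id].append(pc)
    return dict(groups)


def filter_kernels(groups, kernel_names, pattern):
    """Keep only dispatches whose kernel name matches the regex pattern."""
    regex = re.compile(pattern)
    return {
        dispatch_id: pcs
        for dispatch_id, pcs in groups.items()
        if regex.search(kernel_names.get(dispatch_id, ""))
    }


demo_samples = [(1, 0x100), (1, 0x104), (2, 0x200), (3, 0x300)]
demo_names = {1: "gemm_kernel_f16", 2: "elementwise_add", 3: "gemm_kernel_f32"}
kept = filter_kernels(group_by_dispatch(demo_samples), demo_names, r"^gemm_")
```

Filtering after grouping means the profiler still collects samples for every kernel; the regex only narrows what is reported.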