
Add distributed launcher support for linex and metrix #87

Open
mawad-amd wants to merge 16 commits into main from muhaawad/distributed-launchers

Conversation

@mawad-amd
Member

Summary

  • Add distributed profiling support to both linex and metrix with an explicit launcher parameter that ensures the correct command order (launcher rocprofv3 ... -- app instead of rocprofv3 ... -- launcher app)
  • Detect rank metadata (global_rank, local_rank, world_size, hostname, launcher) from environment variables set by torchrun, mpirun, srun, and horovodrun
  • Linex: rank-scoped output directories (rank0000/, rank0001/, ...), per-rank RankProfile objects, MCP per-rank hotspots
  • Metrix: rank metadata in ProfileResult/KernelResults/ProfilingResults, rank-suffixed output files (results.rank0003.json), --launcher CLI flag, rank columns in CSV/JSON/text output
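The env-based rank detection described above can be sketched as follows. The field names (`global_rank`, `local_rank`, `world_size`, `hostname`, `launcher`) and the `DistributedContext` name come from this PR; the environment-variable mapping is the standard one each launcher sets, and `detect_context` is an illustrative helper, not the PR's actual implementation:

```python
# Hedged sketch of launcher rank detection from environment variables.
# The env-var names are the conventional ones per launcher; the real
# module in this PR may detect more or differently.
import os
import socket
from dataclasses import dataclass
from typing import Optional

# launcher name -> (rank var, local-rank var, world-size var)
_LAUNCHER_ENV = {
    "torchrun": ("RANK", "LOCAL_RANK", "WORLD_SIZE"),
    "mpirun": ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_LOCAL_RANK", "OMPI_COMM_WORLD_SIZE"),
    "srun": ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"),
    "horovodrun": ("HOROVOD_RANK", "HOROVOD_LOCAL_RANK", "HOROVOD_SIZE"),
}

@dataclass
class DistributedContext:
    global_rank: int = 0
    local_rank: int = 0
    world_size: int = 1
    hostname: str = ""
    launcher: Optional[str] = None

def detect_context(env=os.environ) -> DistributedContext:
    """Return rank metadata for the first launcher whose env vars are set."""
    for name, (rank_var, local_var, size_var) in _LAUNCHER_ENV.items():
        if rank_var in env:
            return DistributedContext(
                global_rank=int(env[rank_var]),
                local_rank=int(env.get(local_var, 0)),
                world_size=int(env.get(size_var, 1)),
                hostname=socket.gethostname(),
                launcher=name,
            )
    # No launcher detected: single-process defaults.
    return DistributedContext(hostname=socket.gethostname())
```

A non-distributed run falls through to the single-process defaults, which is what makes the rank suffixing below a no-op for ordinary profiling.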

API:

# Linex
profiler.profile(command="train.py", launcher="torchrun --nproc_per_node=8")

# Metrix Python
profiler.profile(command="train.py", launcher="mpirun -np 4")

# Metrix CLI
metrix profile --launcher "torchrun --nproc_per_node=8" -- train.py

Test plan

  • Unit tests for DistributedContext env detection (torchrun, mpirun, srun)
  • Unit tests for normalize_command_argv (string and sequence input)
  • Unit tests for apply_rank_suffix (with/without extension, single process)
  • Unit tests verifying correct command order with launcher (linex + metrix)
  • Unit tests verifying plain rocprofv3 without launcher
  • Unit tests for shlex parsing and rank field propagation in rocprof_wrapper
  • End-to-end smoke test with real GPUs and torchrun
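The rank-suffix behavior exercised by these tests can be illustrated with a minimal sketch. The `apply_rank_suffix` name and the `results.rank0003.json` pattern come from this PR; this body is an assumption about how it behaves (suffix before the extension, no suffix for a single process), not the actual code:

```python
# Illustrative sketch of rank-based output-file suffixing:
# "results.json" -> "results.rank0003.json" for rank 3 of a multi-rank
# run; single-process runs keep the original name.
from pathlib import Path

def apply_rank_suffix(filename: str, global_rank: int, world_size: int) -> str:
    if world_size <= 1:
        return filename  # single process: no suffix needed
    p = Path(filename)
    if p.suffix:
        return f"{p.stem}.rank{global_rank:04d}{p.suffix}"
    return f"{filename}.rank{global_rank:04d}"
```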

🤖 Generated with Claude Code

mawad-amd and others added 3 commits March 22, 2026 19:23
- Add DistributedContext dataclass and env detection for torchrun, mpirun, srun, horovodrun
- Linex: rank-scoped output dirs, RankProfile objects, MCP per-rank hotspots
- Metrix: rank metadata in ProfileResult/KernelResults/ProfilingResults, rank-suffixed output files
- CLI: argparse.REMAINDER for `-- launcher ...` syntax
- Both: normalize_command_argv with shlex, accept str | Sequence[str]
- Tests for distributed helpers, shlex parsing, rank field propagation

Note: command construction is still rocprofv3-wraps-launcher (wrong order).
Next step: fix to launcher-wraps-rocprofv3 for correct distributed profiling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, distributed commands like `torchrun --nproc_per_node=8 train.py`
produced `rocprofv3 ... -- torchrun --nproc_per_node=8 train.py` which is wrong.
rocprofv3 would profile the launcher process, not the per-rank GPU work.

Now we split the command into launcher args and app args, producing:
`torchrun --nproc_per_node=8 rocprofv3 ... -- train.py`

The launcher spawns N processes, each running rocprofv3 around the app.

Changes:
- Add split_launcher_command() to both distributed.py modules
- Handles torchrun, python -m torch.distributed.*, mpirun/mpiexec, srun, horovodrun
- Update linex/api.py and metrix/rocprof_wrapper.py to use launcher wrapping
- Add tests verifying correct command ordering for all launcher types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of trying to parse launcher flags from a combined command string
(fragile, requires hardcoded flag sets per launcher), let the user provide
the launcher and app commands separately:

  # Python API
  profiler.profile(command="train.py", launcher="torchrun --nproc_per_node=8")

  # Metrix CLI
  metrix profile --launcher "torchrun --nproc_per_node=8" -- train.py

This is unambiguous, works with any launcher (including custom ones), and
requires no flag-parsing maintenance.

- Remove split_launcher_command() and all _split_* helpers
- Add launcher parameter to Linex.profile(), Metrix.profile(),
  ROCProfV3Wrapper.profile(), CounterBackend.profile(), all backend
  implementations, CLI (--launcher flag), and MCP tools
- Update tests and READMEs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
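The separated launcher/app CLI interface above could be wired with argparse roughly like this. The `--launcher` flag and the `argparse.REMAINDER` target come from this PR; the parser layout is a simplified assumption (the real CLI has more subcommands and options):

```python
# Simplified sketch of the CLI parsing: --launcher takes a quoted launcher
# command, and everything after `--` is captured verbatim as the target.
import argparse

parser = argparse.ArgumentParser(prog="metrix")
sub = parser.add_subparsers(dest="cmd")
profile = sub.add_parser("profile")
profile.add_argument("--launcher", default=None,
                     help='launcher command, e.g. "torchrun --nproc_per_node=8"')
# REMAINDER gathers all trailing args without trying to parse them as flags.
profile.add_argument("target", nargs=argparse.REMAINDER)

args = parser.parse_args(
    ["profile", "--launcher", "torchrun --nproc_per_node=8", "--", "train.py"]
)
# Depending on the argparse version, the literal "--" may be kept in the
# remainder list; strip it defensively.
target = args.target[1:] if args.target[:1] == ["--"] else args.target
```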
Copilot AI review requested due to automatic review settings March 23, 2026 00:36
Contributor

Copilot AI left a comment


Pull request overview

Adds first-class “distributed launcher” support to Linex and Metrix so profiling commands can be invoked as launcher rocprofv3 ... -- app, and introduces rank metadata propagation/suffixing utilities to avoid output clobbering.

Changes:

  • Add distributed context detection + argv normalization helpers (shlex-based) for both Metrix and Linex.
  • Extend Metrix/Linex APIs, CLI, and MCP tools with an explicit launcher parameter and propagate rank metadata into result objects/output.
  • Add unit tests covering env detection, argv normalization, rank suffixing, and launcher command ordering.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Summary per file:

  • metrix/tests/unit/test_rocprof_wrapper.py: Adds tests for shlex parsing, rank metadata propagation, and launcher command ordering.
  • metrix/tests/unit/test_distributed.py: New unit tests for distributed helpers (env detection, argv normalization, rank suffixing).
  • metrix/src/metrix/utils/distributed.py: New distributed helper module (context detection, shlex argv normalization, rank suffixing).
  • metrix/src/metrix/profiler/rocprof_wrapper.py: Accepts launcher/env, uses shlex parsing, and annotates ProfileResult with distributed metadata.
  • metrix/src/metrix/mcp/server.py: Adds launcher param to MCP tool and includes rank metadata in the response.
  • metrix/src/metrix/cli/profile_cmd.py: Adds --launcher plumbing, remainder target parsing normalization, and rank-aware output formatting/suffixing.
  • metrix/src/metrix/cli/main.py: Adds --launcher flag and switches target parsing to argparse.REMAINDER.
  • metrix/src/metrix/backends/gfx942.py: Updates backend signatures to accept launcher (but currently not forwarded).
  • metrix/src/metrix/backends/gfx90a.py: Updates backend signatures to accept launcher (but currently not forwarded).
  • metrix/src/metrix/backends/gfx1201.py: Updates backend signatures to accept launcher (but currently not forwarded).
  • metrix/src/metrix/backends/base.py: Adds rank fields to ProfileResult, adds launcher to the API, and adds rank-prefixed aggregation keys.
  • metrix/src/metrix/api.py: Adds launcher support and rank metadata to ProfilingResults/KernelResults.
  • metrix/README.md: Documents distributed launcher usage and rank-suffixed outputs.
  • linex/tests/test_distributed_api.py: New tests for distributed helpers, rank-scoped output, deterministic UI dir choice, and launcher ordering.
  • linex/src/linex/mcp/server.py: Adds launcher plumbing + per-rank outputs to MCP responses (currently contains a syntax error).
  • linex/src/linex/distributed.py: New distributed helper module for Linex (context detection + argv normalization).
  • linex/src/linex/api.py: Adds distributed context tracking, rank-scoped output dirs, launcher support, and RankProfile aggregation.
  • linex/src/linex/__init__.py: Exports RankProfile in the public API.
  • linex/README.md: Documents distributed launcher usage and new distributed properties.
Comments suppressed due to low confidence (3)

metrix/src/metrix/backends/gfx942.py:107

  • _run_rocprof now takes launcher, but the implementation ignores it when calling ROCProfV3Wrapper.profile(...). This makes launcher a no-op for this backend. Pass launcher=launcher through to the wrapper call (and ensure callers forward it).
    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
    ) -> List[ProfileResult]:
        """Run rocprofv3 and return results (single pass only - base class handles multi-pass)"""
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        return wrapper.profile(command, counters, kernel_filter=kernel_filter, cwd=cwd)

metrix/src/metrix/backends/gfx1201.py:74

  • _run_rocprof accepts launcher but does not forward it to ROCProfV3Wrapper.profile(...), so --launcher has no effect for gfx1201. Pass launcher=launcher to the wrapper call (and ensure the base class forwards it when invoking _run_rocprof).
    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
        kernel_iteration_range: Optional[str] = None,
    ) -> List[ProfileResult]:
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        extra_counters_path = Path(__file__).parent / "counter_defs.yaml"

        return wrapper.profile(
            command=command,
            counters=counters,
            kernel_filter=kernel_filter,
            cwd=cwd,
            kernel_iteration_range=kernel_iteration_range,
            extra_counters_path=extra_counters_path if extra_counters_path.exists() else None,
            arch=self.device_specs.arch,
        )

metrix/src/metrix/backends/gfx90a.py:107

  • _run_rocprof now accepts a launcher parameter, but it isn’t passed down to ROCProfV3Wrapper.profile(...), so the launcher never affects the actual subprocess command. Pass launcher=launcher through to the wrapper call (and ensure the base class forwards it when invoking _run_rocprof).
    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
    ) -> List[ProfileResult]:
        """Run rocprofv3 and return results (single pass only - base class handles multi-pass)"""
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        return wrapper.profile(command, counters, kernel_filter=kernel_filter, cwd=cwd)



mawad-amd and others added 13 commits March 23, 2026 01:03
… Python 3.8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… batch path, clarify launcher semantics

- Move profile_command docstring above code (was displaced)
- Forward launcher param in CounterBackend recursive batch call
- Add Note sections to Linex.profile() and Metrix.profile() explaining
  that launcher is for mpirun-style use, and for torchrun the correct
  pattern is running metrix/linex under torchrun (not the reverse)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The launcher was being prepended before rocprofv3, producing commands like
`torchrun rocprofv3 ... -- app` which fails because torchrun expects Python
scripts. The correct structure is `rocprofv3 ... -- torchrun ... app` so
rocprofv3 traces the entire process tree including launcher-spawned workers.

Also fixes launcher not being forwarded through the metrix call chain:
- CounterBackend.profile() now passes launcher to _run_rocprof()
- All three backends (gfx942, gfx90a, gfx1201) now forward launcher to
  ROCProfV3Wrapper.profile()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When rocprofv3 wraps a multi-process launcher (e.g. torchrun), the
kernel trace CSV can contain rows with null timestamp fields. int(None)
raises TypeError, which wasn't caught. Add TypeError to the exception
handlers in both _parse_kernel_trace and _parse_output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
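The failure mode this commit fixes is easy to reproduce: `int()` raises TypeError for None and ValueError for a blank string, so both must be caught when parsing kernel-trace rows produced under a multi-process launcher. `parse_timestamp` below is an illustrative helper, not the PR's actual parsing code:

```python
# Minimal illustration of the fix: tolerate null/blank timestamp fields
# in kernel-trace CSV rows instead of crashing on int(None).
from typing import Optional

def parse_timestamp(value) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None  # null or blank field from a launcher-spawned row
```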
Update test assertions to expect rocprofv3 wrapping the launcher
(rocprofv3 ... -- torchrun ... app) instead of the reverse. Run
ruff format on touched files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a launcher (e.g. torchrun) is specified, instead of wrapping the
entire process tree under a single rocprofv3 instance (which merges all
rank counters), we now generate a Python wrapper script that the
launcher spawns per-worker. Each worker reads its RANK from the
environment and runs its own rocprofv3 with a rank-specific output
directory (rank_0/, rank_1/, etc.). After execution, the parent process
collects and parses each rank's output independently, populating
ProfileResult/RankProfile with correct per-rank metadata.

This ensures hardware counters are properly attributed to individual
ranks rather than being mixed across the distributed process tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes found during live testing on hpcfund MI250X with torchrun:

1. Remove spurious `python3` prefix from wrapper command — torchrun
   already invokes the script with Python, so including it caused
   torchrun to try to run `python3` as a Python script file.

2. Rewrite --input YAML per-rank to avoid rocprofv3 "conflicting value
   for output_directory" error — the YAML's output_directory field
   conflicted with the -d flag. The wrapper now creates a per-rank
   copy of the YAML with the rank-specific output directory.

3. Handle single-element argv containing spaces — torchrun passes
   quoted commands as one argv element via argparse.REMAINDER. Both
   normalize_command_argv() and the wrapper script now apply
   shlex.split() when a single-element list contains spaces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
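The argv-normalization fix in item 3 can be sketched as below. The `normalize_command_argv` name and the single-element-with-spaces rule come from this PR; the body is a simplified assumption about its behavior:

```python
# Sketch of argv normalization: accept a string or a sequence, and
# re-split a single element containing spaces, as happens when
# argparse.REMAINDER hands over one quoted command.
import shlex
from typing import Sequence, Union

def normalize_command_argv(command: Union[str, Sequence[str]]) -> list:
    if isinstance(command, str):
        return shlex.split(command)
    argv = list(command)
    if len(argv) == 1 and " " in argv[0]:
        return shlex.split(argv[0])  # quoted command passed as one element
    return argv
```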
…ision

The merge key for combining counter data across profiling passes was
(kernel_name, dispatch_id, run_id), which caused dispatches from
different ranks with the same kernel name and dispatch ID to collide
silently. In a 2-rank all-reduce, this meant rank 1's 330 dispatches
were merged into rank 0's, losing per-rank resolution.

Adding global_rank to the key tuple ensures each rank's dispatches
are tracked independently through the multi-pass merge pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
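The collision this commit describes is straightforward to demonstrate with toy dispatch records (simplified dicts here; the real merge pipeline operates on richer objects):

```python
# Toy illustration of the merge-key fix: without global_rank in the key,
# identical (kernel_name, dispatch_id, run_id) tuples from two ranks
# silently collide in a dict keyed by that tuple.
def merge_key(d: dict) -> tuple:
    return (d["kernel_name"], d["dispatch_id"], d["run_id"], d["global_rank"])

dispatches = [
    {"kernel_name": "all_reduce", "dispatch_id": 7, "run_id": 0, "global_rank": 0},
    {"kernel_name": "all_reduce", "dispatch_id": 7, "run_id": 0, "global_rank": 1},
]
merged = {merge_key(d): d for d in dispatches}  # both ranks survive
```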
The CLI was missing two things from PR #85:
- dispatch_index in mismatch JSON output (field existed on
  ArrayMismatch but wasn't serialized by the CLI)
- --atol, --rtol, --equal-nan flags (API had them but CLI
  only exposed legacy --tolerance)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>