
Add distributed launcher support for linex and metrix #87

Open
mawad-amd wants to merge 16 commits into main from muhaawad/distributed-launchers

Conversation

@mawad-amd
Member

Summary

  • Add distributed profiling support to both linex and metrix with an explicit launcher parameter that ensures the correct command order (launcher rocprofv3 ... -- app instead of rocprofv3 ... -- launcher app)
  • Detect rank metadata (global_rank, local_rank, world_size, hostname, launcher) from environment variables set by torchrun, mpirun, srun, and horovodrun
  • Linex: rank-scoped output directories (rank0000/, rank0001/, ...), per-rank RankProfile objects, MCP per-rank hotspots
  • Metrix: rank metadata in ProfileResult/KernelResults/ProfilingResults, rank-suffixed output files (results.rank0003.json), --launcher CLI flag, rank columns in CSV/JSON/text output
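The env-based rank detection described above can be sketched as follows. The field names (`global_rank`, `local_rank`, `world_size`, `hostname`, `launcher`) and the `DistributedContext` name come from this PR; the environment-variable mapping is the standard one each launcher sets, and `detect_context` is an illustrative helper, not the PR's actual implementation:

```python
# Hedged sketch of launcher rank detection from environment variables.
# The env-var names are the conventional ones per launcher; the real
# module in this PR may detect more or differently.
import os
import socket
from dataclasses import dataclass
from typing import Optional

# launcher name -> (rank var, local-rank var, world-size var)
_LAUNCHER_ENV = {
    "torchrun": ("RANK", "LOCAL_RANK", "WORLD_SIZE"),
    "mpirun": ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_LOCAL_RANK", "OMPI_COMM_WORLD_SIZE"),
    "srun": ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"),
    "horovodrun": ("HOROVOD_RANK", "HOROVOD_LOCAL_RANK", "HOROVOD_SIZE"),
}

@dataclass
class DistributedContext:
    global_rank: int = 0
    local_rank: int = 0
    world_size: int = 1
    hostname: str = ""
    launcher: Optional[str] = None

def detect_context(env=os.environ) -> DistributedContext:
    """Return rank metadata for the first launcher whose env vars are set."""
    for name, (rank_var, local_var, size_var) in _LAUNCHER_ENV.items():
        if rank_var in env:
            return DistributedContext(
                global_rank=int(env[rank_var]),
                local_rank=int(env.get(local_var, 0)),
                world_size=int(env.get(size_var, 1)),
                hostname=socket.gethostname(),
                launcher=name,
            )
    # No launcher detected: single-process defaults.
    return DistributedContext(hostname=socket.gethostname())
```

A non-distributed run falls through to the single-process defaults, which is what makes the rank suffixing below a no-op for ordinary profiling.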

API:

# Linex
profiler.profile(command="train.py", launcher="torchrun --nproc_per_node=8")

# Metrix Python
profiler.profile(command="train.py", launcher="mpirun -np 4")

# Metrix CLI
metrix profile --launcher "torchrun --nproc_per_node=8" -- train.py

Test plan

  • Unit tests for DistributedContext env detection (torchrun, mpirun, srun)
  • Unit tests for normalize_command_argv (string and sequence input)
  • Unit tests for apply_rank_suffix (with/without extension, single process)
  • Unit tests verifying correct command order with launcher (linex + metrix)
  • Unit tests verifying plain rocprofv3 without launcher
  • Unit tests for shlex parsing and rank field propagation in rocprof_wrapper
  • End-to-end smoke test with real GPUs and torchrun
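The rank-suffix behavior exercised by these tests can be illustrated with a minimal sketch. The `apply_rank_suffix` name and the `results.rank0003.json` pattern come from this PR; this body is an assumption about how it behaves (suffix before the extension, no suffix for a single process), not the actual code:

```python
# Illustrative sketch of rank-based output-file suffixing:
# "results.json" -> "results.rank0003.json" for rank 3 of a multi-rank
# run; single-process runs keep the original name.
from pathlib import Path

def apply_rank_suffix(filename: str, global_rank: int, world_size: int) -> str:
    if world_size <= 1:
        return filename  # single process: no suffix needed
    p = Path(filename)
    if p.suffix:
        return f"{p.stem}.rank{global_rank:04d}{p.suffix}"
    return f"{filename}.rank{global_rank:04d}"
```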

🤖 Generated with Claude Code

mawad-amd and others added 3 commits March 22, 2026 19:23
- Add DistributedContext dataclass and env detection for torchrun, mpirun, srun, horovodrun
- Linex: rank-scoped output dirs, RankProfile objects, MCP per-rank hotspots
- Metrix: rank metadata in ProfileResult/KernelResults/ProfilingResults, rank-suffixed output files
- CLI: argparse.REMAINDER for `-- launcher ...` syntax
- Both: normalize_command_argv with shlex, accept str | Sequence[str]
- Tests for distributed helpers, shlex parsing, rank field propagation

Note: command construction is still rocprofv3-wraps-launcher (wrong order).
Next step: fix to launcher-wraps-rocprofv3 for correct distributed profiling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, distributed commands like `torchrun --nproc_per_node=8 train.py`
produced `rocprofv3 ... -- torchrun --nproc_per_node=8 train.py` which is wrong.
rocprofv3 would profile the launcher process, not the per-rank GPU work.

Now we split the command into launcher args and app args, producing:
`torchrun --nproc_per_node=8 rocprofv3 ... -- train.py`

The launcher spawns N processes, each running rocprofv3 around the app.

Changes:
- Add split_launcher_command() to both distributed.py modules
- Handles torchrun, python -m torch.distributed.*, mpirun/mpiexec, srun, horovodrun
- Update linex/api.py and metrix/rocprof_wrapper.py to use launcher wrapping
- Add tests verifying correct command ordering for all launcher types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of trying to parse launcher flags from a combined command string
(fragile, requires hardcoded flag sets per launcher), let the user provide
the launcher and app commands separately:

  # Python API
  profiler.profile(command="train.py", launcher="torchrun --nproc_per_node=8")

  # Metrix CLI
  metrix profile --launcher "torchrun --nproc_per_node=8" -- train.py

This is unambiguous, works with any launcher (including custom ones), and
requires no flag-parsing maintenance.

- Remove split_launcher_command() and all _split_* helpers
- Add launcher parameter to Linex.profile(), Metrix.profile(),
  ROCProfV3Wrapper.profile(), CounterBackend.profile(), all backend
  implementations, CLI (--launcher flag), and MCP tools
- Update tests and READMEs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
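The separated launcher/app CLI interface above could be wired with argparse roughly like this. The `--launcher` flag and the `argparse.REMAINDER` target come from this PR; the parser layout is a simplified assumption (the real CLI has more subcommands and options):

```python
# Simplified sketch of the CLI parsing: --launcher takes a quoted launcher
# command, and everything after `--` is captured verbatim as the target.
import argparse

parser = argparse.ArgumentParser(prog="metrix")
sub = parser.add_subparsers(dest="cmd")
profile = sub.add_parser("profile")
profile.add_argument("--launcher", default=None,
                     help='launcher command, e.g. "torchrun --nproc_per_node=8"')
# REMAINDER gathers all trailing args without trying to parse them as flags.
profile.add_argument("target", nargs=argparse.REMAINDER)

args = parser.parse_args(
    ["profile", "--launcher", "torchrun --nproc_per_node=8", "--", "train.py"]
)
# Depending on the argparse version, the literal "--" may be kept in the
# remainder list; strip it defensively.
target = args.target[1:] if args.target[:1] == ["--"] else args.target
```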
Copilot AI review requested due to automatic review settings March 23, 2026 00:36
Contributor

Copilot AI left a comment


Pull request overview

Adds first-class “distributed launcher” support to Linex and Metrix so profiling commands can be invoked as launcher rocprofv3 ... -- app, and introduces rank metadata propagation/suffixing utilities to avoid output clobbering.

Changes:

  • Add distributed context detection + argv normalization helpers (shlex-based) for both Metrix and Linex.
  • Extend Metrix/Linex APIs, CLI, and MCP tools with an explicit launcher parameter and propagate rank metadata into result objects/output.
  • Add unit tests covering env detection, argv normalization, rank suffixing, and launcher command ordering.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Summary per file:

  • metrix/tests/unit/test_rocprof_wrapper.py: Adds tests for shlex parsing, rank metadata propagation, and launcher command ordering.
  • metrix/tests/unit/test_distributed.py: New unit tests for distributed helpers (env detection, argv normalization, rank suffixing).
  • metrix/src/metrix/utils/distributed.py: New distributed helper module (context detection, shlex argv normalization, rank suffixing).
  • metrix/src/metrix/profiler/rocprof_wrapper.py: Accepts launcher/env, uses shlex parsing, and annotates ProfileResult with distributed metadata.
  • metrix/src/metrix/mcp/server.py: Adds launcher param to MCP tool and includes rank metadata in the response.
  • metrix/src/metrix/cli/profile_cmd.py: Adds --launcher plumbing, remainder target parsing normalization, and rank-aware output formatting/suffixing.
  • metrix/src/metrix/cli/main.py: Adds --launcher flag and switches target parsing to argparse.REMAINDER.
  • metrix/src/metrix/backends/gfx942.py: Updates backend signatures to accept launcher (but currently not forwarded).
  • metrix/src/metrix/backends/gfx90a.py: Updates backend signatures to accept launcher (but currently not forwarded).
  • metrix/src/metrix/backends/gfx1201.py: Updates backend signatures to accept launcher (but currently not forwarded).
  • metrix/src/metrix/backends/base.py: Adds rank fields to ProfileResult, adds launcher to the API, and adds rank-prefixed aggregation keys.
  • metrix/src/metrix/api.py: Adds launcher support and rank metadata to ProfilingResults/KernelResults.
  • metrix/README.md: Documents distributed launcher usage and rank-suffixed outputs.
  • linex/tests/test_distributed_api.py: New tests for distributed helpers, rank-scoped output, deterministic UI dir choice, and launcher ordering.
  • linex/src/linex/mcp/server.py: Adds launcher plumbing + per-rank outputs to MCP responses (currently contains a syntax error).
  • linex/src/linex/distributed.py: New distributed helper module for Linex (context detection + argv normalization).
  • linex/src/linex/api.py: Adds distributed context tracking, rank-scoped output dirs, launcher support, and RankProfile aggregation.
  • linex/src/linex/__init__.py: Exports RankProfile in the public API.
  • linex/README.md: Documents distributed launcher usage and new distributed properties.
Comments suppressed due to low confidence (3)

metrix/src/metrix/backends/gfx942.py:107

  • _run_rocprof now takes launcher, but the implementation ignores it when calling ROCProfV3Wrapper.profile(...). This makes launcher a no-op for this backend. Pass launcher=launcher through to the wrapper call (and ensure callers forward it).
    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
    ) -> List[ProfileResult]:
        """Run rocprofv3 and return results (single pass only - base class handles multi-pass)"""
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        return wrapper.profile(command, counters, kernel_filter=kernel_filter, cwd=cwd)

metrix/src/metrix/backends/gfx1201.py:74

  • _run_rocprof accepts launcher but does not forward it to ROCProfV3Wrapper.profile(...), so --launcher has no effect for gfx1201. Pass launcher=launcher to the wrapper call (and ensure the base class forwards it when invoking _run_rocprof).
    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
        kernel_iteration_range: Optional[str] = None,
    ) -> List[ProfileResult]:
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        extra_counters_path = Path(__file__).parent / "counter_defs.yaml"

        return wrapper.profile(
            command=command,
            counters=counters,
            kernel_filter=kernel_filter,
            cwd=cwd,
            kernel_iteration_range=kernel_iteration_range,
            extra_counters_path=extra_counters_path if extra_counters_path.exists() else None,
            arch=self.device_specs.arch,
        )

metrix/src/metrix/backends/gfx90a.py:107

  • _run_rocprof now accepts a launcher parameter, but it isn’t passed down to ROCProfV3Wrapper.profile(...), so the launcher never affects the actual subprocess command. Pass launcher=launcher through to the wrapper call (and ensure the base class forwards it when invoking _run_rocprof).
    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
    ) -> List[ProfileResult]:
        """Run rocprofv3 and return results (single pass only - base class handles multi-pass)"""
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        return wrapper.profile(command, counters, kernel_filter=kernel_filter, cwd=cwd)



mawad-amd and others added 13 commits March 23, 2026 01:03
… Python 3.8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… batch path, clarify launcher semantics

- Move profile_command docstring above code (was displaced)
- Forward launcher param in CounterBackend recursive batch call
- Add Note sections to Linex.profile() and Metrix.profile() explaining
  that launcher is for mpirun-style use, and for torchrun the correct
  pattern is running metrix/linex under torchrun (not the reverse)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The launcher was being prepended before rocprofv3, producing commands like
`torchrun rocprofv3 ... -- app` which fails because torchrun expects Python
scripts. The correct structure is `rocprofv3 ... -- torchrun ... app` so
rocprofv3 traces the entire process tree including launcher-spawned workers.

Also fixes launcher not being forwarded through the metrix call chain:
- CounterBackend.profile() now passes launcher to _run_rocprof()
- All three backends (gfx942, gfx90a, gfx1201) now forward launcher to
  ROCProfV3Wrapper.profile()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When rocprofv3 wraps a multi-process launcher (e.g. torchrun), the
kernel trace CSV can contain rows with null timestamp fields. int(None)
raises TypeError, which wasn't caught. Add TypeError to the exception
handlers in both _parse_kernel_trace and _parse_output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
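The failure mode this commit fixes is easy to reproduce: `int()` raises TypeError for None and ValueError for a blank string, so both must be caught when parsing kernel-trace rows produced under a multi-process launcher. `parse_timestamp` below is an illustrative helper, not the PR's actual parsing code:

```python
# Minimal illustration of the fix: tolerate null/blank timestamp fields
# in kernel-trace CSV rows instead of crashing on int(None).
from typing import Optional

def parse_timestamp(value) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None  # null or blank field from a launcher-spawned row
```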
Update test assertions to expect rocprofv3 wrapping the launcher
(rocprofv3 ... -- torchrun ... app) instead of the reverse. Run
ruff format on touched files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a launcher (e.g. torchrun) is specified, instead of wrapping the
entire process tree under a single rocprofv3 instance (which merges all
rank counters), we now generate a Python wrapper script that the
launcher spawns per-worker. Each worker reads its RANK from the
environment and runs its own rocprofv3 with a rank-specific output
directory (rank_0/, rank_1/, etc.). After execution, the parent process
collects and parses each rank's output independently, populating
ProfileResult/RankProfile with correct per-rank metadata.

This ensures hardware counters are properly attributed to individual
ranks rather than being mixed across the distributed process tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes found during live testing on hpcfund MI250X with torchrun:

1. Remove spurious `python3` prefix from wrapper command — torchrun
   already invokes the script with Python, so including it caused
   torchrun to try to run `python3` as a Python script file.

2. Rewrite --input YAML per-rank to avoid rocprofv3 "conflicting value
   for output_directory" error — the YAML's output_directory field
   conflicted with the -d flag. The wrapper now creates a per-rank
   copy of the YAML with the rank-specific output directory.

3. Handle single-element argv containing spaces — torchrun passes
   quoted commands as one argv element via argparse.REMAINDER. Both
   normalize_command_argv() and the wrapper script now apply
   shlex.split() when a single-element list contains spaces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
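The argv-normalization fix in item 3 can be sketched as below. The `normalize_command_argv` name and the single-element-with-spaces rule come from this PR; the body is a simplified assumption about its behavior:

```python
# Sketch of argv normalization: accept a string or a sequence, and
# re-split a single element containing spaces, as happens when
# argparse.REMAINDER hands over one quoted command.
import shlex
from typing import Sequence, Union

def normalize_command_argv(command: Union[str, Sequence[str]]) -> list:
    if isinstance(command, str):
        return shlex.split(command)
    argv = list(command)
    if len(argv) == 1 and " " in argv[0]:
        return shlex.split(argv[0])  # quoted command passed as one element
    return argv
```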
…ision

The merge key for combining counter data across profiling passes was
(kernel_name, dispatch_id, run_id), which caused dispatches from
different ranks with the same kernel name and dispatch ID to collide
silently. In a 2-rank all-reduce, this meant rank 1's 330 dispatches
were merged into rank 0's, losing per-rank resolution.

Adding global_rank to the key tuple ensures each rank's dispatches
are tracked independently through the multi-pass merge pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
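The collision this commit describes is straightforward to demonstrate with toy dispatch records (simplified dicts here; the real merge pipeline operates on richer objects):

```python
# Toy illustration of the merge-key fix: without global_rank in the key,
# identical (kernel_name, dispatch_id, run_id) tuples from two ranks
# silently collide in a dict keyed by that tuple.
def merge_key(d: dict) -> tuple:
    return (d["kernel_name"], d["dispatch_id"], d["run_id"], d["global_rank"])

dispatches = [
    {"kernel_name": "all_reduce", "dispatch_id": 7, "run_id": 0, "global_rank": 0},
    {"kernel_name": "all_reduce", "dispatch_id": 7, "run_id": 0, "global_rank": 1},
]
merged = {merge_key(d): d for d in dispatches}  # both ranks survive
```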
The CLI was missing two things from PR #85:
- dispatch_index in mismatch JSON output (field existed on
  ArrayMismatch but wasn't serialized by the CLI)
- --atol, --rtol, --equal-nan flags (API had them but CLI
  only exposed legacy --tolerance)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>