This directory contains pre-configured profiling setups for madengine. Most files target ROCprofv3; `rocm_trace_lite.json` / `rocm_trace_lite_default.json` enable rocm-trace-lite instead (not rocprofv3; do not combine them with rocprof / rocprofv3_* presets on the same run).
### rocprofv3_compute_bound.json

Use Case: Models bottlenecked by ALU operations (e.g., large transformers with dense matrix operations)
Collected Metrics:
- Wave execution and cycles
- VALU (Vector ALU) instructions
- SALU (Scalar ALU) instructions
- Wait states
- GPU power consumption
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_compute_bound.json
```

### rocprofv3_memory_bound.json

Use Case: Models bottlenecked by memory bandwidth (e.g., large batch sizes, high-resolution inputs)
Collected Metrics:
- L1/L2 cache hit rates
- Memory read/write requests
- Cache efficiency
- VRAM usage over time
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_memory_bound.json
```

### rocprofv3_multi_gpu.json

Use Case: Multi-GPU training with data parallelism or model parallelism
Collected Metrics:
- RCCL communication traces
- Inter-GPU memory transfers
- Scratch memory allocation
- Per-GPU power and VRAM
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json
```

### rocprofv3_comprehensive.json

Use Case: Full analysis with all available metrics (high overhead!)
Collected Metrics:
- All kernel traces (HIP, HSA, kernel, memory)
- Hardware performance counters
- Library call traces (MIOpen, rocBLAS)
- Power and VRAM monitoring
- Statistical summaries
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_comprehensive.json
```

Warning: This profile has significant overhead. Use for detailed analysis only.

### rocprofv3_lightweight.json

Use Case: Production-like workloads with minimal profiling overhead
Collected Metrics:
- Basic HIP and kernel traces
- JSON output format (compact)
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_lightweight.json
```

### rocm_trace_lite.json / rocm_trace_lite_default.json

Use Case: GPU kernel dispatch tracing without rocprofiler-sdk (SQLite output compatible with RPD-style tools). See the rocm-trace-lite documentation and Quick Start.

- `rocm_trace_lite.json` (tool `rocm_trace_lite`): RTL `lite` mode (typically lower overhead).
- `rocm_trace_lite_default.json` (tool `rocm_trace_lite_default`): RTL `default` mode (broader coverage; compare overhead vs `lite`).
Do not combine with rocprof / rocprofv3_* on the same run.
Requirements / notes:
- madengine wraps the workload with `rtl_trace_wrapper.sh` and writes under `rocm_trace_lite_output/` (see Profiling Guide).
- On the first run, the trace pre-script installs `rocm-trace-lite` from a GitHub Release wheel (not PyPI). The container needs HTTPS access to GitHub, unless the wheel is already installed in the image.
- The default install uses a pinned wheel URL in the trace pre-script. Set `ROCM_TRACE_LITE_FOLLOW_LATEST=1` to pull the latest release via the GitHub API instead (needs `curl`). Override with `ROCM_TRACE_LITE_WHEEL_URL` (direct `.whl` URL) for air-gapped or custom platforms. Automation targets Linux x86_64 wheels.
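For example, an air-gapped setup might export the override before invoking madengine. This is a minimal sketch: the mirror URL and wheel version below are placeholders, not real artifacts.

```shell
# Hypothetical air-gapped setup: point the trace pre-script at an internally
# mirrored wheel instead of GitHub. URL and version below are placeholders.
export ROCM_TRACE_LITE_WHEEL_URL="https://mirror.example.internal/wheels/rocm_trace_lite-1.0.0-py3-none-linux_x86_64.whl"

# Alternatively, when GitHub is reachable, follow the latest release instead
# of the pinned default:
# export ROCM_TRACE_LITE_FOLLOW_LATEST=1

echo "wheel source: ${ROCM_TRACE_LITE_WHEEL_URL}"
```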
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite.json

madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite_default.json
```

### rocprofv3_multi_node.json

Use Case: Large-scale distributed training on SLURM clusters
Collected Metrics:
- RCCL communication patterns
- Cross-node synchronization
- Per-node power monitoring
Usage:

```shell
# Build phase
madengine build --tags your_model --registry your-registry:5000

# Deploy to SLURM
madengine run --manifest-file build_manifest.json \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_node.json
```

### Inline context examples

Select a preset tool inline instead of using a config file:

```shell
madengine run --tags dummy_prof \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprofv3_compute"}]
  }'
```

Multi-GPU distributed training with communication tracing:

```shell
madengine run --tags your_model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "all",
    "distributed": {
      "launcher": "torchrun",
      "nproc_per_node": 8
    },
    "tools": [{"name": "rocprofv3_communication"}]
  }'
```

A fully custom profiling command with environment variables:

```shell
madengine run --tags your_model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --kernel-trace --memory-copy-trace --output-format pftrace -d ./my_traces --",
      "env_vars": {
        "RCCL_DEBUG": "TRACE",
        "HSA_ENABLE_SDMA": "0"
      }
    }]
  }'
```

### The trailing `--` separator

When using custom profiling commands with rocprof_wrapper.sh, always include the trailing `--`:

```json
{
  "name": "rocprof",
  "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace --"
}
```

Why? The `--` separator is critical for rocprofv3 (ROCm >= 7.0):

- rocprofv3 requires: `rocprofv3 [options] -- <application>`
- rocprof (legacy) accepts: `rocprof [options] <application>`

The wrapper script auto-detects which profiler is available and formats the command correctly. Without the `--`, rocprofv3 will fail to parse arguments when the application command is appended.

❌ Wrong:

```json
{"cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace"}
```

✅ Correct:

```json
{"cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace --"}
```

### Available profiling tools

| Tool Name | Description | Key Options | Overhead |
|---|---|---|---|
| `rocprofv3_compute` | Compute-bound analysis | Counter collection, VALU/SALU metrics | Medium |
| `rocprofv3_memory` | Memory bandwidth analysis | Cache hits/misses, memory transfers | Medium |
| `rocprofv3_communication` | Multi-GPU communication | RCCL trace, scratch memory | Medium |
| `rocprofv3_full` | Comprehensive profiling | All traces + counters + stats | High |
| `rocprofv3_lightweight` | Minimal overhead | HIP + kernel trace only | Low |
| `rocprofv3_perfetto` | Perfetto visualization | Perfetto-compatible output | Medium |
| `rocprofv3_api_overhead` | API call analysis | HIP/HSA/marker traces with stats | Low |
| `rocprofv3_pc_sampling` | Kernel hotspot analysis | PC sampling at 1000 Hz | Medium |
Other: rocm_trace_lite (RTL lite mode) and rocm_trace_lite_default (RTL default mode) — kernel dispatch SQLite trace via rocm-trace-lite, installed from GitHub Release wheels by the trace pre-script (not PyPI; see Profiling Guide). Not a rocprofv3 preset; do not combine with rocprof / rocprofv3_* on the same run.
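The profiler auto-detection behind `rocprof_wrapper.sh` described earlier (rocprofv3 needs `--` before the application; legacy rocprof does not) can be sketched as follows. This is an illustrative sketch, not the actual wrapper contents; the function name is hypothetical.

```shell
# Illustrative sketch of profiler auto-detection (NOT the real rocprof_wrapper.sh):
# build the final command line for whichever profiler is installed.
build_profiler_cmd() {
  prof_opts="$1"; shift
  if command -v rocprofv3 >/dev/null 2>&1; then
    # rocprofv3 (ROCm >= 7.0) requires '--' between its options and the app
    echo "rocprofv3 $prof_opts -- $*"
  else
    # legacy rocprof appends the application directly
    echo "rocprof $prof_opts $*"
  fi
}

build_profiler_cmd "--sys-trace" ./my_app --batch-size 32
```

Either way, the application command lands after the profiler options, which is why the preset configs can pass a single `cmd` string and let the wrapper finish the line.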
Counter files are located at `src/madengine/scripts/common/tools/counters/`:

- `compute_bound.txt`: Wave execution, VALU/SALU instructions, wait states
- `memory_bound.txt`: Cache metrics, memory controller traffic, LDS usage
- `communication_bound.txt`: PCIe traffic, atomic operations, synchronization
- `full_profile.txt`: Comprehensive set of all important metrics
You can create custom counter files and reference them in your profiling commands.
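A custom counter file can follow the `pmc:` line format that rocprofv3 input files use; a minimal sketch, where the file name is hypothetical and the counter names are examples whose availability varies by GPU architecture (verify with `rocprofv3-avail`):

```text
# my_counters.txt - hypothetical custom counter set
pmc: SQ_WAVES SQ_INSTS_VALU SQ_INSTS_SALU
pmc: GRBM_GUI_ACTIVE
```

It can then be referenced from a custom `cmd`, assuming the wrapper forwards the profiler's `-i` input-file option, e.g. `bash ../scripts/common/tools/rocprof_wrapper.sh -i ./my_counters.txt --kernel-trace --`.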
### Output files

After profiling, madengine writes outputs to the working directory:

```text
rocprof_output/
├── <timestamp>/
│   ├── *_results.db            # ROCprofv3 database (SQLite)
│   ├── kernel_trace.csv        # Kernel execution traces
│   ├── hip_api_trace.csv       # HIP API calls
│   └── memory_copy_trace.csv   # Memory transfers
├── model_trace.pftrace         # Perfetto format (if using rocprofv3_perfetto)
└── trace.json                  # JSON format (if using rocprofv3_lightweight)
gpu_info_power_profiler_output.csv   # Power consumption over time
gpu_info_vram_profiler_output.csv    # VRAM usage over time
library_trace.csv                    # Library API calls (if library tracing enabled)
rocm_trace_lite_output/trace.db      # rocm-trace-lite (also trace.json.gz / trace_summary.txt as emitted by RTL)
```

### Analyzing results

If using rocprofv3_perfetto or `--output-format pftrace`, upload the trace files to https://ui.perfetto.dev/ for visualization.

To inspect the ROCprofv3 SQLite database directly:

```python
import sqlite3
import pandas as pd

# Parse the ROCprofv3 database (substitute the actual timestamp and file name)
conn = sqlite3.connect('rocprof_output/<timestamp>/*_results.db')
kernels = pd.read_sql_query("SELECT * FROM kernels", conn)
print(kernels.head())
```

### Best practices

- Start lightweight: Use `rocprofv3_lightweight` for initial profiling
- Target your bottleneck: Use specific profiles (compute/memory/communication) based on initial findings
- Avoid full profiling in production: `rocprofv3_full` adds 20-50% overhead
- Multi-GPU: Always enable RCCL tracing for distributed workloads
- Sampling rates: Reduce sampling rates for long-running jobs (e.g., 1.0 instead of 0.1)
- Counter multiplexing: ROCprofv3 may need multiple runs if too many counters are requested
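Building on the database snippet above, per-kernel aggregation is a common first analysis step. The sketch below uses an in-memory mock table because the actual ROCprofv3 schema varies by version; treat `kernels(name, duration)` as a hypothetical shape, not the real schema.

```python
import sqlite3
import pandas as pd

# Mock database standing in for a real ROCprofv3 results .db file.
# Table and column names are hypothetical; inspect your database with
# "SELECT name FROM sqlite_master" to find the real layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kernels (name TEXT, duration REAL);
    INSERT INTO kernels VALUES
        ('gemm_kernel', 120.0), ('gemm_kernel', 80.0), ('softmax_kernel', 30.0);
""")

kernels = pd.read_sql_query("SELECT * FROM kernels", conn)

# Total and mean time per kernel, hottest first
summary = (kernels.groupby("name")["duration"]
           .agg(total="sum", mean="mean", count="count")
           .sort_values("total", ascending=False))
print(summary)
```

Sorting by total time rather than mean surfaces kernels that are cheap individually but dominate through call count.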
### Troubleshooting

Verify that rocprofv3 is installed and the ROCm version is recent enough:

```shell
# Check if rocprofv3 is available
which rocprofv3
rocprofv3 --version

# Verify ROCm version (>= 7.0 recommended for rocprofv3)
rocm-smi --version
```

Some counters may not be available on all GPU architectures. Check available counters:

```shell
rocprofv3-avail
```

If profiling overhead is too high, use rocprofv3_lightweight or reduce counter collection:
```shell
# Remove counter collection for minimal overhead
madengine run --tags your_model \
  --additional-context '{
    "tools": [{
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --kernel-trace --output-format json -d ./traces --"
    }]
  }'
```

### Example model runs

```shell
madengine run --tags pyt_vllm_llama2_7b \
  --additional-context-file examples/profiling-configs/rocprofv3_compute_bound.json

madengine run --tags pyt_torchtitan_llama3_8b \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json

madengine run --tags pyt_torchvision_resnet50 \
  --additional-context-file examples/profiling-configs/rocprofv3_memory_bound.json

madengine run --tags dummy_prof \
  --additional-context-file examples/profiling-configs/rocprofv3_lightweight.json
```