This directory contains pre-configured profiling setups for madengine. Most files target ROCprofv3; `rocm_trace_lite.json` / `rocm_trace_lite_default.json` enable rocm-trace-lite instead (not rocprofv3; do not combine them with rocprof / rocprofv3_* presets on the same run).
### rocprofv3_compute_bound.json

Use Case: Models bottlenecked by ALU operations (e.g., large transformers with dense matrix operations)
Collected Metrics:
- Wave execution and cycles
- VALU (Vector ALU) instructions
- SALU (Scalar ALU) instructions
- Wait states
- GPU power consumption
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_compute_bound.json
```

### rocprofv3_memory_bound.json

Use Case: Models bottlenecked by memory bandwidth (e.g., large batch sizes, high-resolution inputs)
Collected Metrics:
- L1/L2 cache hit rates
- Memory read/write requests
- Cache efficiency
- VRAM usage over time
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_memory_bound.json
```

### rocprofv3_multi_gpu.json

Use Case: Multi-GPU training with data parallelism or model parallelism
Collected Metrics:
- RCCL communication traces
- Inter-GPU memory transfers
- Scratch memory allocation
- Per-GPU power and VRAM
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json
```

### rocprofv3_comprehensive.json

Use Case: Full analysis with all available metrics (high overhead!)
Collected Metrics:
- All kernel traces (HIP, HSA, kernel, memory)
- Hardware performance counters
- Library call traces (MIOpen, rocBLAS)
- Power and VRAM monitoring
- Statistical summaries
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_comprehensive.json
```

Warning: This profile has significant overhead. Use for detailed analysis only.

### rocprofv3_lightweight.json

Use Case: Production-like workloads with minimal profiling overhead
Collected Metrics:
- Basic HIP and kernel traces
- JSON output format (compact)
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_lightweight.json
```

### rocm_trace_lite.json / rocm_trace_lite_default.json

Use Case: GPU kernel dispatch tracing without rocprofiler-sdk (SQLite output compatible with RPD-style tools). See the rocm-trace-lite documentation and Quick Start.

- `rocm_trace_lite.json` (tool `rocm_trace_lite`): RTL `lite` mode (typically lower overhead).
- `rocm_trace_lite_default.json` (tool `rocm_trace_lite_default`): RTL `default` mode (broader coverage; compare overhead vs `lite`).
Do not combine with rocprof / rocprofv3_* on the same run.
Requirements / notes:
- madengine wraps the workload with `rtl_trace_wrapper.sh` and writes under `rocm_trace_lite_output/` (see Profiling Guide).
- On the first run, the trace pre-script installs `rocm-trace-lite` from a GitHub Release wheel (not PyPI). The container needs HTTPS access to GitHub, unless the wheel is already installed in the image.
- The default install uses a pinned wheel URL in the trace pre-script. Set `ROCM_TRACE_LITE_FOLLOW_LATEST=1` to pull the latest release via the GitHub API instead (needs `curl`). Override with `ROCM_TRACE_LITE_WHEEL_URL` (direct `.whl` URL) for air-gapped or custom platforms. Automation targets Linux x86_64 wheels.
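For example, an air-gapped setup might export the override before invoking madengine. This is a minimal sketch: the mirror URL and wheel version below are placeholders, not real artifacts.

```shell
# Hypothetical air-gapped setup: point the trace pre-script at an internally
# mirrored wheel instead of GitHub. URL and version below are placeholders.
export ROCM_TRACE_LITE_WHEEL_URL="https://mirror.example.internal/wheels/rocm_trace_lite-1.0.0-py3-none-linux_x86_64.whl"

# Alternatively, when GitHub is reachable, follow the latest release instead
# of the pinned default:
# export ROCM_TRACE_LITE_FOLLOW_LATEST=1

echo "wheel source: ${ROCM_TRACE_LITE_WHEEL_URL}"
```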
Usage:

```shell
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite.json

madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite_default.json
```

### rocprofv3_multi_node.json

Use Case: Large-scale distributed training on SLURM clusters
Collected Metrics:
- RCCL communication patterns
- Cross-node synchronization
- Per-node power monitoring
Usage:

```shell
# Build phase
madengine build --tags your_model --registry your-registry:5000

# Deploy to SLURM
madengine run --manifest-file build_manifest.json \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_node.json
```

### Inline context examples

Select a preset tool inline instead of using a config file:

```shell
madengine run --tags dummy_prof \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprofv3_compute"}]
  }'
```

Multi-GPU distributed training with communication tracing:

```shell
madengine run --tags your_model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "all",
    "distributed": {
      "launcher": "torchrun",
      "nproc_per_node": 8
    },
    "tools": [{"name": "rocprofv3_communication"}]
  }'
```

A fully custom profiling command with environment variables:

```shell
madengine run --tags your_model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --kernel-trace --memory-copy-trace --output-format pftrace -d ./my_traces --",
      "env_vars": {
        "RCCL_DEBUG": "TRACE",
        "HSA_ENABLE_SDMA": "0"
      }
    }]
  }'
```

### The trailing `--` separator

When using custom profiling commands with rocprof_wrapper.sh, always include the trailing `--`:

```json
{
  "name": "rocprof",
  "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace --"
}
```

Why? The `--` separator is critical for rocprofv3 (ROCm >= 7.0):

- rocprofv3 requires: `rocprofv3 [options] -- <application>`
- rocprof (legacy) accepts: `rocprof [options] <application>`

The wrapper script auto-detects which profiler is available and formats the command correctly. Without the `--`, rocprofv3 will fail to parse arguments when the application command is appended.

❌ Wrong:

```json
{"cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace"}
```

✅ Correct:

```json
{"cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace --"}
```

### Available profiling tools

| Tool Name | Description | Key Options | Overhead |
|---|---|---|---|
| `rocprofv3_compute` | Compute-bound analysis | Counter collection, VALU/SALU metrics | Medium |
| `rocprofv3_memory` | Memory bandwidth analysis | Cache hits/misses, memory transfers | Medium |
| `rocprofv3_communication` | Multi-GPU communication | RCCL trace, scratch memory | Medium |
| `rocprofv3_full` | Comprehensive profiling | All traces + counters + stats | High |
| `rocprofv3_lightweight` | Minimal overhead | HIP + kernel trace only | Low |
| `rocprofv3_perfetto` | Perfetto visualization | Perfetto-compatible output | Medium |
| `rocprofv3_api_overhead` | API call analysis | HIP/HSA/marker traces with stats | Low |
| `rocprofv3_pc_sampling` | Kernel hotspot analysis | PC sampling at 1000 Hz | Medium |
Other: rocm_trace_lite (RTL lite mode) and rocm_trace_lite_default (RTL default mode) — kernel dispatch SQLite trace via rocm-trace-lite, installed from GitHub Release wheels by the trace pre-script (not PyPI; see Profiling Guide). Not a rocprofv3 preset; do not combine with rocprof / rocprofv3_* on the same run.
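The profiler auto-detection behind `rocprof_wrapper.sh` described earlier (rocprofv3 needs `--` before the application; legacy rocprof does not) can be sketched as follows. This is an illustrative sketch, not the actual wrapper contents; the function name is hypothetical.

```shell
# Illustrative sketch of profiler auto-detection (NOT the real rocprof_wrapper.sh):
# build the final command line for whichever profiler is installed.
build_profiler_cmd() {
  prof_opts="$1"; shift
  if command -v rocprofv3 >/dev/null 2>&1; then
    # rocprofv3 (ROCm >= 7.0) requires '--' between its options and the app
    echo "rocprofv3 $prof_opts -- $*"
  else
    # legacy rocprof appends the application directly
    echo "rocprof $prof_opts $*"
  fi
}

build_profiler_cmd "--sys-trace" ./my_app --batch-size 32
```

Either way, the application command lands after the profiler options, which is why the preset configs can pass a single `cmd` string and let the wrapper finish the line.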
Counter files are located at `src/madengine/scripts/common/tools/counters/`:

- `compute_bound.txt`: Wave execution, VALU/SALU instructions, wait states
- `memory_bound.txt`: Cache metrics, memory controller traffic, LDS usage
- `communication_bound.txt`: PCIe traffic, atomic operations, synchronization
- `full_profile.txt`: Comprehensive set of all important metrics
You can create custom counter files and reference them in your profiling commands.
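A custom counter file can follow the `pmc:` line format that rocprofv3 input files use; a minimal sketch, where the file name is hypothetical and the counter names are examples whose availability varies by GPU architecture (verify with `rocprofv3-avail`):

```text
# my_counters.txt - hypothetical custom counter set
pmc: SQ_WAVES SQ_INSTS_VALU SQ_INSTS_SALU
pmc: GRBM_GUI_ACTIVE
```

It can then be referenced from a custom `cmd`, assuming the wrapper forwards the profiler's `-i` input-file option, e.g. `bash ../scripts/common/tools/rocprof_wrapper.sh -i ./my_counters.txt --kernel-trace --`.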
### Output files

After profiling, madengine writes outputs to the working directory:

```text
rocprof_output/
├── <timestamp>/
│   ├── *_results.db            # ROCprofv3 database (SQLite)
│   ├── kernel_trace.csv        # Kernel execution traces
│   ├── hip_api_trace.csv       # HIP API calls
│   └── memory_copy_trace.csv   # Memory transfers
├── model_trace.pftrace         # Perfetto format (if using rocprofv3_perfetto)
└── trace.json                  # JSON format (if using rocprofv3_lightweight)
gpu_info_power_profiler_output.csv   # Power consumption over time
gpu_info_vram_profiler_output.csv    # VRAM usage over time
library_trace.csv                    # Library API calls (if library tracing enabled)
rocm_trace_lite_output/trace.db      # rocm-trace-lite (also trace.json.gz / trace_summary.txt as emitted by RTL)
```

### Analyzing results

If using rocprofv3_perfetto or `--output-format pftrace`, upload the trace files to https://ui.perfetto.dev/ for visualization.

To inspect the ROCprofv3 SQLite database directly:

```python
import sqlite3
import pandas as pd

# Parse the ROCprofv3 database (substitute the actual timestamp and file name)
conn = sqlite3.connect('rocprof_output/<timestamp>/*_results.db')
kernels = pd.read_sql_query("SELECT * FROM kernels", conn)
print(kernels.head())
```

### Best practices

- Start lightweight: Use `rocprofv3_lightweight` for initial profiling
- Target your bottleneck: Use specific profiles (compute/memory/communication) based on initial findings
- Avoid full profiling in production: `rocprofv3_full` adds 20-50% overhead
- Multi-GPU: Always enable RCCL tracing for distributed workloads
- Sampling rates: Reduce sampling rates for long-running jobs (e.g., 1.0 instead of 0.1)
- Counter multiplexing: ROCprofv3 may need multiple runs if too many counters are requested
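Building on the database snippet above, per-kernel aggregation is a common first analysis step. The sketch below uses an in-memory mock table because the actual ROCprofv3 schema varies by version; treat `kernels(name, duration)` as a hypothetical shape, not the real schema.

```python
import sqlite3
import pandas as pd

# Mock database standing in for a real ROCprofv3 results .db file.
# Table and column names are hypothetical; inspect your database with
# "SELECT name FROM sqlite_master" to find the real layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kernels (name TEXT, duration REAL);
    INSERT INTO kernels VALUES
        ('gemm_kernel', 120.0), ('gemm_kernel', 80.0), ('softmax_kernel', 30.0);
""")

kernels = pd.read_sql_query("SELECT * FROM kernels", conn)

# Total and mean time per kernel, hottest first
summary = (kernels.groupby("name")["duration"]
           .agg(total="sum", mean="mean", count="count")
           .sort_values("total", ascending=False))
print(summary)
```

Sorting by total time rather than mean surfaces kernels that are cheap individually but dominate through call count.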
### Troubleshooting

Verify that rocprofv3 is installed and the ROCm version is recent enough:

```shell
# Check if rocprofv3 is available
which rocprofv3
rocprofv3 --version

# Verify ROCm version (>= 7.0 recommended for rocprofv3)
rocm-smi --version
```

Some counters may not be available on all GPU architectures. Check available counters:

```shell
rocprofv3-avail
```

If profiling overhead is too high, use rocprofv3_lightweight or reduce counter collection:
```shell
# Remove counter collection for minimal overhead
madengine run --tags your_model \
  --additional-context '{
    "tools": [{
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --kernel-trace --output-format json -d ./traces --"
    }]
  }'
```

### Example model runs

```shell
madengine run --tags pyt_vllm_llama2_7b \
  --additional-context-file examples/profiling-configs/rocprofv3_compute_bound.json

madengine run --tags pyt_torchtitan_llama3_8b \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json

madengine run --tags pyt_torchvision_resnet50 \
  --additional-context-file examples/profiling-configs/rocprofv3_memory_bound.json

madengine run --tags dummy_prof \
  --additional-context-file examples/profiling-configs/rocprofv3_lightweight.json
```