This repository contains the AINIC network plugin, which extends AMD's RCCL library with networking capabilities.
- Overview
- System Requirements
- Cloning and Building RCCL
- Cloning This Plugin Project
- Build Instructions
- Install Instructions
- Cleanup Instructions
- Enabling Telemetry
- Device Status JSON
ANP is a plugin library designed to enhance the RCCL collective communication library with extended network transport support. The Makefile here compiles this plugin against a locally built RCCL and a ROCm environment.
- Operating System: Linux-based distributions (e.g., Ubuntu, RHEL, CentOS).
- ROCm:
- Installed ROCm (tested with ROCm 6.x or later).
- Default ROCm path is /opt/rocm; use ROCM_PATH to override.
- Dependencies:
- A working build of RCCL.
- hipcc from the ROCm toolchain.
- System libraries for network communication (e.g., libibverbs).
If you do not already have a built version of RCCL, follow the steps here: https://github.com/ROCm/rccl
In a separate directory of your choice:
git clone https://github.com/ROCm/amd-anp.git
cd amd-anp
- Ensure RCCL is built: The rccl/build directory must contain librccl.so, hipify/, include/, etc.
- Set RCCL_HOME: Provide the path to the RCCL source tree.
export RCCL_HOME=/home/user/rccl-src/
- Optional: Set RCCL_BUILD: If the RCCL artifacts are not in $RCCL_HOME/build/release, point to the build artifacts using the RCCL_BUILD environment variable. For example:
export RCCL_BUILD=/home/user/rccl/build
- Optional: ROCM_PATH: If ROCm is in a custom directory (not /opt/rocm), specify:
make RCCL_HOME=$RCCL_HOME ROCM_PATH=/path/to/rocm
- Build Without Telemetry (Default):
To build the plugin without the telemetry features (the default behavior), run the make command:
make RCCL_HOME=$RCCL_HOME
If successful, you will see librccl-anp.so in the build/ folder of this plugin project. Example:
make RCCL_HOME=/home/user/rccl-src/
- Build With Telemetry Enabled:
To build the plugin with telemetry features enabled, pass the flag ANP_TELEMETRY_ENABLED=1 to the build command:
make ANP_TELEMETRY_ENABLED=1 RCCL_HOME=$RCCL_HOME
If successful, you will see librccl-anp.so in the build/ folder of this plugin project. Example:
make ANP_TELEMETRY_ENABLED=1 RCCL_HOME=/home/user/rccl-src/
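The path rules above can be checked before invoking make. The following is a minimal sketch, not part of the plugin: the helper names `resolve_build_paths` and `missing_artifacts` are hypothetical, and the list of expected artifacts is an assumption based on the files this README says the RCCL build directory must contain.

```python
from pathlib import Path


def resolve_build_paths(env):
    """Resolve the directories described in the build instructions.

    `env` is any mapping of environment variables (for example os.environ).
    """
    rccl_home = env.get("RCCL_HOME")
    if not rccl_home:
        raise ValueError("RCCL_HOME must point to the RCCL source tree")
    rccl_home = Path(rccl_home)
    # RCCL_BUILD is optional; the documented default is $RCCL_HOME/build/release.
    rccl_build = Path(env.get("RCCL_BUILD") or rccl_home / "build" / "release")
    # ROCM_PATH is optional; the documented default is /opt/rocm.
    rocm_path = Path(env.get("ROCM_PATH") or "/opt/rocm")
    return {"RCCL_HOME": rccl_home, "RCCL_BUILD": rccl_build, "ROCM_PATH": rocm_path}


def missing_artifacts(rccl_build):
    """Return the expected RCCL build artifacts that are absent.

    Assumption: only librccl.so and include/ are checked here.
    """
    expected = ["librccl.so", "include"]
    return [name for name in expected if not (Path(rccl_build) / name).exists()]
```

Calling `resolve_build_paths(os.environ)` and then `missing_artifacts(paths["RCCL_BUILD"])` before running make gives an early, readable error when RCCL has not been built yet.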
To install the plugin into your ROCm library path, run:
sudo make RCCL_HOME=$RCCL_HOME ROCM_PATH=/path/to/rocm install
This copies librccl-anp.so to <ROCM_PATH>/lib.
<ROCM_PATH> defaults to /opt/rocm unless overridden by ROCM_PATH.
To load this specific AINIC plugin library when running RCCL, set the environment variable NCCL_NET_PLUGIN=librccl-anp.so (for example, by passing -x NCCL_NET_PLUGIN=librccl-anp.so to Open MPI's mpirun).
With this variable set, RCCL loads this AINIC plugin library instead of the default plugin library librccl-net.so.
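When a job is launched from a script, the plugin selection can be passed through the child process environment. This is a minimal sketch with hypothetical helper names (`anp_plugin_env`, `mpirun_export_flags`); the `-x NAME=value` form is the Open MPI mpirun option for exporting a variable, and other launchers may need a different mechanism.

```python
import os


def anp_plugin_env(base_env=None, plugin="librccl-anp.so"):
    """Return a copy of the environment with NCCL_NET_PLUGIN set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = plugin
    return env


def mpirun_export_flags(variables):
    """Build Open MPI style '-x NAME=value' arguments from a mapping."""
    flags = []
    for name, value in variables.items():
        flags.extend(["-x", f"{name}={value}"])
    return flags


# Example (the benchmark binary name is illustrative only):
#   subprocess.run(["./all_reduce_perf"], env=anp_plugin_env())
#   ["mpirun", *mpirun_export_flags({"NCCL_NET_PLUGIN": "librccl-anp.so"}), ...]
```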
The clean target is used to remove the build directory and its contents. This is useful for starting a fresh build or for cleaning up intermediate build artifacts.
Usage:
make clean
The uninstall target is used to remove the compiled plugin library librccl-anp.so from the installation path <ROCM_PATH>/lib.
Usage:
make uninstall
The AMD ANP plugin provides telemetry capabilities for monitoring device status and performance. The telemetry data is captured and stored in JSON format, giving insights into communication efficiency and queue pair operations. This feature is part of the supported telemetry suite and helps in performance analysis and debugging.
To enable telemetry, the plugin must be compiled with ANP_TELEMETRY_ENABLED=1. It then reads its configuration from a JSON file whose location is specified by the environment variable RCCL_ANP_CONFIG_FILE. If RCCL_ANP_CONFIG_FILE is not set or the JSON file is unreadable, the plugin uses default values for the configuration.
export RCCL_ANP_CONFIG_FILE=/path/to/config.json
{
"log_level": "ERROR",
"output_dir": "/tmp",
"max_buckets": 5,
"bucket_interval_ns": 30000
}
- log_level: Specifies the log level for telemetry logs.
- output_dir: Specifies the output directory for files generated by the plugin.
- max_buckets: Maximum number of histogram buckets for latency tracking.
- bucket_interval_ns: Time interval (in nanoseconds) for latency bucket division.
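A configuration file with these four keys can be generated from a script. The sketch below is illustrative: `write_anp_config` is a hypothetical helper, and the values in `SAMPLE_CONFIG` are the sample values shown in this README, not confirmed plugin defaults.

```python
import json

# Sample values copied from the configuration example in this README.
SAMPLE_CONFIG = {
    "log_level": "ERROR",
    "output_dir": "/tmp",
    "max_buckets": 5,
    "bucket_interval_ns": 30000,
}


def write_anp_config(path, **overrides):
    """Write a telemetry config file and return the env var that points to it."""
    unknown = set(overrides) - set(SAMPLE_CONFIG)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    config = {**SAMPLE_CONFIG, **overrides}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    # The plugin is documented to locate the file through this variable.
    return {"RCCL_ANP_CONFIG_FILE": str(path)}
```

The returned mapping can be merged into the environment of the process that runs RCCL, which keeps the file path and the variable in sync.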
AMD ANP plugin generates a JSON file containing detailed device status and performance metrics. The JSON structure provides a hierarchical view of device status, including:
- Device metadata (host, process details, RoCE device, etc.).
- Channel-level stats, including queue pairs.
- Queue pair stats, covering WQE send/receive/completion data.
- Latency histogram for WQE completions.
- Aggregated statistics, including WQE size distribution and overall counts.
This generated data serves as part of the supported telemetry features. It can be used to monitor network performance, analyze latencies, and optimize communication between devices.
The JSON contains a list of devices, where each device has:
- Status Information: Metadata about the device and the running process.
- Channels: Each device contains multiple channels.
- Queue Pairs (QP): Each channel has multiple queue pairs that handle communication.
- Statistics: Various performance metrics are recorded at different levels.
Each device contains the following information under the status key:
| Key | Description |
|---|---|
| host_name | Host machine name running the process |
| process_name | Name of the running process |
| process_id | Process ID |
| start_time | Start time of the process |
| end_time | End time of the process |
| device_id | Unique identifier of the device |
| eth_device | Ethernet device name (if applicable) |
| roce_device | RoCE (RDMA over Converged Ethernet) device name |
| num_channels | Number of channels in the device |
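As a sketch of how these fields might be consumed, the snippet below computes the run duration from start_time and end_time. It assumes timestamps use the `YYYY-MM-DD HH:MM:SS` layout and that numeric fields are emitted as JSON strings, as in this README's sample output; `run_duration_seconds` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the sample output in this README.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def run_duration_seconds(status):
    """Return end_time minus start_time in seconds for one status object."""
    start = datetime.strptime(status["start_time"], TIME_FORMAT)
    end = datetime.strptime(status["end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


sample_status = {
    "host_name": "test1",
    "process_name": "all_reduce_perf",
    "process_id": "155048",
    "start_time": "2025-03-12 05:25:58",
    "end_time": "2025-03-12 05:26:15",
    "device_id": "0",
    "eth_device": "",
    "roce_device": "roce_ai3",
    "num_channels": "16",
}

duration = run_duration_seconds(sample_status)
# Counters arrive as strings in the sample, so convert before doing arithmetic.
num_channels = int(sample_status["num_channels"])
```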
Example:
"status": {
"host_name": "test1",
"process_name": "all_reduce_perf",
"process_id": "155048",
"start_time": "2025-03-12 05:25:58",
"end_time": "2025-03-12 05:26:15",
"device_id": "0",
"eth_device": "",
"roce_device": "roce_ai3",
"num_channels": "16"
}
Each device has multiple channels, where each channel contains:
- Queue Pairs (queue_pairs): Communication endpoints for message exchanges.
- Statistics (stats): Aggregated stats across queue pairs in the channel.
Each channel has the following keys:

| Key | Description |
|---|---|
| id | Unique channel identifier |
| num_queue_pairs | Number of queue pairs in the channel |
| queue_pairs | List of queue pairs in this channel |
| stats | Aggregated statistics for the channel |

The channel stats object has the following keys:

| Key | Description |
|---|---|
| num_wqe_sent | Number of WQEs (Work Queue Entries) sent |
| num_wqe_rcvd | Number of WQEs received |
| num_cts_sent | Number of CTS messages sent |
| num_data_qp | Number of data queue pairs |
| num_cts_qp | Number of CTS queue pairs |
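Channel stats can be rolled up into per-device totals by summing each counter across channels. This is a minimal sketch: `total_channel_stats` is a hypothetical helper, and it assumes every counter is a decimal integer encoded as a string, as in this README's sample output.

```python
def total_channel_stats(channel_stats):
    """Sum each counter across a list of channel stats objects."""
    totals = {}
    for stats in channel_stats:
        for key, value in stats.items():
            # Counters are strings in the sample output, so convert first.
            totals[key] = totals.get(key, 0) + int(value)
    return totals


# Two channels carrying the sample values shown in this README.
sample_channel = {
    "num_wqe_sent": "4752",
    "num_wqe_rcvd": "4752",
    "num_cts_sent": "59232",
    "num_data_qp": "1",
    "num_cts_qp": "1",
}
device_totals = total_channel_stats([sample_channel, sample_channel])
```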
Example:
"stats": {
"num_wqe_sent": "4752",
"num_wqe_rcvd": "4752",
"num_cts_sent": "59232",
"num_data_qp": "1",
"num_cts_qp": "1"
}
Each queue_pairs entry contains:
- Queue Pair ID
- Status
- Statistics
| Key | Description |
|---|---|
| id | Unique queue pair identifier |
| status | Contains additional queue pair-specific info |
| stats | Performance metrics for this queue pair |

The queue pair stats object has the following keys:

| Key | Description |
|---|---|
| num_wqe_sent | Number of WQEs sent |
| num_wqe_rcvd | Number of WQEs received |
| num_wqe_completed | Number of WQEs completed |
| num_slot_miss | Number of slot misses |
| wqe_completion_ns_min | Minimum WQE completion latency (ns) |
| wqe_completion_ns_max | Maximum WQE completion latency (ns) |
| wqe_completion_metrics | Histogram of WQE completion latencies |
Example:
"stats": {
"num_wqe_sent": "4752",
"num_wqe_rcvd": "4752",
"num_wqe_completed": "4752",
"num_slot_miss": "0",
"wqe_completion_ns_min": "8362",
"wqe_completion_ns_max": "101821",
"wqe_completion_metrics": [
{
"latency_in_ns": "32767",
"num_wqe": "4173"
},
{
"latency_in_ns": "65535",
"num_wqe": "527"
}
]
}
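The histogram in wqe_completion_metrics can be turned into a cumulative view to see what share of completions fall at or below each latency value. This sketch rests on two assumptions that this README does not state: that latency_in_ns is the upper bound of each bucket, and that buckets are listed in increasing order. `completion_histogram` is a hypothetical helper name.

```python
def completion_histogram(qp_stats):
    """Return (latency_in_ns, num_wqe, cumulative_fraction) per bucket.

    The fraction is relative to num_wqe_completed for the queue pair.
    """
    completed = int(qp_stats["num_wqe_completed"])
    rows = []
    running = 0
    for bucket in qp_stats["wqe_completion_metrics"]:
        count = int(bucket["num_wqe"])
        running += count
        fraction = running / completed if completed else 0.0
        rows.append((int(bucket["latency_in_ns"]), count, fraction))
    return rows


# Subset of the sample queue pair stats shown in this README.
sample_qp_stats = {
    "num_wqe_completed": "4752",
    "wqe_completion_metrics": [
        {"latency_in_ns": "32767", "num_wqe": "4173"},
        {"latency_in_ns": "65535", "num_wqe": "527"},
    ],
}
histogram_rows = completion_histogram(sample_qp_stats)
```

Reporting the fraction against num_wqe_completed rather than the sum of bucket counts makes it visible when the listed buckets do not account for every completion.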