@ashesh2512

Distributed PyTorch launchers such as ``torchrun`` often lack the flexibility to map tasks to GPUs according to the NUMA domains of Frontier, and users have often reported subpar performance when using these launchers. Consequently, it is useful to know whether a distributed PyTorch program is making optimal use of the node's resources. ``numa_api`` leverages the ``numactl`` library to identify which cores a process is bound to, as well as the GPU associated with those cores for optimal binding. This is useful because users typically set the GPU IDs manually in frameworks like PyTorch and TensorFlow.

.. code-block::

   core affinity for PID 1132009: 41 42 43 44 45 46 47
   Suggested GPU for PID 1132009: 7
In contrast, using ``srun --gpus-per-task=8 --gpu-bind=closest torchrun --nproc_per_node=8 --nnodes=1 --rdzv-id=$SLURM_JOBID --rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:3440`` results in the following output: here, ``torchrun`` launches every task on the same core, resulting in subpar performance. Note that the GPU suggestions printed by ``numa_api`` are based on the NUMA regions associated with the cores, as shown in :ref:`frontier-nodes`.
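To make the mapping concrete, below is a minimal sketch of the idea behind the suggestion: query the cores a process is bound to and look up the GCD attached to the same L3/NUMA region. The core-to-GCD table and the ``suggested_gcd`` helper are illustrative assumptions based on the node layout in :ref:`frontier-nodes`, not the actual ``numa_api`` implementation (which uses ``numactl``); the affinity query here uses Python's ``os.sched_getaffinity`` instead.

.. code-block:: python

   import os

   # Each 8-core L3 region of a Frontier node is attached to one GCD
   # (illustrative table following the node diagram; not from numa_api).
   CORE_RANGE_TO_GCD = {
       range(0, 8): 4,   range(8, 16): 5,    # NUMA 0
       range(16, 24): 2, range(24, 32): 3,   # NUMA 1
       range(32, 40): 6, range(40, 48): 7,   # NUMA 2
       range(48, 56): 0, range(56, 64): 1,   # NUMA 3
   }

   def suggested_gcd(pid=0):
       """Return the GCD sharing an L3/NUMA region with the process's cores."""
       cores = os.sched_getaffinity(pid)        # e.g. {41, 42, ..., 47}
       first_core = min(cores)
       for core_range, gcd in CORE_RANGE_TO_GCD.items():
           if first_core in core_range:
               return gcd
       raise RuntimeError(f"core {first_core} is not in any known L3 region")

   if __name__ == "__main__":
       pid = os.getpid()
       cores = " ".join(str(c) for c in sorted(os.sched_getaffinity(pid)))
       print(f"core affinity for PID {pid}: {cores}")
       print(f"Suggested GPU for PID {pid}: {suggested_gcd(pid)}")

Once a rank knows its suggested GPU, it can pin itself explicitly, e.g. with ``torch.cuda.set_device(suggested_gcd(os.getpid()))`` in a ROCm build of PyTorch, instead of relying on the launcher's default placement.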
Contributor

This example is incredibly helpful in describing the dilemma.
