@ashesh2512

Distributed PyTorch launchers such as ``torchrun`` often lack the flexibility to map tasks to GPUs according to the NUMA domains of Frontier, and users have often reported subpar performance when using these launchers. Consequently, it is useful to know whether a distributed PyTorch program is making optimal use of the node's resources. ``numa_api`` leverages the ``numactl`` library to identify which cores a process is bound to, as well as the GPU associated with those cores for optimal binding. This is useful because users typically set the GPU IDs manually in frameworks like PyTorch and TensorFlow.

.. code-block::

   core affinity for PID 1132009: 41 42 43 44 45 46 47
   Suggested GPU for PID 1132009: 7
In contrast, using ``srun --gpus-per-task=8 --gpu-bind=closest torchrun --nproc_per_node=8 --nnodes=1 --rdzv-id=$SLURM_JOBID --rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:3440`` results in the following output: here, ``torchrun`` launches every task on the same core, resulting in subpar performance. Note that the GPU suggestions printed by ``numa_api`` are based on the NUMA regions associated with the cores, as shown in :ref:`frontier-nodes`.
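To make the mapping concrete, below is a minimal sketch of the idea behind the suggestion: query the cores a process is bound to and look up the GCD attached to the same L3/NUMA region. The core-to-GCD table and the ``suggested_gcd`` helper are illustrative assumptions based on the node layout in :ref:`frontier-nodes`, not the actual ``numa_api`` implementation (which uses ``numactl``); the affinity query here uses Python's ``os.sched_getaffinity`` instead.

.. code-block:: python

   import os

   # Each 8-core L3 region of a Frontier node is attached to one GCD
   # (illustrative table following the node diagram; not from numa_api).
   CORE_RANGE_TO_GCD = {
       range(0, 8): 4,   range(8, 16): 5,    # NUMA 0
       range(16, 24): 2, range(24, 32): 3,   # NUMA 1
       range(32, 40): 6, range(40, 48): 7,   # NUMA 2
       range(48, 56): 0, range(56, 64): 1,   # NUMA 3
   }

   def suggested_gcd(pid=0):
       """Return the GCD sharing an L3/NUMA region with the process's cores."""
       cores = os.sched_getaffinity(pid)        # e.g. {41, 42, ..., 47}
       first_core = min(cores)
       for core_range, gcd in CORE_RANGE_TO_GCD.items():
           if first_core in core_range:
               return gcd
       raise RuntimeError(f"core {first_core} is not in any known L3 region")

   if __name__ == "__main__":
       pid = os.getpid()
       cores = " ".join(str(c) for c in sorted(os.sched_getaffinity(pid)))
       print(f"core affinity for PID {pid}: {cores}")
       print(f"Suggested GPU for PID {pid}: {suggested_gcd(pid)}")

Once a rank knows its suggested GPU, it can pin itself explicitly, e.g. with ``torch.cuda.set_device(suggested_gcd(os.getpid()))`` in a ROCm build of PyTorch, instead of relying on the launcher's default placement.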
Contributor

This example is incredibly helpful in describing the dilemma.
