Hi, first of all, thanks for the quick response. I found that the examples and their generated .xml algorithms do have a significant impact on system performance. I then went one step further and tested custom auto-generated algorithms. The following example script works well for allgather:
from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os
topology = dgx_a100()
pprint(topology.links)
collective = allgather(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")
Output
[[0, 12, 12, 12, 12, 12, 12, 12],
[12, 0, 12, 12, 12, 12, 12, 12],
[12, 12, 0, 12, 12, 12, 12, 12],
[12, 12, 12, 0, 12, 12, 12, 12],
[12, 12, 12, 12, 0, 12, 12, 12],
[12, 12, 12, 12, 12, 0, 12, 12],
[12, 12, 12, 12, 12, 12, 0, 12],
[12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (0.4s)
Wrote to test.xml
However, the problem appears when I change allgather to allreduce, as below:
from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os
topology = dgx_a100()
pprint(topology.links)
collective = allreduce(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")
Output
[[0, 12, 12, 12, 12, 12, 12, 12],
[12, 0, 12, 12, 12, 12, 12, 12],
[12, 12, 0, 12, 12, 12, 12, 12],
[12, 12, 12, 0, 12, 12, 12, 12],
[12, 12, 12, 12, 0, 12, 12, 12],
[12, 12, 12, 12, 12, 0, 12, 12],
[12, 12, 12, 12, 12, 12, 0, 12],
[12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (1.4s)
Traceback (most recent call last):
File "/miniconda3/envs/py38/bin/msccl", line 33, in <module>
sys.exit(load_entry_point('msccl==2.3.0', 'console_scripts', 'msccl')())
File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/__main__.py", line 34, in main
File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/cli/ncclize.py", line 29, in handle
File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/ncclize.py", line 548, in ncclize
RuntimeError: Encountered receive and send on the same buffer index on step 1 (gpu=5, buf=i, off=0)
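Since os.system does not raise when the command fails, the script above keeps going even if no test.xml was written. Below is a minimal sketch of how the failure could be detected from the script itself, using only the standard library. The helper name run_and_report is hypothetical (it is not part of msccl), and the demo uses a stand-in Python command so that it runs without msccl installed.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit_code, last_stderr_line).

    Hypothetical helper for illustration; it is not part of msccl.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    stderr_lines = result.stderr.strip().splitlines()
    # The last stderr line of a Python traceback is the exception message.
    last_line = stderr_lines[-1] if stderr_lines else ""
    return result.returncode, last_line


if __name__ == "__main__":
    # Stand-in command that fails with a RuntimeError, like the CLI call above.
    code, message = run_and_report(
        [sys.executable, "-c", "raise RuntimeError('example failure')"]
    )
    if code != 0:
        print(f"command failed with exit code {code}: {message}")
```

In my script, the same helper would be called with ["msccl", "ncclize", "-f", "data.json", "-o", "test.xml"] in place of the stand-in command.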
Could you please help check and resolve this issue so that I can use the generated .xml? Thanks in advance.