Problem generating XML for allreduce #56

@azharlightelligence

Hi, first of all, thanks for the quick response. I found that the examples and their generated .xml algorithms do have a significant impact on system performance. I moved one step further to test custom auto-generated algorithms. The following example script works well for allgather:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allgather(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (0.4s)
Wrote to test.xml

The problem appears when I change allgather to allreduce, as below:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allreduce(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (1.4s)
Traceback (most recent call last):
  File "/miniconda3/envs/py38/bin/msccl", line 33, in <module>
    sys.exit(load_entry_point('msccl==2.3.0', 'console_scripts', 'msccl')())
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/__main__.py", line 34, in main
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/cli/ncclize.py", line 29, in handle
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/ncclize.py", line 548, in ncclize
RuntimeError: Encountered receive and send on the same buffer index on step 1 (gpu=5, buf=i, off=0)
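
To make the condition in the error message concrete, here is a minimal sketch of the kind of check it seems to describe: within one step, the same GPU both sends from and receives into the same buffer slot. This is not msccl's code. The `(step, src, dst, buf, off)` tuple layout, the function name `find_send_recv_conflicts`, and the sample schedule are all hypothetical, and the sketch assumes a transfer uses the same buffer and offset on both ends.

```python
from collections import defaultdict


def find_send_recv_conflicts(transfers):
    """Return sorted (step, gpu, buf, off) slots that one GPU both sends
    from and receives into during the same step.

    `transfers` is a list of hypothetical (step, src, dst, buf, off) tuples.
    """
    sent = defaultdict(set)      # step -> {(gpu, buf, off)} read by a send
    received = defaultdict(set)  # step -> {(gpu, buf, off)} written by a receive
    for step, src, dst, buf, off in transfers:
        sent[step].add((src, buf, off))
        received[step].add((dst, buf, off))

    conflicts = []
    for step in sorted(sent):
        # A slot present in both sets is sent and received in the same step.
        for gpu, buf, off in sorted(sent[step] & received[step]):
            conflicts.append((step, gpu, buf, off))
    return conflicts


# Made-up schedule: on step 1, GPU 5 sends input slot 0 to GPU 6 while
# GPU 4 writes into that same slot on GPU 5.
transfers = [
    (0, 4, 5, "o", 0),
    (1, 5, 6, "i", 0),
    (1, 4, 5, "i", 0),
]
print(find_send_recv_conflicts(transfers))
```

With the made-up schedule above, the only reported slot is step 1, GPU 5, buffer `i`, offset 0, which mirrors the shape of the message in the traceback. Whether the synthesized allreduce actually hits this exact pattern is something I have not verified.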

Could you please help check and resolve this issue so that I can generate and use the .xml for allreduce? Thanks in advance.
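
A side note on the script itself: it ignores the return value of `os.system`, so it carries on as if the conversion had succeeded even when `msccl ncclize` fails. Below is a minimal sketch of a wrapper that captures the exit code and stderr instead. The helper name `run_cli` is hypothetical, and the command in the example is a stand-in so the sketch runs without msccl installed.

```python
import subprocess
import sys


def run_cli(cmd):
    """Run a command given as a list of arguments.

    Returns (exit_code, stderr_text) so the caller can decide how to react.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


# Stand-in command that fails on purpose. In the script above, the call
# would instead be:
#   run_cli(["msccl", "ncclize", "-f", "data.json", "-o", "test.xml"])
code, err = run_cli(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
)
if code != 0:
    print(f"command failed with exit code {code}: {err}")
```

This does not fix the allreduce failure. It only makes the failure visible to the calling script, so that a stale or missing test.xml is not used by mistake.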
