torchrun support #396
Replies: 2 comments 4 replies
Thanks, AFAIK we haven't really investigated multi-node running so far, as we don't generally have such nodes available for testing.
@hahahannes I wanted to circle back on this. The reason I'm asking is that I'm now interested in doing multi-node training myself on AMD GPUs at the LUMI HPC, and Ray apparently is not that well supported on AMD.
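Since LUMI is a Slurm cluster, here is a rough sketch of how a multi-node torchrun launch could look there. To be clear, the resource numbers, partition setup, and the `train.py` entry point are placeholders, not actual MLPF or LUMI settings:

```bash
#!/bin/bash
# Hypothetical Slurm batch script for a multi-node torchrun launch.
# One srun task per node; torchrun then spawns one worker per GPU.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Rendezvous on the first allocated node.
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_endpoint="$MASTER_NODE":29500 \
    train.py
```

On ROCm, PyTorch's `nccl` backend is backed by RCCL, so the same DDP code path should work on AMD GPUs without changes.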
I am trying to run MLPF with torchrun for multi-node, multi-GPU training, and I needed to make some changes. I am following the guide at https://pytorch.org/tutorials/beginner/ddp_series_fault_tolerance.html.
Let me know if that might be interesting for you. I am tracking my changes in my fork; there is one commit with the changes so far. I added a `use-torchrun` argument, but I think torchrun could also replace the `mp.spawn` part entirely. A sketch of what that would look like is below.
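For reference, a minimal sketch of a torchrun-compatible entry point, following the linked tutorial: torchrun spawns the worker processes itself and passes rank information through environment variables, so the explicit `mp.spawn` call and manual rank bookkeeping go away. The `train` function here is just a stand-in for the actual MLPF training loop, not code from my fork:

```python
import os

import torch
import torch.distributed as dist


def train(rank: int, world_size: int) -> None:
    # Stand-in for the real training loop: model construction,
    # DistributedDataParallel wrapping, epochs, checkpointing, ...
    ...


def main() -> None:
    # torchrun sets these variables for every worker it spawns,
    # which is why mp.spawn and manual rank handling are not needed.
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # With the default env:// init, MASTER_ADDR/MASTER_PORT are also
    # read from the environment that torchrun prepared.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    train(rank, world_size)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a single node this would be launched with something like (script name is a placeholder):

```bash
torchrun --standalone --nproc_per_node=4 train.py
```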