Conversation
Hi @t-ramz, regarding `MPI_ABORT`: if you have specific ideas, let me know. We often have trouble with this anyway; for example, if one of the ranks is killed with signal 9 / 15 or something similar, the other ranks often don't die, etc. Best,
Hi Alex, I believe I ran into the same `MPI_ABORT` issue. One way around this, which I discussed with a colleague, may be to create a secondary MPI communicator and use that communicator for the `dft_exe` runs. Thanks,
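For reference, a minimal mpi4py sketch of the secondary-communicator idea; the variable names here (`dft_comm`, etc.) are illustrative and not taken from the solid_dmft code:

```python
import socket
from mpi4py import MPI

world = MPI.COMM_WORLD       # solid_dmft keeps using the world communicator as before
dft_comm = world.Dup()       # private duplicate reserved for the DFT-related steps

# e.g. gather hostnames on the duplicate instead of on world
hostnames = dft_comm.gather(socket.gethostname(), root=0)

# ... launch / manage dft_exe using dft_comm here ...

dft_comm.Free()              # release the duplicate once the DFT step is done
```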
Yes, forking and spawning a subprocess is not super elegant and a bit problematic for some HPC configs. Maybe it is worth exploring spawning another communicator, but maybe that is not in the scope of this PR? If you can comment on the questions mentioned above, we can merge this soon. Best,
```python
hostnames = mpi.world.gather(socket.gethostname(), root=0)
if cluster_name == 'slurm':
    slurm_hostnames = [hostname.split('.')[0] for hostname in hostnames]  # TODO: please find a better solution
```
Can you enlighten me why this extra part is necessary in this case?
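For context, a likely reason (an assumption here, not confirmed against the code) is that `socket.gethostname()` may return a fully qualified domain name while slurm node lists typically use the short host name:

```python
import socket

# socket.gethostname() can return an FQDN such as 'node001.cluster.example.com'
# (a made-up example), while slurm node names, e.g. in SLURM_JOB_NODELIST,
# are usually the short form 'node001'. Splitting on '.' makes them comparable.
fqdn = 'node001.cluster.example.com'
short_name = fqdn.split('.')[0]
assert short_name == 'node001'
```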
```python
if mpi_profile == 'slurm':
    return [
        mpi_exe, '-n', str(number_cores), '--export=PATH',
```
Should we enforce here that `mpi_exe` is 'srun'?
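If that enforcement is wanted, a small sketch of what it could look like, built on the excerpt above rather than on the project's actual code:

```python
if mpi_profile == 'slurm':
    # The flags below are srun-specific, so fail early if another launcher
    # (e.g. mpirun) was configured together with the slurm profile.
    if mpi_exe != 'srun':
        raise ValueError(
            f"mpi_profile 'slurm' expects mpi_exe to be 'srun', got '{mpi_exe}'"
        )
    return [
        mpi_exe, '-n', str(number_cores), '--export=PATH',
        # ... remaining srun arguments unchanged
    ]
```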
Sorry, I just remembered that this PR was still open. Do you see my last two comments? Can you enlighten me why this extra part is necessary in this case? Should we enforce here that `mpi_exe` is 'srun'? If we discuss these two, then we can merge this.
I have a user at ORNL's OLCF who has requested to use solid_dmft on one of our resources, and I have had trouble making it run with the slurm workload manager. I made a few changes here that help alleviate the problem and make it "work," but there remains an issue where the submitted sub-jobs cause an MPI error that triggers an `MPI_ABORT`; that may be beyond the scope of this software.

Overall, I think if you want to include slurm-only support out of the box, there may need to be a switch or feature in place to divvy up tasks from the host MPI process and allocate them to VASP/`dft_exe` accordingly. E.g. get all ranks up front running `solid_dmft` and, when running `dft_exe`, provide explicit rank mappings in some way. Take this with a grain of salt, I don't know your codebase.
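To make that last point concrete, here is a rough sketch; the helper name and call structure are hypothetical and not based on how solid_dmft actually launches `dft_exe`:

```python
def build_dft_srun_cmd(dft_exe, dft_ranks, short_hostnames):
    """Build an srun command that pins the dft_exe sub-job to specific nodes.

    short_hostnames: short node names gathered from the ranks reserved for
    the DFT step (e.g. via the mpi.world.gather shown above).
    """
    nodelist = ','.join(sorted(set(short_hostnames)))
    return [
        'srun',
        '-n', str(dft_ranks),   # number of MPI tasks for dft_exe
        '-w', nodelist,         # restrict the job step to these nodes
        '--export=ALL',         # pass the environment through to the step
        dft_exe,
    ]

# Example with made-up values:
print(' '.join(build_dft_srun_cmd('vasp_std', 64, ['node001', 'node002'])))
```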
Let me know if you have thoughts or questions and I'll do my best to get back to you!