MPI Deadlock when removing a scaffold #187

@drodarie

How to reproduce

On branch docs/yaml-examples:

Run unittest.test_selectors.test_nm_selector_wrong_name in parallel (2 nodes), after removing the skip_parallel decorator.
The test itself passes without issue, but the run hangs when it reaches the tearDownClass method, which deletes the storage.
For some reason, the main process passes all the MPI barriers without any trouble, but the second process gets stuck at this barrier:

self.comm.barrier()

I don't know why this happens only for this unit test.
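
One common cause of this symptom, not confirmed for this issue, is that the processes do not call barrier() the same number of times: the process that calls it once more waits forever. The sketch below only simulates that situation with Python threads and `threading.Barrier` standing in for MPI ranks and the communicator; it does not use MPI, and `run_ranks` is a hypothetical helper name.

```python
import threading


def run_ranks(barriers_per_rank, timeout=0.5):
    """Simulate MPI ranks with threads.

    barriers_per_rank[i] is how many times simulated rank i calls the barrier.
    Returns a dict mapping rank -> "done" or "stuck".
    """
    barrier = threading.Barrier(len(barriers_per_rank))
    status = {}

    def rank_main(rank, count):
        try:
            for _ in range(count):
                # A real MPI barrier has no timeout; the timeout here only
                # lets the simulation report the hang instead of blocking.
                barrier.wait(timeout=timeout)
            status[rank] = "done"
        except threading.BrokenBarrierError:
            status[rank] = "stuck"

    threads = [
        threading.Thread(target=rank_main, args=(rank, count))
        for rank, count in enumerate(barriers_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    # Rank 0 calls the barrier once, rank 1 calls it twice.
    print(run_ranks([1, 2]))
```

With mismatched counts, the simulated rank 0 finishes while rank 1 is reported as stuck, which matches the pattern described above. Counting the barrier calls made by each process during teardown may help confirm or rule out this explanation.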

Attempts to fix the issue

I thought this might be caused by the morphologies remaining in the storage even after the NeuroMorphoSelector is replaced, so I added a NeuroMorphoSelector.__unboot__ function:

    def __unboot__(self):
        if self.scaffold.is_main_process():
            for name in self.names:
                try:
                    self.scaffold.morphologies.remove(name)
                except MorphologyRepositoryError:
                    pass  # morphology was not saved in the scaffold.
        self.scaffold._comm.barrier()

For this function to work, I had to move the run_hook(node, "unboot") line in _unset_nodes to the top of the loop, so that the hook runs while node.scaffold is still set:

run_hook(node, "unboot")

def _unset_nodes(top_node):
    for node in walk_nodes(top_node):
        run_hook(node, "unboot")
        with contextlib.suppress(Exception):
            del node.scaffold
        node._config_parent = None
        node._config_key = None
        if hasattr(node, "_config_index"):
            node._config_index = None

This change works well (it removes the morphologies as expected), but it did not prevent the unit test from getting stuck. I still find it cleaner to remove the morphologies obtained through the selector once the selector is deleted.
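
A side note on the `__unboot__` function above: it only catches MorphologyRepositoryError, so any other exception raised on the main process would skip the final barrier and leave the other processes waiting. A minimal sketch of one way to guard against that is shown below, using try/finally. `CountingComm` and `main_only_then_barrier` are hypothetical names for illustration and are not part of the scaffold API; this is not claimed to be the cause of the hang reported here.

```python
class CountingComm:
    """Hypothetical stand-in for an MPI communicator; it only counts barrier() calls."""

    def __init__(self):
        self.barrier_calls = 0

    def barrier(self):
        self.barrier_calls += 1


def main_only_then_barrier(comm, is_main, work):
    """Run `work` on the main process only, then reach the barrier on every process.

    The finally clause ensures the barrier is called even if `work` raises,
    so the other processes are not left waiting for the main process.
    """
    try:
        if is_main:
            work()
    finally:
        comm.barrier()


def failing_work():
    raise RuntimeError("unexpected error on the main process")


if __name__ == "__main__":
    comm = CountingComm()
    try:
        main_only_then_barrier(comm, True, failing_work)
    except RuntimeError:
        pass
    # The barrier was still reached despite the exception.
    print(comm.barrier_calls)
```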

Labels: bsb-hdf5 (auto-created by migration script)