Description
If a batch job submitted to a scheduler on an HPC machine times out, the state of its running experiments is not updated from running to timeout. As a result, the status reporting engine keeps reporting those experiments as running, even though the job has exited.
We need to catch the termination signal sent to the master Savanna process (signal.SIGUSR2, sent by the scheduler on Summit) so that it can clean up and leave the campaign in a consistent state. It looks like a process gets a few seconds to clean up after receiving the initial signal, which should be enough.