Catch termination signals in Savanna to clean up #226

@kshitij-v-mehta

Description

If a batch job submitted to a scheduler on an HPC machine times out, the state of its running experiments is not updated from running to timeout. The status reporting engine then keeps reporting those experiments as running, even though the job has exited.

We need to catch the termination signal sent to the master Savanna process (signal.SIGUSR2, passed by the scheduler on Summit) so that it can clean up and leave the campaign in a consistent state. It looks like a process gets a few seconds to clean up after receiving the initial signal, which should be enough.
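
A minimal sketch of the idea, not Savanna's actual code: the names `run_states`, `JobTimeout`, `mark_running_as_timeout`, `install_handlers`, and `run_campaign` are all hypothetical, and the in-memory dict stands in for wherever the campaign really records run status. The handler itself does as little as possible and raises an exception, so that the cleanup runs in normal control flow in the main thread. It assumes a Unix system, since `signal.SIGUSR2` is not available on Windows.

```python
import signal

# Hypothetical stand-in for the campaign's per-run status records.
run_states = {"run-0": "done", "run-1": "running", "run-2": "running"}


class JobTimeout(Exception):
    """Raised in the main thread when a termination signal arrives."""


def mark_running_as_timeout(states):
    """Flip every run still marked 'running' to 'timeout'.

    Returns the names of the runs that were changed.
    """
    changed = [name for name, state in states.items() if state == "running"]
    for name in changed:
        states[name] = "timeout"
    return changed


def _on_terminate(signum, frame):
    # Keep the handler small: raise, and let the main loop do the cleanup.
    raise JobTimeout(signal.Signals(signum).name)


def install_handlers(signals=(signal.SIGUSR2, signal.SIGTERM)):
    """Register the handler; must be called from the main thread."""
    for sig in signals:
        signal.signal(sig, _on_terminate)


def run_campaign(work, states):
    """Run `work()` and leave `states` consistent if the job is terminated."""
    install_handlers()
    try:
        work()
    except JobTimeout:
        mark_running_as_timeout(states)
        # Assumption: real code would also persist the updated states to
        # disk here, within the few seconds before the job is killed.
    return states
```

A caller would wrap the master process's main loop, for example `run_campaign(main_loop, run_states)`, where `main_loop` is whatever function currently launches and waits on the experiments. Raising from the handler rather than cleaning up inside it keeps the cleanup code out of signal context, at the cost of interrupting whatever the main thread was doing at that moment.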
