Description
Symptoms:
By default, Phoenix is configured not to reboot suspected nodes that still have jobs running. It does this by excluding any node that has a resource in the CURRENT state in the assigned_resources table. We noticed that our Phoenix instance keeps ignoring some nodes that no longer have any jobs running on them.
The suspected bug:
A close look at our OAR database revealed that, for at least one job, we had the following error:
2020-05-24 00:02:14> EXIT_VALUE_OAREXEC:[bipbip 36324341] error of oarexec, exit value = 61; the job 36324341 is in Error and the node luke17 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started
Phoenix never rebooted the luke17 node after this date.
We found that the corresponding resource was still in the CURRENT state in the assigned_resources table, while the matching moldable job entry was already in the LOG state:
 moldable_job_id | resource_id | assigned_resource_index
-----------------+-------------+-------------------------
        36324736 |         391 | CURRENT

 moldable_id | moldable_job_id | moldable_walltime | moldable_index
-------------+-----------------+-------------------+----------------
    36324736 |        36324341 |              3600 | LOG
Removing the inconsistent CURRENT entry solved the problem.
So perhaps the EXIT_VALUE_OAREXEC error path, hit when launching a job, does not switch the CURRENT entry to LOG in assigned_resources?
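To check whether other resources are affected, something like the following sketch could be used to list assigned_resources rows that are still CURRENT while their moldable job entry is already LOG. It is only an illustration: it runs against an in-memory SQLite database rather than the real OAR database, the table name `moldable_job_descriptions` is an assumption (the second table is not named above), and the function names and the second demo row are made up for the example.

```python
import sqlite3

# Hypothetical query, not taken from OAR or phoenix. It assumes the second
# table shown above is called moldable_job_descriptions.
FIND_STALE = """
SELECT ar.moldable_job_id, ar.resource_id, m.moldable_job_id AS job_id
FROM assigned_resources AS ar
JOIN moldable_job_descriptions AS m
  ON m.moldable_id = ar.moldable_job_id
WHERE ar.assigned_resource_index = 'CURRENT'
  AND m.moldable_index = 'LOG'
ORDER BY ar.moldable_job_id, ar.resource_id
"""


def find_stale_assignments(conn):
    """Return (moldable_id, resource_id, job_id) rows that look inconsistent."""
    return conn.execute(FIND_STALE).fetchall()


def build_demo_db():
    """Build an in-memory stand-in holding only the columns shown in this issue."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE assigned_resources (
            moldable_job_id INTEGER,
            resource_id INTEGER,
            assigned_resource_index TEXT
        );
        CREATE TABLE moldable_job_descriptions (
            moldable_id INTEGER,
            moldable_job_id INTEGER,
            moldable_walltime INTEGER,
            moldable_index TEXT
        );
        -- Row reported in this issue: resource still CURRENT, job already LOG.
        INSERT INTO assigned_resources VALUES (36324736, 391, 'CURRENT');
        INSERT INTO moldable_job_descriptions
            VALUES (36324736, 36324341, 3600, 'LOG');
        -- Invented row for a job that is really running (CURRENT on both sides).
        INSERT INTO assigned_resources VALUES (1, 2, 'CURRENT');
        INSERT INTO moldable_job_descriptions VALUES (1, 1, 3600, 'CURRENT');
        """
    )
    return conn


if __name__ == "__main__":
    for moldable_id, resource_id, job_id in find_stale_assignments(build_demo_db()):
        print(f"job {job_id}: resource {resource_id} still CURRENT "
              f"(moldable id {moldable_id})")
```

On a real installation the same SELECT would have to be adapted to the actual schema and run with the database's own client; any row it returns would be a candidate for the manual cleanup described above.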