Table "assigned_resources" may be inconsistent, leading to phoenix ignoring some nodes #177

@bzizou


Symptoms:
Phoenix is configured by default not to reboot suspected nodes that still have jobs running. It does this by excluding nodes that have a resource in the CURRENT state in the assigned_resources table. We noticed that our phoenix instance always ignores some nodes that no longer have any jobs running on them.

The suspected bug:
A close look at our OAR database revealed that, for at least one job, we had the following error:
2020-05-24 00:02:14> EXIT_VALUE_OAREXEC:[bipbip 36324341] error of oarexec, exit value = 61; the job 36324341 is in Error and the node luke17 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started
The luke17 node was never rebooted by phoenix after this date.
We found that the corresponding resource was still in the CURRENT state in the assigned_resources table, although the matching moldable job entry was already in the LOG state:

 moldable_job_id | resource_id | assigned_resource_index
-----------------+-------------+-------------------------
        36324736 |         391 | CURRENT

 moldable_id | moldable_job_id | moldable_walltime | moldable_index
-------------+-----------------+-------------------+----------------
    36324736 |        36324341 |              3600 | LOG

Removing the inconsistent CURRENT entry solved the problem.
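
To look for other entries of this kind, a check along the following lines might help. This is only a sketch under several assumptions: it uses an in-memory SQLite database rather than the real OAR database, the schema is reduced to the columns shown in the output above, the second table's name (moldable_job_descriptions) is assumed because the output above does not name it, and the helper names are hypothetical. It treats a row as suspicious when assigned_resources still says CURRENT while the matching moldable entry is already LOG.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE assigned_resources (
            moldable_job_id INTEGER,
            resource_id INTEGER,
            assigned_resource_index TEXT
        );
        -- Table name is an assumption; columns are those shown in the issue.
        CREATE TABLE moldable_job_descriptions (
            moldable_id INTEGER,
            moldable_job_id INTEGER,
            moldable_walltime INTEGER,
            moldable_index TEXT
        );
        """
    )
    # Rows reported in the issue (the inconsistent pair).
    conn.execute("INSERT INTO assigned_resources VALUES (36324736, 391, 'CURRENT')")
    conn.execute(
        "INSERT INTO moldable_job_descriptions VALUES (36324736, 36324341, 3600, 'LOG')"
    )
    # Hypothetical consistent pair, added only so the filter has something to skip.
    conn.execute("INSERT INTO assigned_resources VALUES (1, 10, 'CURRENT')")
    conn.execute("INSERT INTO moldable_job_descriptions VALUES (1, 100, 3600, 'CURRENT')")
    return conn


def find_stale_current_rows(conn):
    """Return (resource_id, moldable_id, job_id) for CURRENT rows whose
    moldable entry is already LOG."""
    query = """
        SELECT ar.resource_id, ar.moldable_job_id, mjd.moldable_job_id
        FROM assigned_resources AS ar
        JOIN moldable_job_descriptions AS mjd
          ON mjd.moldable_id = ar.moldable_job_id
        WHERE ar.assigned_resource_index = 'CURRENT'
          AND mjd.moldable_index = 'LOG'
        ORDER BY ar.resource_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in find_stale_current_rows(build_demo_db()):
        print(row)
```

The query only reads data; whether a reported row can safely be removed or switched to LOG still has to be checked against the actual job state.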

So maybe the "EXIT_VALUE_OAREXEC" case, when launching a job, does not switch the CURRENT entry to LOG in assigned_resources?
