Table "assigned_resources" may be inconsistent, leading to phoenix ignoring some nodes #177

bzizou · 2020-07-03T08:19:25Z

Symptoms:
Phoenix is configured by default not to reboot suspected nodes that still have jobs running. This is done by excluding nodes that have a resource in the CURRENT state in the assigned_resources table. We noticed that our phoenix instance always ignores some nodes that no longer have any jobs running on them.

The suspected bug:
A close look at our OAR database revealed that, for at least one job, we had the following error:
2020-05-24 00:02:14> EXIT_VALUE_OAREXEC:[bipbip 36324341] error of oarexec, exit value = 61; the job 36324341 is in Error and the node luke17 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started
The luke17 node was never rebooted by phoenix after this date.
And we found that the corresponding resource was still in the CURRENT state in the assigned_resources table, while the moldable job entry was already in LOG:

 moldable_job_id | resource_id | assigned_resource_index
-----------------+-------------+-------------------------
        36324736 |         391 | CURRENT

 moldable_id | moldable_job_id | moldable_walltime | moldable_index
-------------+-----------------+-------------------+----------------
    36324736 |        36324341 |              3600 | LOG

Removing the inconsistent CURRENT entry solved the problem.
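
To look for other entries of this kind, a check along the following lines might help. This is only a sketch: it uses an in-memory SQLite database so that it runs on its own, whereas a real OAR database would be queried through its own driver. The name `moldable_job_descriptions` for the second table is an assumption (the output above does not show it), and `find_stale_current_rows` and `build_demo_db` are hypothetical helper names. The rows loaded are the two shown above.

```python
import sqlite3


def find_stale_current_rows(conn):
    """Return (moldable_id, resource_id, job_id) for assigned_resources rows
    that are still CURRENT while the matching moldable entry is already LOG."""
    query = """
        SELECT ar.moldable_job_id, ar.resource_id, mjd.moldable_job_id
        FROM assigned_resources AS ar
        JOIN moldable_job_descriptions AS mjd  -- table name is an assumption
          ON mjd.moldable_id = ar.moldable_job_id
        WHERE ar.assigned_resource_index = 'CURRENT'
          AND mjd.moldable_index = 'LOG'
        ORDER BY ar.moldable_job_id, ar.resource_id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build an in-memory database holding only the two rows from this report."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assigned_resources (
            moldable_job_id INTEGER,
            resource_id INTEGER,
            assigned_resource_index TEXT
        );
        CREATE TABLE moldable_job_descriptions (
            moldable_id INTEGER,
            moldable_job_id INTEGER,
            moldable_walltime INTEGER,
            moldable_index TEXT
        );
        INSERT INTO assigned_resources VALUES (36324736, 391, 'CURRENT');
        INSERT INTO moldable_job_descriptions VALUES (36324736, 36324341, 3600, 'LOG');
    """)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for moldable_id, resource_id, job_id in find_stale_current_rows(conn):
        print(f"job {job_id}: resource {resource_id} still CURRENT (moldable {moldable_id})")
```

The join condition assumes that assigned_resources.moldable_job_id refers to moldable_id in the second table, which is what the matching value 36324736 in the two outputs above suggests.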

So maybe the "EXIT_VALUE_OAREXEC" case, when launching a job, does not switch the CURRENT entry to LOG in assigned_resources?


npf added the 2.5 and question labels on Dec 5, 2020
