Symptoms:

Phoenix is configured by default not to reboot suspected nodes that still have jobs running. This is done by excluding nodes that have a resource in the `CURRENT` state in the `assigned_resources` table. We noticed that our phoenix instance always ignores some nodes that no longer have any jobs running on them.

The suspected bug:

A close look at our OAR database revealed that, for at least one job, we had this error:

`2020-05-24 00:02:14> EXIT_VALUE_OAREXEC:[bipbip 36324341] error of oarexec, exit value = 61; the job 36324341 is in Error and the node luke17 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started`

The `luke17` node was never rebooted by phoenix after this date. We also found that the corresponding resource was still in the `CURRENT` state in the `assigned_resources` table. Removing the inconsistent `CURRENT` entry solved the problem.

So maybe the "EXIT_VALUE_OAREXEC" case when launching a job does not switch the `CURRENT` entry to `LOG` in `assigned_resources`?
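For anyone who wants to check whether they are hitting the same inconsistency, here is a rough sketch of the kind of query that would surface it. It runs against an in-memory SQLite database with a deliberately simplified schema, so the table layout, the column names (`assigned_moldable_job`, `moldable_job_id`, `assigned_resource_index`), the list of finished states, and the resource id `17` are all assumptions made for illustration; adapt them to your real OAR database before relying on the result.

```python
import sqlite3

# Simplified, ASSUMED stand-in for the OAR tables. Column names and types
# are illustrative and may not match a real OAR installation.
SCHEMA = """
CREATE TABLE jobs (
    job_id INTEGER PRIMARY KEY,
    state TEXT,
    assigned_moldable_job INTEGER
);
CREATE TABLE assigned_resources (
    moldable_job_id INTEGER,
    resource_id INTEGER,
    assigned_resource_index TEXT  -- 'CURRENT' or 'LOG'
);
"""

# ASSUMPTION: job states in which no resource should still be 'CURRENT'.
FINISHED_STATES = ("Error", "Terminated")


def find_stale_current(conn):
    """Return (job_id, resource_id) pairs still 'CURRENT' for finished jobs."""
    placeholders = ",".join("?" for _ in FINISHED_STATES)
    query = f"""
        SELECT j.job_id, a.resource_id
        FROM assigned_resources AS a
        JOIN jobs AS j ON j.assigned_moldable_job = a.moldable_job_id
        WHERE a.assigned_resource_index = 'CURRENT'
          AND j.state IN ({placeholders})
        ORDER BY j.job_id, a.resource_id
    """
    return conn.execute(query, FINISHED_STATES).fetchall()


def build_demo():
    """Build a tiny in-memory database reproducing the situation described above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [
            (36324341, "Error", 1),  # the job from the log line above
            (2, "Running", 2),       # hypothetical job that is still running
            (3, "Terminated", 3),    # hypothetical job that was cleaned up
        ],
    )
    conn.executemany(
        "INSERT INTO assigned_resources VALUES (?, ?, ?)",
        [
            (1, 17, "CURRENT"),  # stale: job is in Error, entry never moved to LOG
            (2, 18, "CURRENT"),  # legitimate: job is still running
            (3, 19, "LOG"),      # correctly archived
        ],
    )
    return conn


if __name__ == "__main__":
    for job_id, resource_id in find_stale_current(build_demo()):
        print(f"job {job_id} still holds resource {resource_id} as CURRENT")
```

Only the entry belonging to the job in `Error` is reported; the running job's `CURRENT` entry is left alone, which is the distinction phoenix needs in order to decide whether a suspected node can be rebooted.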