Stop deleting chained-to jobs which fail as orphaned jobs #4557
This reverts commit dce4610.
@glennhickey Do you want to see if this stops you from reproducing #4504? Or if a job getting chained to, and failing, and then being deleted as an orphan, is consistent with your full logs?
Thanks @adamnovak! I've been having trouble reproducing this issue, but will give it another go. If I can't get the cluster to kill it again, I'll just do it myself.
LGTM.
Looks like I actually added a bug here. In
@adamnovak I can confirm that this change does indeed fix my issue.
…over to the file store
It looks like our job description deepcopy never worked quite right. The test failed because of fighting over who is allowed to write to which copy of the job description. To fix that, I'm moving responsibility for snapshotting job descriptions for asynchronous commit out of the worker and into the file store that does the asynchronous committing. That means I'm now doing chained deepcopies, which is what didn't work before, but I think I have it all working now.
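As a rough illustration of the snapshotting idea in the comment above, here is a minimal sketch. It is not Toil's actual code: `SketchFileStore`, `start_commit`, and `wait` are hypothetical names, and a plain dict stands in for a job description. The file store takes its own deep copy before committing in the background, so the caller can keep mutating its copy without affecting what gets committed.

```python
import copy
import threading


class SketchFileStore:
    """Hypothetical file store that owns the snapshot for asynchronous commits."""

    def __init__(self):
        self.committed = None
        self._thread = None

    def start_commit(self, job_description):
        # Snapshot here, inside the file store, so the caller is free to
        # keep modifying its own copy while the commit runs.
        snapshot = copy.deepcopy(job_description)
        self._thread = threading.Thread(target=self._commit, args=(snapshot,))
        self._thread.start()

    def _commit(self, snapshot):
        # Stand-in for writing the snapshot to the job store.
        self.committed = snapshot

    def wait(self):
        # Block until the background commit has finished.
        if self._thread is not None:
            self._thread.join()


# A plain dict stands in for a job description (an assumption for illustration).
desc = {"id": "A", "merged_jobs": ["B"]}
store = SketchFileStore()
store.start_commit(desc)
desc["merged_jobs"].append("C")  # Later mutation by the caller.
store.wait()
```

Because the copy is deep, the later append to the caller's nested list does not leak into the committed snapshot.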
This will fix #4504 by improving the semantics of `jobsToDelete` (which is now `merged_jobs`, to make it clear that it has very different semantics than `filesToDelete`, which journals file deletions).

Previously, when a job was chained to, we would put it in `jobsToDelete` on the chaining-from job, and then delete it when that job finished successfully.

But we also at some point started cutting the successor relationship to the chained-to job, which meant that if the chained-to job failed, when we went through the job tree during the restart, we would find the chained-to job's original job description under its original ID as an orphaned job and delete it.
This adds code to treat the original chained-to job as more properly merged into the chained-from job, and so still reachable from the root.
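The reachability argument above can be sketched with a toy model. This is a simplified illustration, not Toil's implementation: the `JobDescription` dataclass, `reachable_ids`, `find_orphans`, and the dict-based store are all hypothetical stand-ins. Job `A` has chained to job `B`, so `A` has taken over `B`'s successor `C`, the `A -> B` successor edge has been cut, and `B` is recorded in `A`'s `merged_jobs`.

```python
from dataclasses import dataclass, field


@dataclass
class JobDescription:
    """Hypothetical, simplified stand-in for a stored job description."""
    job_id: str
    successors: list = field(default_factory=list)
    # IDs of jobs that were chained to and are now merged into this job.
    merged_jobs: list = field(default_factory=list)


def reachable_ids(store, root_id, follow_merged):
    """Return the set of job IDs reachable from the root of the job tree."""
    seen = set()
    stack = [root_id]
    while stack:
        job_id = stack.pop()
        if job_id in seen or job_id not in store:
            continue
        seen.add(job_id)
        desc = store[job_id]
        stack.extend(desc.successors)
        if follow_merged:
            # Treat merged (chained-to) jobs as still reachable.
            stack.extend(desc.merged_jobs)
    return seen


def find_orphans(store, root_id, follow_merged):
    """Return the sorted IDs of stored jobs not reachable from the root."""
    return sorted(set(store) - reachable_ids(store, root_id, follow_merged))


store = {
    "A": JobDescription("A", successors=["C"], merged_jobs=["B"]),
    "B": JobDescription("B", successors=["C"]),  # Original chained-to description.
    "C": JobDescription("C"),
}

orphans_before = find_orphans(store, "A", follow_merged=False)
orphans_after = find_orphans(store, "A", follow_merged=True)
```

In this model, ignoring `merged_jobs` leaves `B` unreachable, so a restart would treat it as an orphan. Following `merged_jobs` keeps it in the tree.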
I'm also making the `filesToDelete` system more sure about what it is actually supposed to be used for.

Changelog Entry
To be copied to the draft changelog by merger: