Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop deleting chained-to jobs which fail as orphaned jobs #4557

Merged
merged 15 commits into from
Aug 10, 2023

Conversation

adamnovak
Copy link
Member

This will fix #4504 by improving the semantics of jobsToDelete (which is now merged_jobs, to make it clear that it has very different semantics than filesToDelete, which journals file deletions).

Previously, when a job was chained to, we would put it in jobsToDelete on the chaining-from job, and then delete it when that job finished successfully.

But we also at some point started cutting the successor relationship to the chained-to job, which meant that if the chained-to job failed, when we went through the job tree during the restart, we would find the chained-to job's original job description under its original ID as an orphaned job and delete it.

This adds code to treat the original chained-to job as more properly merged into the chained-from job, and so still reachable from the root.

I'm also making the filesToDelete system more sure about what it is actually supposed to be used for.

Changelog Entry

To be copied to the draft changelog by merger:

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak
Copy link
Member Author

@glennhickey Do you want to see if this stops you from reproducing #4504? Or if a job getting chained to, and failing, and then being deleted as an orphan, is consistent with your full logs?

@glennhickey
Copy link
Contributor

Thanks @adamnovak ! I've been having trouble reproducing this issue, but will give it another go. If I can't get the cluster to kill it again, I'll just do it myself.

Copy link
Member

@DailyDreaming DailyDreaming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@adamnovak
Copy link
Member Author

Looks like I actually added a bug here. In CachingFileStoreTestWithAwsJobStore.testAsyncWriteWithCaching CI managed to trigger:

	  File "/builds/databiosphere/toil/src/toil/job.py", line 1048, in replace
	    raise RuntimeError("Trying to take on the ID of a job that is in the process of being committed!")

@glennhickey
Copy link
Contributor

@adamnovak I can confirm that this change does indeed fix my issue.

@adamnovak
Copy link
Member Author

It looks like our job description deepcopy never worked quite right.

To fix the fighting over who is allowed to write to what copy of the job description that made the test fail, I'm moving responsibility for snapshotting job descriptions for asynchronous commit into the file store that does the asynchronous committing and out of the worker. But that means now I'm doing chained deepcopies and that's what didn't work. But I think I have it all working now.

@adamnovak adamnovak merged commit 55c0410 into master Aug 10, 2023
2 checks passed
@adamnovak adamnovak mentioned this pull request Aug 15, 2023
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to restart failed job due to NoSuchFileException
3 participants