Stop deleting chained-to jobs which fail as orphaned jobs #4557

adamnovak · 2023-08-03T22:56:01Z

This fixes #4504 by clarifying the semantics of jobsToDelete. It is now called merged_jobs, to make it clear that its semantics are very different from those of filesToDelete, which journals file deletions.

Previously, when a job was chained to, we would put it in jobsToDelete on the chaining-from job, and then delete it when that job finished successfully.

But at some point we also started cutting the successor relationship to the chained-to job. That meant that if the chained-to job failed, then when we went through the job tree during the restart, we would find the chained-to job's original job description, under its original ID, as an orphaned job and delete it.

This adds code to treat the original chained-to job as more properly merged into the chained-from job, and so still reachable from the root.
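As a rough illustration of the reachability argument above (a sketch under stated assumptions, not Toil's actual code): `JobRecord`, its `successors` and `merged_jobs` fields, and `find_orphans` are hypothetical names invented for this example. The sketch walks a job tree from the root and reports anything unreachable as an orphan, with a switch for whether merged job IDs count as reachable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class JobRecord:
    """Hypothetical stand-in for a stored job description."""
    job_id: str
    # IDs of jobs that run after this one.
    successors: List[str] = field(default_factory=list)
    # IDs of chained-to jobs that were merged into this one.
    merged_jobs: List[str] = field(default_factory=list)


def find_orphans(store: Dict[str, JobRecord], root_id: str,
                 honor_merged: bool = True) -> Set[str]:
    """Return stored job IDs that cannot be reached from the root."""
    reachable: Set[str] = set()
    stack = [root_id]
    while stack:
        job_id = stack.pop()
        if job_id in reachable:
            continue
        reachable.add(job_id)
        record = store.get(job_id)
        if record is None:
            continue
        stack.extend(record.successors)
        if honor_merged:
            # Merged IDs count as reachable even though the
            # successor edge to them has been cut.
            stack.extend(record.merged_jobs)
    return set(store) - reachable


# Job A chained to job B: the successor edge A -> B is gone,
# but B's original description is still stored under its own ID.
store = {
    "A": JobRecord("A", merged_jobs=["B"]),
    "B": JobRecord("B"),
}
```

With `honor_merged=False` the walk from `"A"` never sees `"B"`, so `"B"` is reported as an orphan; with the default it is not.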

I'm also clarifying what the filesToDelete system is actually supposed to be used for: it is a write-ahead journal of file deletions.
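For readers unfamiliar with the pattern, a write-ahead deletion journal (the term used in this PR's commit list) records the intent to delete before deleting anything, so that an interrupted deletion can be finished on restart. The following is a minimal in-memory sketch of that general idea, not Toil's implementation: `JournaledStore`, `files_to_delete`, `delete_files`, `recover`, and the `crash_after` test hook are all hypothetical, and a real journal would be saved durably rather than kept in memory.

```python
from typing import Dict, List


class JournaledStore:
    """Hypothetical store illustrating a write-ahead deletion journal."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        # Journal of file IDs whose deletion was promised but may be unfinished.
        self.files_to_delete: List[str] = []

    def delete_files(self, file_ids: List[str], crash_after: int = -1) -> None:
        # Step 1: record the intent before touching any file.
        self.files_to_delete = list(file_ids)
        # Step 2: carry out the deletions (optionally failing partway).
        for count, file_id in enumerate(file_ids):
            if count == crash_after:
                raise RuntimeError("simulated crash mid-deletion")
            self.files.pop(file_id, None)
        # Step 3: clear the journal only once every deletion is done.
        self.files_to_delete = []

    def recover(self) -> None:
        # Replay the journal; deleting an already-deleted file is harmless.
        for file_id in self.files_to_delete:
            self.files.pop(file_id, None)
        self.files_to_delete = []


journal_demo = JournaledStore()
journal_demo.files = {"a": b"1", "b": b"2", "c": b"3"}
try:
    journal_demo.delete_files(["a", "b"], crash_after=1)
except RuntimeError:
    pass  # "a" is gone, "b" is not, and the journal still lists both.
```

Calling `journal_demo.recover()` afterwards finishes the interrupted deletion and empties the journal.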

Changelog Entry

To be copied to the draft changelog by merger:

Chained-to jobs which fail should no longer be deleted as orphans upon workflow restart (Unable to restart failed job due to NoSuchFileException #4504)

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passes tests.
Make sure the PR has been reviewed since its last modification. If not, review it.
Merge with the GitHub "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

adamnovak · 2023-08-03T22:58:22Z

@glennhickey Do you want to see if this stops you from reproducing #4504? Or if a job getting chained to, and failing, and then being deleted as an orphan, is consistent with your full logs?

glennhickey · 2023-08-04T15:01:25Z

Thanks @adamnovak! I've been having trouble reproducing this issue, but will give it another go. If I can't get the cluster to kill it again, I'll just do it myself.

DailyDreaming

LGTM.

adamnovak · 2023-08-08T16:04:06Z

Looks like I actually added a bug here. In CachingFileStoreTestWithAwsJobStore.testAsyncWriteWithCaching, CI managed to trigger:

	  File "/builds/databiosphere/toil/src/toil/job.py", line 1048, in replace
	    raise RuntimeError("Trying to take on the ID of a job that is in the process of being committed!")

glennhickey · 2023-08-08T16:36:53Z

@adamnovak I can confirm that this change does indeed fix my issue.

adamnovak · 2023-08-08T19:41:08Z

It looks like our job description deepcopy never worked quite right.

The test failed because of fighting over who is allowed to write to which copy of the job description. To fix that, I'm moving responsibility for snapshotting job descriptions for asynchronous commit out of the worker and into the file store, which is what does the asynchronous committing. But that means I'm now doing chained deepcopies, and that's what didn't work. I think I have it all working now.
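A minimal sketch of the snapshot-then-commit idea described above, under stated assumptions: `Description` and `AsyncCommitter` are hypothetical stand-ins, not Toil classes, and the "save" is just a dictionary write. The point is only that the object doing the asynchronous commit takes its own `copy.deepcopy` snapshot, so the caller can keep mutating its copy, and that a copy of a copy still behaves like the original.

```python
import copy
import threading
from typing import Dict, List, Optional


class Description:
    """Hypothetical stand-in for a job description."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.files_to_delete: List[str] = []


class AsyncCommitter:
    """Hypothetical file-store-like object that owns snapshotting."""

    def __init__(self) -> None:
        self.saved: Dict[str, Description] = {}
        self._thread: Optional[threading.Thread] = None

    def start_commit(self, desc: Description) -> None:
        # Snapshot here, so the caller stays free to mutate its own copy
        # while the background thread saves a stable version.
        snapshot = copy.deepcopy(desc)
        self._thread = threading.Thread(target=self._save, args=(snapshot,))
        self._thread.start()

    def _save(self, snapshot: Description) -> None:
        self.saved[snapshot.job_id] = snapshot

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()


desc = Description("job1")
desc.files_to_delete.append("x")

committer = AsyncCommitter()
committer.start_commit(desc)
desc.files_to_delete.append("y")  # Does not affect the snapshot being saved.
committer.wait()

# A copy of a copy ("chained" deepcopy) should match the original's state.
second_generation = copy.deepcopy(copy.deepcopy(desc))
```

Putting the snapshot inside `start_commit` means there is exactly one place that decides which copy is being written, which is the design choice the comment above describes.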

adamnovak added 8 commits August 3, 2023 17:35

Use a different name for unreachable jobs

f064fb0

Get rid of unused jobsToDelete in the file store

263c456

Clarify semantics of filesToDelete as a write-ahead deletion journal

6694acf

Treat chaining as merging jobs

d9ffa09

Add a test for restarting a chained-to job

8377c40

Turn off the fix to prove that we can reproduce the problem

dce4610

Revert "Turn off the fix to prove that we can reproduce the problem"

7a1dd51

This reverts commit dce4610.

Resolve TODO

7631fcd

DailyDreaming approved these changes Aug 7, 2023

Merge remote-tracking branch 'upstream/master' into issues/4504-chain…

336bf6c

…-transactionally

adamnovak added 6 commits August 8, 2023 12:48

Move responsibility for snapshotting job descriptions during commits over to the file store

e8e89b4

Fix whitespace and change non caching layout back

5add163

Add a bunch of async commit debugging

9184782

Catch the deepcopy misbehavior

b446f47

Allow deepcopy on job descriptions to chain correctly

dfcdbb8

Remove extra debugging

3d4a6f5

adamnovak merged commit 55c0410 into master Aug 10, 2023
2 checks passed

adamnovak mentioned this pull request Aug 15, 2023

More WDL and Slurm documentation #4558

Merged
