Add filesize sniffing and parallelize importing on jobs #5133

stxue1 · 2024-10-24T18:13:51Z

Closes #5114

Changelog Entry

To be copied to the draft changelog by merger:

Add support for parallel file imports
- Usurps Support importing on workers in WDL #5103 and Allow importing on workers #5098
- --importWorkersDisk replaced with --importWorkersThreshold. This specifies the threshold where files will begin to be imported on individual jobs. Small files will be batched into the same import job up to this threshold.

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passes tests.
Make sure the PR has been reviewed since its last modification. If not, review it.
Merge with the Github "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

…14-filesize-sniffing

adamnovak

I think the shape of the import logic is right, but I'm concerned about some variable names and some duplicated code, and about whether the CWL-side import logic actually has all its interdependencies documented sufficiently.

adamnovak · 2024-11-01T15:12:02Z

src/toil/cwl/cwltoil.py

 from toil.jobStores.abstractJobStore import (
    AbstractJobStore,
-    InvalidImportExportUrlException,
-    LocatorException,
    NoSuchFileException,
+    LocatorException,
+    InvalidImportExportUrlException,
    UnimplementedURLException,
 )


It looks like this might be undoing some of the formatting/import sorting improvements we recently merged; maybe the PR should be run through the code formatter Makefile target?

It seems like the Makefile code formatter formats quite a few files, except for cwltoil.py. I'll manually undo this change, maybe this was done on accident while I was messing with imports

adamnovak · 2024-11-01T15:17:12Z

src/toil/cwl/cwltoil.py

+def extract_files(
+    fileindex: Dict[str, str],
+    existing: Dict[str, str],
+    file_metadata: CWLObjectType,
+    mark_broken: bool = False,
+    skip_remote: bool = False,
+) -> Optional[str]:
+    """
+    Extract the filename from a CWL file record


If this operates on just one filename, it should have a singular name.

This function also does more than just extract a filename from a record. It consults and sometimes updates fileindex and it will also update file_metadata to fill in its location from its path, and it will find the realpath (i.e. resolve symlinks) for bare file paths but for some reason not for file:// URIs. The docstring needs to explain why it does these things. Otherwise the caller will be surprised when it happens.

This function has a bunch of borrowed code from upload_file and write_file. I think if the file:// scheme exists, some resolving was already done.

adamnovak · 2024-11-01T15:18:15Z

src/toil/cwl/cwltoil.py

+    """
+    Extract the filename from a CWL file record
+    :param fileindex: Forward mapping of filename
+    :param existing: Reverse mapping of filename. This function does not use this


We need this argument to match a pre-defined function signature, right? Maybe we should mention where the kind of function that this function needs to be is documented. And if there isn't any documentation on this kind of function, we should maybe come up with a name for it and document it.

adamnovak · 2024-11-01T15:24:26Z

src/toil/cwl/cwltoil.py

+            if not urlparse(location).scheme:
+                rp = os.path.realpath(location)
+            else:
+                rp = location


Why aren't symlinks in the file:// URI's path resolved here?

I borrowed this code from write_file. I believe all files should be resolved into file URIs by this point, so this is likely some edge case. From my limited testing, I can't trigger the realpath branch.

adamnovak · 2024-11-01T15:25:07Z

src/toil/cwl/cwltoil.py

+        # This is a local file, or we also need to download and re-upload remote files
+        if location not in fileindex:
+            # don't download twice


This function doesn't have any responsibility for deciding whether to download things; why are these comments talking about whether to download things?

adamnovak · 2024-11-01T18:40:57Z

src/toil/wdl/wdltoil.py

    """
-    return is_url_with_scheme(filename, REMOTE_SCHEMES)
+    Resolve relative-URI files in the given environment and them then into absolute normalized URIs. Returns a dictionary of WDL file values to a tuple of the normalized URI,


I don't think and them then makes sense.

We might also want to say something different than "WDL file values"; we mean the string values that would appear in the value field of a WDL.Value.File object. But you could also think that a WDL.Value.File object is itself a "WDL file value".

adamnovak · 2024-11-01T18:45:36Z

src/toil/wdl/wdltoil.py

+            except UnimplementedURLException as e:
+                # We can't find anything that can even support this URL scheme.
+                # Report to the user, they are probably missing an extra.
+                logger.critical("Error: " + str(e))
+                raise
+            except HTTPError as e:
+                # Something went wrong looking for it there.
+                logger.warning(
+                    "Checked URL %s but got HTTP status %s", candidate_uri, e.code
+                )
+                # Try the next location.
+                continue
+            except FileNotFoundError:
+                # Wasn't found there
+                continue
+            except Exception:
+                # Something went wrong besides the file not being found. Maybe
+                # we have no auth.
+                logger.error(
+                    "Something went wrong when testing for existence of %s",
+                    candidate_uri,
+                )
+                raise


This duplicates a lot of code with the CWL-side get_file_sizes(). Is there a common function that could be extracted here that polls a URL and returns whether it existed and, if so, the size if available, and raises on other errors?

Separated it out into job.py

adamnovak · 2024-11-01T18:48:06Z

src/toil/wdl/wdltoil.py

+def convert_files(
+    environment: WDLBindings,
+    file_to_id: Dict[str, FileID],
+    file_to_data: Dict[str, FileMetadata],
+    task_path: str,
+) -> None:


Instead of altering the File objects inside environment in place, this function should return a modified copy of environment. The WDLBindings objects in MiniWDL I think are meant to be immutable.

adamnovak · 2024-11-01T18:51:13Z

src/toil/wdl/wdltoil.py

@@ -5180,8 +5401,8 @@ class WDLStartJob(WDLSectionJob):

    def __init__(
        self,
-        target: WDL.Tree.Workflow | WDL.Tree.Task,
-        inputs: WDLBindings,
+        target: Union[WDL.Tree.Workflow, WDL.Tree.Task],


I think using | everywhere instead of Union is another pyupgrade change that we want to keep.

adamnovak · 2024-11-01T18:52:30Z

src/toil/wdl/wdltoil.py



 def make_root_job(
    target: WDL.Tree.Workflow | WDL.Tree.Task,
    inputs: WDLBindings,
-    inputs_search_path: list[str],
+    inputs_search_path: List[str],


@mr-c also changed to using the new generic support in the base list, dict, etc. instead of needing to import the versions from typing, so we shouldn't undo that.

…ere/toil into issues/5114-filesize-sniffing

…14-filesize-sniffing

src/toil/cwl/cwltoil.py

…ued imports in job + get rid of wdl dependency in job.py

Co-authored-by: Michael R. Crusoe <1330696+mr-c@users.noreply.github.com>

adamnovak

I still don't think the idea of hoping to notice via exception whether we have exhausted our disk space is going to work. Some backends just kill the job instead of giving you an exception, whereas others let you plow right through your disk limit and interfere with other jobs. (Usually other jobs in the workflow, of which there shouldn't really be any of note during file import, but it's still not a thing jobs are meant to knowingly do.)

I am also still dubious of the extract_file_uri_once design being the best approach there (who would think to themselves "I want to get the file URI but only if it hasn't already been put in the cache"?). But it kind of has to be the shape it is to be mapped over the files, so maybe it really is the best we can do?

adamnovak · 2024-11-15T23:19:14Z

src/toil/cwl/cwltoil.py

+    Dict,
+    Iterator,
+    List,


I think we don't need Dict and List anymore because we can use dict and list now.

adamnovak · 2024-11-15T23:20:13Z

src/toil/job.py

+    importer: str | None = None,
+    execution_dir: str | None = None,


We might actually still need Optional on 3.9.

adamnovak · 2024-11-15T23:22:43Z

src/toil/job.py

+            if file_basename == "":
+                # We can't have files with no basename because we need to
+                # download them at that basename later.
+                raise RuntimeError(
+                    f"File {candidate_uri} has no basename and so cannot be a WDL File"
+                )


This error message is WDL-specific. I'm not sure if it's OK to impose the constraint on CWL as well, but if we do we should just complain generically that the file has no basename.

Suggested change

if file_basename == "":

# We can't have files with no basename because we need to

# download them at that basename later.

raise RuntimeError(

f"File {candidate_uri} has no basename and so cannot be a WDL File"

)

if file_basename == "":

# We can't have files with no basename because we need to

# download them at that basename later in WDL.

raise RuntimeError(

f"File {candidate_uri} has no basename"

)

We have another empty basename check in the WDL-specific code; is this one maybe redundant?

adamnovak · 2024-11-15T23:24:45Z

src/toil/job.py

+        streaming, so if true, assume streaming works and don't give the worker a lot of disk space to work with.
+        If streaming fails, the worker will run out of resources and allocate a child job to handle the import with enough disk space.


We might need to indent this for the docs to parse it right.

Suggested change

streaming, so if true, assume streaming works and don't give the worker a lot of disk space to work with.

If streaming fails, the worker will run out of resources and allocate a child job to handle the import with enough disk space.

streaming, so if true, assume streaming works and don't give the worker a lot of disk space to work with.

If streaming fails, the worker will run out of resources and allocate a child job to handle the import with enough disk space.

adamnovak · 2024-11-15T23:34:28Z

src/toil/job.py

+        try:
+            return self.import_files(self.filenames, file_store.jobStore)
+        except OSError as e:
+            # If the worker crashes due to running out of disk space and was not trying to
+            # stream the file import, then try a new import job without streaming by actually giving
+            # the worker enough disk space
+            # OSError 28 is no space left on device
+            if e.errno == 28 and self.stream is True:
+                non_streaming_import = WorkerImportJob(
+                    self.filenames, self.disk_size, stream=False
+                )
+                self.addChild(non_streaming_import)
+                return non_streaming_import.rv()
+            else:
+                raise


I don't think this is a good approach; we don't always sandbox jobs to keep them within their requested disk space, so if one job goes over its disk request it can make a different job fail due to not having enough space for its temporary files.

This might honestly be better if we didn't handle the case where the attempted streaming import went over its disk space limit, and just made it the user's problem to set a disk space limit big enough if they have any imports that can't actually stream?

Or we could implement a flag on the job store import method to only allow streaming and fail if streaming is not possible, and catch that and do the fallback.

Also, if we hit the disk space limit after importing several files already, when do we delete those imported copies? It looks like we will leave them behind in the job store and then re-import the same files non-streaming with more disk space.

stxue1 added 5 commits October 16, 2024 00:16

Basic WDL implementation

b59bc2e

Fix bug with importworker options

ccea6b0

Asbtract WDL and CWL imports and enforce a better method

3cd4c61

Merge branch 'master' of github.com:DataBiosphere/toil into issues/51…

d16dc24

…14-filesize-sniffing

Remove unused import

642efd0

This was referenced Oct 24, 2024

Add FTP support to AbstractJobStore._get_size #5134

Open

Add leader import fallback to running imports on workers #5135

Open

stxue1 and others added 4 commits October 24, 2024 11:55

satisfy mypy

e712d88

Fix WDL import in CWL

e6c13fc

Fix WDL comments

ae1e279

Merge branch 'master' into issues/5114-filesize-sniffing

0c910f7

stxue1 marked this pull request as ready for review October 25, 2024 17:34

stxue1 and others added 4 commits October 30, 2024 08:24

Merge branch 'master' of github.com:DataBiosphere/toil into issues/51…

1eb4caf

…14-filesize-sniffing

move url functions to job, remove dead imports, format with black

1d89969

Merge master into issues/5114-filesize-sniffing

56b7a50

Merge master into issues/5114-filesize-sniffing

5eeb3b1

adamnovak requested changes Nov 1, 2024

View reviewed changes

github-actions bot and others added 10 commits November 5, 2024 07:11

Merge master into issues/5114-filesize-sniffing

d9d5ee9

Address comments

ff51526

mypy

b7ca42c

Merge branch 'issues/5114-filesize-sniffing' of github.com:DataBiosph…

2209cce

…ere/toil into issues/5114-filesize-sniffing

Detect running out of disk space with OSError instead

04ede19

Merge branch 'master' of github.com:DataBiosphere/toil into issues/51…

e5b5bbd

…14-filesize-sniffing

Fix logic in worker import loop

05a0898

Fix bad caching

9f4fe36

Merge branch 'master' of github.com:DataBiosphere/toil into issues/51…

206926b

…14-filesize-sniffing

Merge master into issues/5114-filesize-sniffing

eba8f37

mr-c reviewed Nov 14, 2024

View reviewed changes

src/toil/cwl/cwltoil.py Outdated Show resolved Hide resolved

github-actions bot added 2 commits November 14, 2024 15:04

Merge master into issues/5114-filesize-sniffing

f955721

Merge master into issues/5114-filesize-sniffing

f12c199

stxue1 and others added 4 commits November 14, 2024 12:50

Add wdl-conformance-tests to gitignore and sphinx config + remove uns…

645fd25

…ued imports in job + get rid of wdl dependency in job.py

Update src/toil/cwl/cwltoil.py

9dac988

Co-authored-by: Michael R. Crusoe <1330696+mr-c@users.noreply.github.com>

Merge master into issues/5114-filesize-sniffing

5c73743

Merge master into issues/5114-filesize-sniffing

6b144c9

adamnovak requested changes Nov 15, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add filesize sniffing and parallelize importing on jobs #5133

Add filesize sniffing and parallelize importing on jobs #5133

stxue1 commented Oct 24, 2024

adamnovak left a comment

adamnovak Nov 1, 2024

stxue1 Nov 5, 2024

adamnovak Nov 1, 2024

stxue1 Nov 5, 2024

adamnovak Nov 1, 2024

adamnovak Nov 1, 2024

stxue1 Nov 5, 2024

adamnovak Nov 1, 2024

adamnovak Nov 1, 2024

adamnovak Nov 1, 2024

stxue1 Nov 6, 2024

adamnovak Nov 1, 2024

adamnovak Nov 1, 2024

adamnovak Nov 1, 2024

adamnovak left a comment

adamnovak Nov 15, 2024

adamnovak Nov 15, 2024

adamnovak Nov 15, 2024

adamnovak Nov 15, 2024

adamnovak Nov 15, 2024

adamnovak Nov 15, 2024

		importer: str \| None = None,
		execution_dir: str \| None = None,

		streaming, so if true, assume streaming works and don't give the worker a lot of disk space to work with.
		If streaming fails, the worker will run out of resources and allocate a child job to handle the import with enough disk space.

Add filesize sniffing and parallelize importing on jobs #5133

Are you sure you want to change the base?

Add filesize sniffing and parallelize importing on jobs #5133

Conversation

stxue1 commented Oct 24, 2024

Changelog Entry

Reviewer Checklist

Merger Checklist

adamnovak left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamnovak left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment