
Separate file transfer from recipe execution #95

Open
rabernat opened this issue Apr 6, 2021 · 3 comments

Comments


rabernat commented Apr 6, 2021

Currently, Pangeo Forge recipes explicitly handle the downloading (called "caching") of files from an external storage service (e.g. downloading a big list of files over HTTP or FTP). This happens in the cache_input loop.
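To make the discussion concrete, here is a minimal sketch of what a cache_input-style loop does, with the download simulated by a local file copy. The function and directory names are illustrative, not the actual pangeo-forge-recipes API.

```python
# Hypothetical sketch of a cache_input-style loop: each remote file is
# copied into a cache before the recipe runs. The "download" here is a
# local copy; in Pangeo Forge the source would be an HTTP/FTP URL.
import shutil
import tempfile
from pathlib import Path

def cache_input(source: Path, cache_dir: Path) -> Path:
    """Copy one input file into the cache, skipping files already cached."""
    target = cache_dir / source.name
    if not target.exists():
        shutil.copy(source, target)  # stand-in for an HTTP/FTP download
    return target

# Demo: temporary files stand in for remote inputs.
src_dir = Path(tempfile.mkdtemp())
cache_dir = Path(tempfile.mkdtemp())
inputs = []
for i in range(3):
    p = src_dir / f"file_{i}.nc"
    p.write_text(f"data {i}")
    inputs.append(p)

cached = [cache_input(p, cache_dir) for p in inputs]
print([p.name for p in cached])
```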

However, some data providers (e.g. NCAR; pangeo-forge/staged-recipes#8 (comment)) want us to transfer data using Globus. Globus works fundamentally differently from how Pangeo Forge is currently implemented: right now, Pangeo Forge manages all downloads explicitly, opening HTTP / FTP / etc. connections to servers and downloading data directly.

Globus, by contrast, is essentially a transfer-as-a-service tool: we queue up a transfer and then Globus handles the actual movement of data using their system. Typically we would want to move, say, 1000 netCDF files from NCAR's GLADE filesystem to S3. If we rely on Globus, we could basically eliminate the cache_input loop and just wait for the Globus transfer to complete. However, we would still need a way to defer the recipe execution until after the transfer is complete. That would require some fancy CI logic.
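The "defer until the transfer is complete" logic amounts to submit-then-poll. A minimal sketch, with the transfer service simulated (the status function flips to SUCCEEDED after a couple of polls); with real Globus, globus-sdk's TransferClient would play this role:

```python
# Hypothetical sketch of deferring recipe execution until an external
# (Globus-style) transfer completes. The transfer task is simulated:
# it reports ACTIVE twice, then SUCCEEDED forever after.
import itertools
import time

_status = itertools.chain(["ACTIVE", "ACTIVE"], itertools.repeat("SUCCEEDED"))

def task_status(task_id: str) -> str:
    """Stand-in for asking the transfer service about a task."""
    return next(_status)

def wait_for_transfer(task_id: str, poll_interval: float = 0.01) -> None:
    """Block until the transfer task reports success."""
    while task_status(task_id) != "SUCCEEDED":
        time.sleep(poll_interval)  # in practice, minutes rather than ms

def execute_recipe() -> str:
    """Stand-in for running the actual recipe."""
    return "recipe executed"

wait_for_transfer("task-1234")  # hypothetical task id
result = execute_recipe()
print(result)
```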

More generally, there could be other reasons to separate these steps. For example, downloading over HTTP from a slow server will not necessarily benefit from parallelism; if we launch 100 simultaneous download requests, we may just end up clobbering the poor HTTP server. Instead, a pub / sub model might work better: we could push the files we want to download into a message queue, and a downloading service would consume them.
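The pub / sub idea can be sketched with a plain in-process queue and a small fixed worker pool, so the origin server never sees more than `n_workers` concurrent requests. Downloads are simulated; URLs and helper names are hypothetical:

```python
# Hypothetical sketch of the pub/sub model: producers push download jobs
# onto a queue; a bounded pool of workers consumes them, capping the
# concurrency hitting the (slow) origin server.
import queue
import threading

def fetch(url: str) -> str:
    return f"contents of {url}"  # stand-in for an HTTP download

def worker(jobs, results, lock):
    while True:
        try:
            url = jobs.get_nowait()
        except queue.Empty:
            return  # queue drained; worker exits
        data = fetch(url)
        with lock:
            results.append(data)
        jobs.task_done()

jobs = queue.Queue()
for i in range(100):
    jobs.put(f"https://example.org/file_{i}.nc")  # hypothetical URLs

results = []
lock = threading.Lock()
n_workers = 4  # cap concurrency to protect the origin server
threads = [threading.Thread(target=worker, args=(jobs, results, lock))
           for _ in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))
```

In production the in-process `queue.Queue` would be replaced by a real message broker, but the shape of the consumer loop is the same.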

Question for @pangeo-forge/dev-team: should we consider removing the cache_inputs step from the recipe itself and moving this to a more flexible, standalone component?

rabernat changed the title from "Globus transfers" to "Separate file transfer from recipe execution" on Apr 12, 2021
rabernat (author) commented:

So both Google Cloud and AWS provide "file transfer services" to / from cloud storage.

Seems like we could pretty easily swap these out.

TomAugspurger commented:

> If we rely on Globus, we could basically eliminate the cache_input loop and just wait for the Globus transfer to complete. However, we would still need a way to defer the recipe execution until after the transfer is complete. That would require some fancy CI logic.

This sounds a bit over-complicated, perhaps because I don't understand Globus. Can we manually start the Globus transfer, and then just write the recipe as if its source location were S3 (or whatever the destination of the transfer is)? This does harm the reproducibility of the recipe a bit, since it relies on an intermediate S3 store that's ephemeral.

And then the recipe can choose to skip caching, since it's already in object storage.
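The "choose to skip caching" decision could be as simple as checking whether an input already lives in object storage. A minimal sketch, judging by URL scheme; the helper and constant names are illustrative, not the actual pangeo-forge-recipes API:

```python
# Hypothetical sketch: inputs already in object storage (s3://, gs://,
# abfs://) are opened directly; everything else goes through the cache.
OBJECT_STORE_SCHEMES = ("s3://", "gs://", "abfs://")

def needs_caching(url: str) -> bool:
    """Return True if this input must be downloaded into the cache first."""
    return not url.startswith(OBJECT_STORE_SCHEMES)

print(needs_caching("s3://bucket/transferred/file_0.nc"))  # already staged
print(needs_caching("https://slow-server.org/file_0.nc"))  # must be cached
```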

rabernat (author) commented:

I did some reading about the library parsl and its support for file transfer / staging. https://parsl.readthedocs.io/en/stable/userguide/data.html#staging-data-files

It seems to have a pretty flexible system for file staging which includes both http / ftp and Globus.

In fact, looking through the parsl docs more broadly, it looks like a really useful thing for Pangeo Forge. We could probably implement a parsl executor fairly easily. This could be particularly useful to NCAR folks, since parsl plays very well with HPC. They also apparently support cloud-based execution... so perhaps even an alternative to Prefect if we really get stuck.
