Separate file transfer from recipe execution #95
Comments
So both Google Cloud and AWS provide "file transfer services" to / from cloud storage. Seems like we could pretty easily swap these out.
This sounds a bit over-complicated, perhaps because I don't understand Globus. Can we manually start the Globus transfer, and then just write the recipe as if its source location is S3 (or whatever the destination of the transfer is)? This does harm the reproducibility of the recipe a bit, since it relies on an intermediate S3 store that's ephemeral. And then the recipe can choose to skip caching, since the data is already in object storage.
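For concreteness, here is a rough sketch of that workflow using the globus_sdk Python client: submit the transfer ourselves, block until it finishes, then point the recipe at the destination. The endpoint UUIDs, token, paths, and bucket layout below are placeholders, and none of this is something Pangeo Forge provides today.

```python
# Sketch only: manually drive a Globus transfer, then run the recipe against
# the destination. All IDs, tokens, and paths are placeholders.
import globus_sdk

SOURCE_ENDPOINT = "..."   # e.g. the NCAR GLADE Globus endpoint UUID (placeholder)
DEST_ENDPOINT = "..."     # a Globus endpoint backed by our S3 staging bucket (placeholder)
TRANSFER_TOKEN = "..."    # assume this came from a completed Globus OAuth2 flow

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

tdata = globus_sdk.TransferData(
    tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
    label="pangeo-forge staging", sync_level="checksum",
)
for i in range(100):
    # source -> destination path pairs; layout is purely illustrative
    tdata.add_item(f"/glade/data/file_{i:03d}.nc", f"/staging/file_{i:03d}.nc")

task_id = tc.submit_transfer(tdata)["task_id"]
# Block until Globus reports the task is done (or the timeout expires)
tc.task_wait(task_id, timeout=3600, polling_interval=30)

# At this point the recipe's file pattern can point at s3://<bucket>/staging/...
# and input caching can be skipped, since the data is already in object storage.
```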
I did some reading about the library parsl and its support for file transfer / staging: https://parsl.readthedocs.io/en/stable/userguide/data.html#staging-data-files. It seems to have a pretty flexible system for file staging that includes both HTTP / FTP and Globus. In fact, looking through the parsl docs more broadly, it looks like a really useful thing for Pangeo Forge. We could probably implement a parsl executor fairly easily. This could be particularly useful to NCAR folks, since parsl plays very well with HPC. They also apparently support cloud-based execution... so perhaps even an alternative to Prefect if we really get stuck.
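As a rough illustration of the staging model from those docs: you wrap remote locations in `File` objects and parsl stages them to a local path before the app body runs. The URL, executor choice, and staging provider here are illustrative assumptions, not a worked-out Pangeo Forge integration.

```python
# Minimal sketch of parsl file staging over HTTP; Globus staging follows the
# same File-based pattern with a Globus staging provider instead.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.data_provider.files import File
from parsl.data_provider.http import HTTPInTaskStaging
from parsl.executors.threads import ThreadPoolExecutor

# Local thread executor with HTTP staging enabled (illustrative config)
parsl.load(Config(
    executors=[ThreadPoolExecutor(storage_access=[HTTPInTaskStaging()])]
))

@python_app
def inspect(inputs=[]):
    # parsl has already staged inputs[0] to a local file before this runs
    with open(inputs[0].filepath, "rb") as f:
        return len(f.read())

remote = File("https://example.com/data/file.nc")  # placeholder URL
print(inspect(inputs=[remote]).result())
```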
Currently, Pangeo Forge recipes explicitly handle the downloading (called "caching") of files from an external storage service (e.g. downloading a big list of files from HTTP or FTP). This happens in the `cache_input` loop.

However, some data providers (e.g. NCAR; pangeo-forge/staged-recipes#8 (comment)) want us to transfer data using Globus. Globus works fundamentally differently from how Pangeo Forge is currently implemented. Right now, Pangeo Forge manages all downloads explicitly, opening HTTP / FTP / etc. connections to servers and downloading the data directly. Globus, by contrast, is essentially a transfer-as-a-service tool: we queue up a transfer and Globus handles the actual movement of data using its system. Typically we would want to move 1000 netCDF files from NCAR GLADE to S3. If we rely on Globus, we could basically eliminate the `cache_input` loop and just wait for the Globus transfer to complete. However, we would still need a way to defer the recipe execution until after the transfer is complete, which would require some fancy CI logic.

More generally, there could be other reasons to separate these steps. For example, downloading over HTTP from a slow server will not necessarily benefit from parallelism; if we launch 100 simultaneous download requests, we may just end up clobbering the poor HTTP server. Instead, a pub / sub model might work better: we could push the files we want to download into a message queue, and a downloading service would consume them.
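To make the pub / sub idea concrete, here is a toy producer/consumer sketch using only the standard library plus fsspec. A real service would presumably sit behind a proper message broker; the worker count, cache layout, and URLs below are placeholders.

```python
# Toy sketch: a bounded pool of download workers consuming URLs from a queue,
# so the source server never sees more than MAX_WORKERS concurrent requests.
import queue
import threading
import fsspec

MAX_WORKERS = 4  # deliberately small to avoid clobbering a slow HTTP server
CACHE_ROOT = "/tmp/pangeo-forge-cache"  # placeholder cache location
todo = queue.Queue()

def downloader():
    while True:
        url = todo.get()
        if url is None:  # sentinel: no more work
            break
        target = f"{CACHE_ROOT}/{url.split('/')[-1]}"
        with fsspec.open(url, "rb") as src, fsspec.open(target, "wb") as dst:
            dst.write(src.read())
        todo.task_done()

workers = [threading.Thread(target=downloader) for _ in range(MAX_WORKERS)]
for w in workers:
    w.start()

# The "publisher" side: the recipe (or a standalone service) just enqueues inputs.
for url in ["https://example.com/data/file_000.nc"]:  # placeholder URLs
    todo.put(url)

todo.join()           # wait for all enqueued downloads to finish
for _ in workers:
    todo.put(None)    # shut the workers down
```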
Question for @pangeo-forge/dev-team: should we consider removing the `cache_inputs` step from the recipe itself and moving it to a more flexible, standalone component?
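If we did split it out, one purely hypothetical shape for such a standalone component might be a small interface the recipe delegates to, with plain HTTP download and Globus as interchangeable implementations. None of these names exist in pangeo-forge-recipes today.

```python
# Hypothetical sketch of a standalone file-transfer component; names and
# signatures are illustrative, not an existing pangeo-forge API.
from abc import ABC, abstractmethod
from typing import Iterable

class FileTransfer(ABC):
    """Moves a set of source files into the recipe's input cache."""

    @abstractmethod
    def transfer(self, sources: Iterable[str], cache_root: str) -> None:
        ...

class HTTPDownloadTransfer(FileTransfer):
    """Roughly today's behavior: open connections and download files ourselves."""

    def transfer(self, sources, cache_root):
        import fsspec
        for url in sources:
            target = f"{cache_root}/{url.split('/')[-1]}"
            with fsspec.open(url, "rb") as src, fsspec.open(target, "wb") as dst:
                dst.write(src.read())

class GlobusTransfer(FileTransfer):
    """Delegate the data movement to Globus and block until the task finishes."""

    def transfer(self, sources, cache_root):
        # Submit a TransferData task and task_wait on it, as sketched earlier.
        raise NotImplementedError
```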