Separate file transfer from recipe execution #95
Comments
So both Google Cloud and AWS provide "file transfer services" to / from cloud storage. Seems like we could pretty easily swap these out.
This sounds a bit over-complicated, perhaps because I don't understand Globus. Can we manually start the Globus transfer, and then just write the recipe as if its source location is S3 (or whatever the destination of the transfer is)? This does harm the reproducibility of the recipe a bit, since it relies on an intermediate S3 store that's ephemeral. And then the recipe can choose to skip caching, since the data is already in object storage.
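For concreteness, here is a rough sketch of that workflow using the globus_sdk Python client: submit the transfer ourselves, block until it finishes, then point the recipe at the destination. The endpoint UUIDs, token, paths, and bucket layout below are placeholders, and none of this is something Pangeo Forge provides today.

```python
# Sketch only: manually drive a Globus transfer, then run the recipe against
# the destination. All IDs, tokens, and paths are placeholders.
import globus_sdk

SOURCE_ENDPOINT = "..."   # e.g. the NCAR GLADE Globus endpoint UUID (placeholder)
DEST_ENDPOINT = "..."     # a Globus endpoint backed by our S3 staging bucket (placeholder)
TRANSFER_TOKEN = "..."    # assume this came from a completed Globus OAuth2 flow

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

tdata = globus_sdk.TransferData(
    tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
    label="pangeo-forge staging", sync_level="checksum",
)
for i in range(100):
    # source -> destination path pairs; layout is purely illustrative
    tdata.add_item(f"/glade/data/file_{i:03d}.nc", f"/staging/file_{i:03d}.nc")

task_id = tc.submit_transfer(tdata)["task_id"]
# Block until Globus reports the task is done (or the timeout expires)
tc.task_wait(task_id, timeout=3600, polling_interval=30)

# At this point the recipe's file pattern can point at s3://<bucket>/staging/...
# and input caching can be skipped, since the data is already in object storage.
```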
I did some reading about the library parsl and its support for file transfer / staging: https://parsl.readthedocs.io/en/stable/userguide/data.html#staging-data-files. It seems to have a pretty flexible system for file staging that includes both HTTP / FTP and Globus. In fact, looking through the parsl docs more broadly, it looks like a really useful thing for Pangeo Forge. We could probably implement a parsl executor fairly easily. This could be particularly useful to NCAR folks, since parsl plays very well with HPC. They also apparently support cloud-based execution... so perhaps even an alternative to Prefect if we really get stuck.
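As a rough illustration of the staging model from those docs: you wrap remote locations in `File` objects and parsl stages them to a local path before the app body runs. The URL, executor choice, and staging provider here are illustrative assumptions, not a worked-out Pangeo Forge integration.

```python
# Minimal sketch of parsl file staging over HTTP; Globus staging follows the
# same File-based pattern with a Globus staging provider instead.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.data_provider.files import File
from parsl.data_provider.http import HTTPInTaskStaging
from parsl.executors.threads import ThreadPoolExecutor

# Local thread executor with HTTP staging enabled (illustrative config)
parsl.load(Config(
    executors=[ThreadPoolExecutor(storage_access=[HTTPInTaskStaging()])]
))

@python_app
def inspect(inputs=[]):
    # parsl has already staged inputs[0] to a local file before this runs
    with open(inputs[0].filepath, "rb") as f:
        return len(f.read())

remote = File("https://example.com/data/file.nc")  # placeholder URL
print(inspect(inputs=[remote]).result())
```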
Currently, Pangeo Forge recipes explicitly handle the downloading (called "caching") of files from an external storage service (e.g. downloading a big list of files from HTTP or FTP). This happens in the `cache_input` loop.

However, some data providers (e.g. NCAR; pangeo-forge/staged-recipes#8 (comment)) want us to transfer data using Globus. Globus works fundamentally differently from how Pangeo Forge is currently implemented. Right now, Pangeo Forge manages all downloads explicitly, opening HTTP / FTP / etc. connections to servers and downloading the data directly. Globus, by contrast, is essentially a transfer-as-a-service tool: we queue up a transfer and Globus handles the actual movement of data using its system. Typically we would want to move 1000 netCDF files from NCAR GLADE to S3. If we rely on Globus, we could basically eliminate the `cache_input` loop and just wait for the Globus transfer to complete. However, we would still need a way to defer the recipe execution until after the transfer is complete, which would require some fancy CI logic.

More generally, there could be other reasons to separate these steps. For example, downloading over HTTP from a slow server will not necessarily benefit from parallelism; if we launch 100 simultaneous download requests, we may just end up clobbering the poor HTTP server. Instead, a pub / sub model might work better: we could push the files we want to download into a message queue, and a downloading service would consume them.
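To make the pub / sub idea concrete, here is a toy producer/consumer sketch using only the standard library plus fsspec. A real service would presumably sit behind a proper message broker; the worker count, cache layout, and URLs below are placeholders.

```python
# Toy sketch: a bounded pool of download workers consuming URLs from a queue,
# so the source server never sees more than MAX_WORKERS concurrent requests.
import queue
import threading
import fsspec

MAX_WORKERS = 4  # deliberately small to avoid clobbering a slow HTTP server
CACHE_ROOT = "/tmp/pangeo-forge-cache"  # placeholder cache location
todo = queue.Queue()

def downloader():
    while True:
        url = todo.get()
        if url is None:  # sentinel: no more work
            break
        target = f"{CACHE_ROOT}/{url.split('/')[-1]}"
        with fsspec.open(url, "rb") as src, fsspec.open(target, "wb") as dst:
            dst.write(src.read())
        todo.task_done()

workers = [threading.Thread(target=downloader) for _ in range(MAX_WORKERS)]
for w in workers:
    w.start()

# The "publisher" side: the recipe (or a standalone service) just enqueues inputs.
for url in ["https://example.com/data/file_000.nc"]:  # placeholder URLs
    todo.put(url)

todo.join()           # wait for all enqueued downloads to finish
for _ in workers:
    todo.put(None)    # shut the workers down
```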
Question for @pangeo-forge/dev-team: should we consider removing the `cache_inputs` step from the recipe itself and moving it to a more flexible, standalone component?
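If we did split it out, one purely hypothetical shape for such a standalone component might be a small interface the recipe delegates to, with plain HTTP download and Globus as interchangeable implementations. None of these names exist in pangeo-forge-recipes today.

```python
# Hypothetical sketch of a standalone file-transfer component; names and
# signatures are illustrative, not an existing pangeo-forge API.
from abc import ABC, abstractmethod
from typing import Iterable

class FileTransfer(ABC):
    """Moves a set of source files into the recipe's input cache."""

    @abstractmethod
    def transfer(self, sources: Iterable[str], cache_root: str) -> None:
        ...

class HTTPDownloadTransfer(FileTransfer):
    """Roughly today's behavior: open connections and download files ourselves."""

    def transfer(self, sources, cache_root):
        import fsspec
        for url in sources:
            target = f"{cache_root}/{url.split('/')[-1]}"
            with fsspec.open(url, "rb") as src, fsspec.open(target, "wb") as dst:
                dst.write(src.read())

class GlobusTransfer(FileTransfer):
    """Delegate the data movement to Globus and block until the task finishes."""

    def transfer(self, sources, cache_root):
        # Submit a TransferData task and task_wait on it, as sketched earlier.
        raise NotImplementedError
```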