
metadata-only outputs #70

Open
martindurant opened this issue Feb 2, 2021 · 2 comments
Labels: design question (A question of the design of Pangeo Forge), input formats, output formats

Comments

@martindurant (Contributor)

The typical workflow discussed in most other issues is around taking some dataset and transforming it into zarr format for storage, with a number of options for how to go about doing that.

Here I want to make note of an alternative, where the final data product is still loaded from the original source, and the output of the pangeo-forge process is a prescription for how to go about it. The principal use cases for this are:

  • data sets that are constantly changing, and so would need to be run repeatedly through the forge with "append" if it were to be made into a single dataset
  • the original data is too big to be replicated, and most analysis would require only small sections of it
  • making different cuts or views of some very large data set
  • information is encoded in the file naming convention of the original

For now, two broad categories of data access are considered:

  • loading binary chunks from the target. This is the idea behind fsspec-reference-maker, whereby we find binary blocks within some dataset and assign a zarr chunk key to each. The included example is for a single HDF5 file, but the idea could be extended to many files and many formats, so long as the codec is implemented in (or easily added to) numcodecs. This works well in the domain where the original binary chunks are large enough for byte-range access to be efficient. The downside is having to extract the byte offsets from the original, which requires a complete read. Once scanned, though, the original dataset becomes available to zarr directly, without needing the libraries that did the scanning.
  • loading whole files. This is the idea behind intake_informaticslab. In this case, the files are loaded using the library appropriate for the original data format, which might require temporary local storage (e.g., for grib2, where the code is in C and needs a file handle). However, a set of data files is still expressed as zarr, one file per chunk, via a custom zarr storage driver. In the examples, the mapping of zarr keys to file locations is specific to each dataset and depends on its naming conventions. This access pattern requires an understanding of the target file layout, but no scanning of the files. It does need the zarr storage layer to be distributed, to do the key mapping and temporary file download, but the filename mapping could be expressed declaratively rather than in code. This pattern would be required where the binary blocks in the original cannot be decoded directly, or where each file contains many small blocks, making direct access very inefficient.
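The first pattern above can be sketched in a few lines of Python. This is illustrative only, not the actual fsspec-reference-maker API: a plain mapping from zarr-style chunk keys to (url, offset, length) triples, served by reading byte ranges from the source, here simulated with an in-memory blob standing in for a scanned HDF5 file.

```python
# Hedged sketch of the "reference" idea: zarr chunk keys map to byte
# ranges inside original files. All names here are invented for the example.
import io
from collections.abc import Mapping


class ReferenceStore(Mapping):
    """Serve zarr-style chunk keys by reading byte ranges from source files."""

    def __init__(self, refs, opener):
        self.refs = refs      # key -> (url, offset, length)
        self.opener = opener  # url -> file-like object

    def __getitem__(self, key):
        url, offset, length = self.refs[key]
        f = self.opener(url)
        f.seek(offset)
        return f.read(length)

    def __iter__(self):
        return iter(self.refs)

    def __len__(self):
        return len(self.refs)


# Pretend this blob is a scanned file: a 3-byte header, then two 8-byte chunks.
blob = b"HDR" + b"\x01" * 8 + b"\x02" * 8
store = ReferenceStore(
    {"temp/0": ("mem://data.h5", 3, 8), "temp/1": ("mem://data.h5", 11, 8)},
    lambda url: io.BytesIO(blob),
)
print(store["temp/0"] == b"\x01" * 8)  # -> True
```

Once such a mapping exists (serialized to JSON in practice), zarr can read the original data through it without any of the scanning machinery present.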
@martindurant added the design question, input formats, and output formats labels on Feb 2, 2021
@rabernat (Contributor) commented Feb 2, 2021

I think this is a great pattern we should definitely work to support! 👍 These recipes will generally be a bit cheaper to run because they don't have to copy much data.

Please feel free to take a stab at implementing such a recipe class. It would be good to have an issue in staged-recipes to point to a specific dataset we can use as a user story.

@martindurant (Contributor, Author)

cc @tam203 - you might be interested in eventually encoding your datasets into pangeo-forge recipes or, more simply, in including your existing catalogue prescriptions. I have not yet had the chance to look through the code of Hypothetic in detail, so as to build a good mental model of the components (filename convention versus zarr chunk key; zarr storage; intake driver; download/cache layer).
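To make the "filename convention versus zarr chunk key" component concrete, here is a hedged sketch of a declarative key-to-URL mapping in the spirit of the whole-file pattern. The template, bucket name, variable name, origin time, and one-file-per-hour chunking are all invented for the example, not taken from any real dataset.

```python
# Hypothetical declarative mapping from a zarr chunk key to a source file URL.
from datetime import datetime, timedelta

TEMPLATE = "s3://bucket/{var}/{stamp:%Y%m%dT%H}.grib2"  # assumed layout
ORIGIN = datetime(2021, 2, 1, 0)                        # assumed first time step


def key_to_url(key):
    """Map a zarr chunk key like 'air_temp/5' to its source file URL.

    Assumes the dataset is chunked along time, one file per hourly step,
    so the chunk index is just an hour offset from ORIGIN.
    """
    var, index = key.rsplit("/", 1)
    stamp = ORIGIN + timedelta(hours=int(index))
    return TEMPLATE.format(var=var, stamp=stamp)


print(key_to_url("air_temp/5"))  # -> s3://bucket/air_temp/20210201T05.grib2
```

A custom zarr store would call something like this for each key it is asked for, fetch the file (possibly to a local cache), decode it with the native library, and return the bytes zarr expects. Because the mapping is just a template plus an origin, it could live in a recipe or catalog as data rather than code.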

@rabernat
The simplest case to encode would be the existing example in https://github.com/intake/fsspec-reference-maker/blob/main/examples/intake_catalog.yml , and the reference file specified therein. The recipe would essentially repeat that scan, tracking the latest capabilities of fsspec-reference-maker as it evolves.
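For orientation, a catalog entry of this general shape could point an intake/zarr reader at a generated reference file. This is an illustrative sketch only; the driver name, option names, and URLs are assumptions, not the contents of the linked catalog.

```yaml
# Illustrative sketch only - names and options are assumptions.
sources:
  example_dataset:
    description: "Original files exposed as zarr via a reference file"
    driver: zarr
    args:
      urlpath: "reference://"
      storage_options:
        fo: "https://example.org/path/to/references.json"  # hypothetical
        remote_protocol: "s3"
```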
