Cataloging pangeo-forge datasets #25

Open · rabernat opened this issue Oct 26, 2020 · 3 comments
Labels: cataloging, design question

@rabernat (Contributor) commented Oct 26, 2020

When a dataset gets "published" by pangeo forge, we want to create a catalog entry for it. Some options for this catalog entry are:

  1. Intake (what we currently do)
  2. STAC
  3. Some custom database

Regardless of how we do it, we will need to collect some meta-metadata from the pipeline creator about the dataset, such as:

  • License
  • Provenance
  • STAC stuff like "providers"

Pangeo forge should provide all the technical entries to the catalog itself, such as (see the sketch after this list):

  • URL
  • Access protocol (e.g. zarr)
  • Requester pays status
  • Cloud region
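
To make this concrete, here is a rough sketch of how the creator-supplied meta-metadata and the pangeo-forge-generated technical fields might combine into a single catalog entry. All field names and values below are illustrative, not a settled schema:

```python
# Illustrative only: neither the field names nor the values are a settled
# schema for pangeo-forge catalog entries.
entry = {
    # supplied by the pipeline creator
    "license": "CC-BY-4.0",
    "provenance": "Derived from NOAA OISST v2.1 source files",
    "providers": [{"name": "NOAA", "roles": ["producer"]}],
    # filled in automatically by pangeo forge at publish time
    "url": "gs://pangeo-forge/noaa_oisst.zarr",
    "protocol": "zarr",
    "requester_pays": False,
    "region": "us-central1",
}
```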

Related issue from @charlesbluca: in order to make zarr metadata browseable, we need

  • .zmetadata fully public (not requester pays)
  • CORS set correctly
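
For both points, a minimal sketch using the google-cloud-storage client (the bucket name and store path are hypothetical, and a production CORS rule would probably restrict origins rather than allow `*`):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-forge")  # hypothetical bucket name

# 1. Make the consolidated metadata object itself publicly readable, so
#    browsers can fetch it without requester-pays credentials.
bucket.blob("noaa_oisst.zarr/.zmetadata").make_public()

# 2. Attach a permissive CORS rule so browser-based tools can read it.
bucket.cors = [
    {
        "origin": ["*"],
        "method": ["GET", "HEAD"],
        "responseHeader": ["Content-Type"],
        "maxAgeSeconds": 3600,
    }
]
bucket.patch()
```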
@charlesbluca (Member)

Thanks for opening this issue, Ryan. Just to make sure I'm on the right track here, a projected plan for implementing these features would look like:

  • Adding tasks intake.py, stac.py, csv.py, etc. (see the sketch after this list):

    • Intake task would create an Intake catalog with the specified target as an entry, or append an entry to this catalog if it already exists
    • STAC task would create a new parent STAC catalog at the root of the target's bucket with a link to a new STAC collection representing the target (dependent on access protocol), or append the link to a new collection if the parent catalog already exists
    • CSV task would create a CSV file with the target as an entry (including specified column names/values), or append an entry if it already exists
  • Incorporating these tasks into new or existing pipelines:

    • A new pipeline would check the structure of previously published datasets and create catalogs for them
    • An addition to an existing pipeline would perform the creation of/appending to the catalogs immediately after the successful publishing of a dataset
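
As a rough illustration of the Intake task's create-or-append logic, something like the following; the helper name, catalog layout, and driver args are assumptions on my part, not a settled design:

```python
import os
import yaml  # requires PyYAML


def update_intake_catalog(catalog_path, entry_name, target_url):
    # Load the existing catalog, or start a fresh one if none exists yet.
    if os.path.exists(catalog_path):
        with open(catalog_path) as f:
            catalog = yaml.safe_load(f) or {"sources": {}}
    else:
        catalog = {"sources": {}}
    # Add (or overwrite) an entry pointing at the published store, using the
    # zarr driver provided by intake-xarray.
    catalog["sources"][entry_name] = {
        "driver": "zarr",
        "args": {"urlpath": target_url, "consolidated": True},
    }
    with open(catalog_path, "w") as f:
        yaml.safe_dump(catalog, f)


# Hypothetical catalog path, entry name, and target URL:
update_intake_catalog("master.yaml", "noaa_oisst", "gs://pangeo-forge/noaa_oisst.zarr")
```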

In terms of input from the pipeline creator, I think some other meta-metadata that we could collect would be:

  • Catalog/collection location - this could default to the root of the bucket/target directory, but making it configurable would help if we eventually want to create more complex nested catalogs for published datasets
  • Entry names - again, these could default to names based on the target or source directories, but allowing custom names would move us closer to the user-generated catalogs we have on pangeo-datastore
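
For the naming default, something as simple as this could work (purely illustrative):

```python
import os


def default_entry_name(target_url):
    # e.g. "gs://pangeo-forge/noaa_oisst.zarr" -> "noaa_oisst"
    base = os.path.basename(target_url.rstrip("/"))
    return os.path.splitext(base)[0]
```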

Finally, if we opt to add these tasks to an existing pipeline, that may require expanding the naming convention currently under consideration - to something like {source}_{container}_{target}_{catalogFormat}.py.

@rabernat (Contributor, Author) commented Nov 5, 2020

Thanks for this Charles!

I've been thinking about this issue A LOT... I think that, before we can integrate catalog updates into pangeo forge itself, we need to settle on a catalog format, structure, and creation / update procedure.

This brings us back to STAC...

I'm working on some ideas here and will send an update soon.

@charlesbluca (Member)

Excited to hear what ideas you have in mind - in the meantime, I'll be consulting PySTAC's API reference to get a feel for generating/updating absolute published catalogs.
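
For anyone following along, this is roughly the PySTAC pattern I expect to experiment with; the ids, description, extent, and root_href below are placeholders:

```python
import datetime

import pystac

# Root catalog for all pangeo-forge datasets (id/description are placeholders).
catalog = pystac.Catalog(id="pangeo-forge", description="Pangeo Forge datasets")

# One collection per published dataset; the extent here is a dummy value.
collection = pystac.Collection(
    id="noaa-oisst",
    description="Example collection for a published dataset",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime.datetime(1981, 9, 1), None]]),
    ),
)
catalog.add_child(collection)

# ABSOLUTE_PUBLISHED writes self links with absolute hrefs rooted at root_href,
# which is what a catalog served from object storage needs.
catalog.normalize_and_save(
    root_href="https://storage.googleapis.com/pangeo-forge/catalog",
    catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED,
)
```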
