Cataloging pangeo-forge datasets #25

Open · rabernat opened this issue Oct 26, 2020 · 3 comments
Labels: cataloging, design question

@rabernat (Contributor) commented Oct 26, 2020

When a dataset gets "published" by pangeo forge, we want to create a catalog entry for it. Some options for this catalog entry are:

  1. Intake (what we currently do)
  2. STAC
  3. Some custom database

Regardless of how we do it, we will need to collect some meta-metadata from the pipeline creator about the dataset, such as:

  • License
  • Provenance
  • STAC stuff like "providers"

Pangeo forge should provide all the technical entries to the catalog itself, such as (see the sketch after this list):

  • URL
  • Access protocol (e.g. zarr)
  • Requester pays status
  • Cloud region
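
To make this concrete, here is a rough sketch of how the creator-supplied meta-metadata and the pangeo-forge-generated technical fields might combine into a single catalog entry. All field names and values below are illustrative, not a settled schema:

```python
# Illustrative only: neither the field names nor the values are a settled
# schema for pangeo-forge catalog entries.
entry = {
    # supplied by the pipeline creator
    "license": "CC-BY-4.0",
    "provenance": "Derived from NOAA OISST v2.1 source files",
    "providers": [{"name": "NOAA", "roles": ["producer"]}],
    # filled in automatically by pangeo forge at publish time
    "url": "gs://pangeo-forge/noaa_oisst.zarr",
    "protocol": "zarr",
    "requester_pays": False,
    "region": "us-central1",
}
```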

Related issue from @charlesbluca: in order to make zarr metadata browseable, we need

  • .zmetadata fully public (not requester pays)
  • CORS set correctly
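
For both points, a minimal sketch using the google-cloud-storage client (the bucket name and store path are hypothetical, and a production CORS rule would probably restrict origins rather than allow `*`):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-forge")  # hypothetical bucket name

# 1. Make the consolidated metadata object itself publicly readable, so
#    browsers can fetch it without requester-pays credentials.
bucket.blob("noaa_oisst.zarr/.zmetadata").make_public()

# 2. Attach a permissive CORS rule so browser-based tools can read it.
bucket.cors = [
    {
        "origin": ["*"],
        "method": ["GET", "HEAD"],
        "responseHeader": ["Content-Type"],
        "maxAgeSeconds": 3600,
    }
]
bucket.patch()
```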
@charlesbluca (Member)

Thanks for opening this issue, Ryan. Just to make sure I'm on the right track here, a projected plan for implementing these features would look like:

  • Adding tasks intake.py, stac.py, csv.py, etc. (see the sketch after this list):

    • Intake task would create an Intake catalog with the specified target as an entry, or append an entry to this catalog if it already exists
    • STAC task would create a new parent STAC catalog at the root of the target's bucket with a link to a new STAC collection representing the target (dependent on access protocol), or append the link to a new collection if the parent catalog already exists
    • CSV task would create a CSV file with the target as an entry (including specified column names/values), or append an entry if it already exists
  • Incorporating these tasks into new or existing pipelines:

    • A new pipeline would check the structure of previously published datasets and create catalogs for them
    • An addition to an existing pipeline would perform the creation of/appending to the catalogs immediately after the successful publishing of a dataset
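
As a rough illustration of the Intake task's create-or-append logic, something like the following; the helper name, catalog layout, and driver args are assumptions on my part, not a settled design:

```python
import os
import yaml  # requires PyYAML


def update_intake_catalog(catalog_path, entry_name, target_url):
    # Load the existing catalog, or start a fresh one if none exists yet.
    if os.path.exists(catalog_path):
        with open(catalog_path) as f:
            catalog = yaml.safe_load(f) or {"sources": {}}
    else:
        catalog = {"sources": {}}
    # Add (or overwrite) an entry pointing at the published store, using the
    # zarr driver provided by intake-xarray.
    catalog["sources"][entry_name] = {
        "driver": "zarr",
        "args": {"urlpath": target_url, "consolidated": True},
    }
    with open(catalog_path, "w") as f:
        yaml.safe_dump(catalog, f)


# Hypothetical catalog path, entry name, and target URL:
update_intake_catalog("master.yaml", "noaa_oisst", "gs://pangeo-forge/noaa_oisst.zarr")
```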

In terms of input from the pipeline creator, I think some other meta-metadata that we could collect would be:

  • Catalog/collection location - this could default to the root of the bucket/target directory, but making it configurable would help if we eventually want to create more complex nested catalogs for published datasets
  • Entry names - again, these could default to names based on the target or source directories, but allowing custom names would move us closer to the user-generated catalogs we have on pangeo-datastore
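
For the naming default, something as simple as this could work (purely illustrative):

```python
import os


def default_entry_name(target_url):
    # e.g. "gs://pangeo-forge/noaa_oisst.zarr" -> "noaa_oisst"
    base = os.path.basename(target_url.rstrip("/"))
    return os.path.splitext(base)[0]
```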

Finally, if we opt to add these tasks to an existing pipeline, that may require expanding the naming convention currently under consideration - to something like {source}_{container}_{target}_{catalogFormat}.py.

@rabernat (Contributor, Author) commented Nov 5, 2020

Thanks for this Charles!

I've been thinking about this issue A LOT... I think that, before we can integrate catalog updates into pangeo forge itself, we need to settle on a catalog format, structure, and creation / update procedure.

This brings us back to STAC...

I'm working on some ideas here and will send an update soon.

@charlesbluca (Member)

Excited to hear what ideas you have in mind - in the meantime, I'll be consulting PySTAC's API reference to get a feel for generating/updating absolute published catalogs.
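
For anyone following along, this is roughly the PySTAC pattern I expect to experiment with; the ids, description, extent, and root_href below are placeholders:

```python
import datetime

import pystac

# Root catalog for all pangeo-forge datasets (id/description are placeholders).
catalog = pystac.Catalog(id="pangeo-forge", description="Pangeo Forge datasets")

# One collection per published dataset; the extent here is a dummy value.
collection = pystac.Collection(
    id="noaa-oisst",
    description="Example collection for a published dataset",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime.datetime(1981, 9, 1), None]]),
    ),
)
catalog.add_child(collection)

# ABSOLUTE_PUBLISHED writes self links with absolute hrefs rooted at root_href,
# which is what a catalog served from object storage needs.
catalog.normalize_and_save(
    root_href="https://storage.googleapis.com/pangeo-forge/catalog",
    catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED,
)
```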
