
How to merge metadata from individual netcdf files when using NetCDFtoZarrSequentialRecipe? #109

Open
naomi-henderson opened this issue Apr 27, 2021 · 2 comments


@naomi-henderson (Contributor)

When concatenating netcdf files into a single Zarr store, I need to preserve the unique identifier (called tracking_id in CMIP6) from each netcdf file and place them all into the zarr store's metadata. I have been reading them for all netcdf files and concatenating them with a newline separator, creating a netcdf_tracking_ids key in the dataset before saving as zarr. I am not sure how or if this can be done in the recipe, so am thinking of modifying the zarr metadata afterwards. The trouble is that I have to open all of the netcdf files again in order to get their tracking_ids.
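The manual workaround described above can be sketched in a few lines. This is a hypothetical illustration, not recipe code: `file_attrs` stands in for the global attributes read from each netCDF file (in practice each dict would come from something like `xr.open_dataset(path).attrs`).

```python
# Sketch of the manual workaround: collect each netCDF file's
# tracking_id and join them with a newline separator, producing the
# value stored under the netcdf_tracking_ids key before writing zarr.

def merge_tracking_ids(file_attrs):
    """Concatenate per-file tracking_ids into one newline-separated string."""
    return "\n".join(attrs["tracking_id"] for attrs in file_attrs)

# Stand-in data; real attrs would be read from the netCDF files.
file_attrs = [
    {"tracking_id": "hdl:21.14100/aaaa"},
    {"tracking_id": "hdl:21.14100/bbbb"},
]
netcdf_tracking_ids = merge_tracking_ids(file_attrs)
```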

In addition, I would like to add a new attribute (containing information such as the date when the recipe was used to create the new dataset). Could this be done in finalize_target?

Suggestions?

@rabernat
Contributor

rabernat commented Apr 27, 2021

First, note that the recipe syntax and class names have changed considerably since #101. See the latest docs for the new syntax. This was a necessary refactor to simplify the internal code structure. Hopefully it will not cause too much confusion. You can get back to the old syntax if you want by using version 0.2.0.

What you want does not work right now but should be straightforward to implement. If nitems_per_input=None, that triggers a call to cache_input_metadata:

https://github.com/pangeo-forge/pangeo-forge/blob/5c80733a1002c3b5f281e373746c9bc580453204/pangeo_forge/recipes/xarray_zarr.py#L442-L447

This function dumps the JSON schema for the file (including all metadata) into the metadata cache. So the tracking ids are already recorded; there is no need to reopen the files.
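To illustrate, harvesting the tracking ids back out of such a cache could look roughly like this. The cache is mocked here as a plain dict of JSON strings; the `"attrs"` key and file names are assumptions about the cached schema's shape, not the actual pangeo-forge layout.

```python
import json

# Hypothetical sketch: one cached JSON document per input, each holding
# that file's global attributes, from which tracking_ids are collected
# without reopening any netCDF file.
cache = {
    "input-0.json": json.dumps({"attrs": {"tracking_id": "hdl:21.14100/aaaa"}}),
    "input-1.json": json.dumps({"attrs": {"tracking_id": "hdl:21.14100/bbbb"}}),
}

tracking_ids = [
    json.loads(doc)["attrs"]["tracking_id"]
    for _key, doc in sorted(cache.items())
]
```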

I see two general options for how we could implement this:

  • Add a new feature for "metadata merging" which would allow specifying custom logic for how to combine the metadata from each input into the target metadata. You could imagine doing something like this:
    recipe = XarrayZarrRecipe(file_pattern, merge_metadata_opts={'tracking_id': 'concat'})
    or something along those lines.
  • Add a callback in either prepare_target or finalize_target that allows you to pass a function to do arbitrary things within those steps, similar to process_chunk and process_input. This would be more flexible and general, but would also make it easy to break things.
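To make the first option concrete, here is a hypothetical sketch of how a merge_metadata_opts mapping might be interpreted internally. The names `RULES` and `merge_metadata` are illustrative, not part of pangeo-forge:

```python
# Hypothetical interpreter for merge_metadata_opts: each attribute key
# maps to a named rule applied over the list of per-input values.
RULES = {
    "concat": lambda values: "\n".join(values),  # join all values
    "first": lambda values: values[0],           # keep the first value
}

def merge_metadata(per_input_attrs, opts):
    """Combine per-input attrs into target attrs according to `opts`."""
    return {
        key: RULES[rule]([attrs[key] for attrs in per_input_attrs])
        for key, rule in opts.items()
    }
```

A callback-based design (the second option) would instead hand the full list of per-input attrs to a user function, trading this declarative simplicity for generality.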

In addition, I would like to add a new attribute (containing information such as the date when the recipe was used to create the new dataset). Could this be done in finalize_target?

This will be necessary for the bakeries. They will want to sign the datasets they produce with several custom fields.

Again we have two options:

  • New option for arbitrary extra metadata to add to the recipe, e.g. XarrayZarrRecipe(..., extra_metadata={'creation_date': '2021-04-27'})
  • Support arbitrary callbacks in prepare_target or finalize_target.
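Mechanically, the first option would amount to merging the user-supplied dict into the target's attributes during finalize_target. A minimal sketch, with all names illustrative:

```python
# Minimal sketch of applying extra_metadata at finalize time: merge the
# user-supplied dict into the target's (e.g. consolidated zarr) attrs.

def apply_extra_metadata(target_attrs, extra_metadata):
    """Return target attrs updated with bakery-supplied fields."""
    merged = dict(target_attrs)
    merged.update(extra_metadata)
    return merged

attrs = apply_extra_metadata(
    {"title": "CMIP6 dataset"},
    {"creation_date": "2021-04-27", "bakery": "example-bakery"},
)
```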

@naomi-henderson do you have a preference?

@naomi-henderson (Contributor, Author)

@rabernat, I like the simplicity of just adding options to the recipe rather than creating functions to pass. But functions would add more generality and avoid coding many special-purpose options, I suppose.
