Add reference HDF recipe type #174
Conversation
The existing test fixtures, e.g. pangeo-forge-recipes/tests/conftest.py (lines 97 to 99 in e1749b7), provide netCDF files that you could use to test this recipe... whenever you're ready. Let me know how I can help here.
Note the interesting facet here: the recipe produces both a combined JSON file for the dataset, and an intake YAML file that knows how to load it.
(NB: TODOs could become issues after merger, but not ones that I intend to work on immediately)
@rabernat, any thoughts here?
This looks like a great start!
My main observation is that you are assuming access to a shared global filesystem. This works for the test environment but won't work in the cloud.
It would be great if you could push all i/o through the classes and functions in storage.py, e.g. MetadataTarget, etc. That will allow these recipes to easily plug in to the bakeries. I know this may seem inconvenient, but I think there are advantages to having all i/o happen in one place.
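A sketch of that one-place-for-i/o idea using plain fsspec (`ToyMetadataTarget` is a hypothetical stand-in, not the actual storage.py class): all reads and writes go through a filesystem object plus a root path, so swapping local storage for cloud storage becomes a configuration change rather than a code change.

```python
import json
import posixpath

import fsspec


class ToyMetadataTarget:
    """Hypothetical stand-in for storage.py's MetadataTarget: every read
    and write goes through an fsspec filesystem, so local vs. cloud is
    just a matter of which filesystem object is passed in."""

    def __init__(self, fs, root):
        self.fs = fs
        self.root = root

    def __setitem__(self, key, value):
        with self.fs.open(posixpath.join(self.root, key), "w") as f:
            json.dump(value, f)

    def __getitem__(self, key):
        with self.fs.open(posixpath.join(self.root, key)) as f:
            return json.load(f)


# The same code would work with fsspec.filesystem("s3") in the cloud.
target = ToyMetadataTarget(fsspec.filesystem("memory"), "/meta")
target["chunk-0.json"] = {"shape": [180, 2]}
print(target["chunk-0.json"])  # {'shape': [180, 2]}
```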
```python
def iter_chunks(self) -> Iterable[Hashable]:
    """Iterate over all target chunks."""
    return fsspec.open_files(self.netcdf_url, **self.netcdf_storage_options)
```
What does this return? Open file objects?
Would be nice to plug in FilePattern here.
It returns an OpenFiles instance, which will open all contained files when encountering a `with`, or can be iterated to expose individual OpenFile instances.
What would we do with a FilePattern? We don't have logic to infer coordinates or add them into the chunking scheme. It would be nice in the long run, but tricky.
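The OpenFiles behavior described above can be sketched as follows (file names and contents here are made up, using fsspec's in-memory filesystem):

```python
import fsspec

# Create two small files on fsspec's in-memory filesystem for the demo.
mem = fsspec.filesystem("memory")
mem.pipe("/data/a.txt", b"aaa")
mem.pipe("/data/b.txt", b"bbb")

files = fsspec.open_files("memory://data/*.txt")  # an OpenFiles instance

# Entering the context opens every contained file at once:
with files as handles:
    contents = [h.read() for h in handles]

# Or iterate to expose individual OpenFile instances:
for of in files:
    with of as f:
        f.read()

print(contents)  # [b'aaa', b'bbb']
```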
> What would we do with a FilePattern?
Mostly the point is to align with the rest of Pangeo Forge.
I would recommend using this instead (pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py, lines 882 to 884 in c486771):

```python
def iter_inputs(self) -> Iterator[InputKey]:
    for input_key in self.file_pattern:
        yield input_key
```

and then using the input key to look up the url from the FilePattern object, e.g.

```python
fname = file_pattern[input_key]
```
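The iterate-keys-then-index protocol suggested here can be sketched with a minimal stand-in (`ToyFilePattern` is hypothetical, not the real FilePattern class):

```python
from typing import Dict, Iterator


class ToyFilePattern:
    """Hypothetical stand-in for pangeo-forge's FilePattern: iterating
    yields input keys, and indexing by a key returns the file's url."""

    def __init__(self, urls: Dict[int, str]):
        self._urls = urls

    def __iter__(self) -> Iterator[int]:
        return iter(self._urls)

    def __getitem__(self, key: int) -> str:
        return self._urls[key]


file_pattern = ToyFilePattern({0: "s3://bucket/a.nc", 1: "s3://bucket/b.nc"})


def iter_inputs(pattern) -> Iterator[int]:
    for input_key in pattern:
        yield input_key


# Look up each url from the pattern object using the input key:
fnames = [file_pattern[input_key] for input_key in iter_inputs(file_pattern)]
print(fnames)  # ['s3://bucket/a.nc', 's3://bucket/b.nc']
```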
FYI, working on this today.
@martindurant - I'm working on making a tutorial example for this recipe. I decided to use the CMIP6 netCDF files stored in this directory:

```python
import s3fs

fs = s3fs.S3FileSystem(anon=True)
base_path = 's3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/'
all_paths = fs.ls(base_path)
```

I'm creating the following FilePattern:

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence

pattern = pattern_from_file_sequence(['s3://' + path for path in all_paths], 'time')
```

And the following recipe:

```python
from pangeo_forge_recipes.recipes.reference_hdf_zarr import ReferenceHDFRecipe

rec = ReferenceHDFRecipe(
    pattern,
    xarray_concat_args={"dim": "time"}  # note: this could be discovered automatically from the pattern
)
```

I assigned storage as follows:

```python
import tempfile
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import FSSpecTarget, MetadataTarget

fs_local = LocalFileSystem()

cache_dir = tempfile.TemporaryDirectory()
metadata_cache = MetadataTarget(fs_local, cache_dir.name)

target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

rec.target = target
rec.metadata_cache = metadata_cache
```

And executed as follows:

```python
from dask.diagnostics import ProgressBar

delayed = rec.to_dask()
with ProgressBar():
    delayed.compute()  # took 1 hr 6 min on my computer 😱
```

The files it produced (…). I am able to open the file using fsspec + xarray:

```python
import fsspec
import xarray as xr

m = fsspec.get_mapper(
    "reference://",
    fo=target._full_path("reference.json"),
    target_protocol="file",
    remote_protocol="s3",
    skip_instance_cache=True,
)
ds = xr.open_dataset(m, engine='zarr', backend_kwargs={'consolidated': False}, chunks={})
print(ds)
```

🎉 This is great!

However, I can't get the intake method to work. For reference, the intake catalog is just:

```yaml
sources:
  data:
    args:
      chunks: {}
      consolidated: false
      storage_options:
        fo: /var/folders/n8/63q49ms55wxcj_gfbtykwp5r0000gn/T/tmp69uw_f7l/reference.json
        remote_options: {}
        remote_protocol: s3
        skip_instance_cache: true
        target_options: {}
        target_protocol: file
      urlpath: reference://
    description: ''
    driver: intake_xarray.xzarr.ZarrSource
```

```python
import intake

cat = intake.open_catalog(target._full_path("reference.yaml"))
cat.data.read()
```

Gives this error:

```
ValueError: cannot reshape array of size 360 into shape (240,180,2)
```

Any idea what might be going on?
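For readers unfamiliar with the `reference://` protocol used above: fsspec's ReferenceFileSystem turns a mapping from keys to contents (or to byte ranges in remote files) into a read-only filesystem. A toy illustration with inline values only; the real reference.json instead maps zarr chunk keys to byte ranges in the netCDF files on S3:

```python
import fsspec

# A hypothetical, minimal reference set: keys map to inline contents.
# reference-maker's real output maps zarr keys to [url, offset, length]
# byte ranges inside the original HDF5/netCDF files.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "note": "all reads are served from this mapping",
}
fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat(".zgroup"))  # b'{"zarr_format": 2}'
```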
I think I did exactly the same as you and got reasonable output, or with the catalog file (having edited the url to …).
Doubtless my environment doesn't quite match yours?
Thanks for the quick reply Martin! Yes, this does work for me 👍:

```python
import intake_xarray

s = intake_xarray.xzarr.ZarrSource(
    "reference://",
    storage_options=dict(
        fo=target._full_path("reference.json"),
        remote_protocol="s3",
        chunks={}
    ),
    consolidated=False,
)
s.to_dask()
```

And so does this!

```python
import intake

cat = intake.open_catalog(target._full_path("reference.yaml"))
cat.data.to_dask()
```

The difference is that before I was calling …
And I found it...

```python
import intake

cat = intake.open_catalog(target._full_path("reference.yaml"))
ds = cat.data.to_dask()
ds.lat_bnds.values
# -> ValueError: cannot reshape array of size 360 into shape (240,180,2)
```
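The arithmetic behind that error can be reproduced directly with numpy; the shapes are taken from the traceback (240 presumably being the length of the time concat dimension the variable wrongly gained):

```python
import numpy as np

raw = np.arange(360)            # lat_bnds really stores 360 values per file
ok = raw.reshape(180, 2)        # the shape stored in each input file: fine
try:
    raw.reshape(240, 180, 2)    # the shape the combined metadata claims
except ValueError as err:
    print(err)  # cannot reshape array of size 360 into shape (240,180,2)
```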
OK, so there is a bug in reference-maker combine: in this case, the variable is constant and should not have gained the extra dimension. In MultiZarrToZarr, the variable should have fallen under case c); I can run a debugger to see why that didn't happen - do you have the original reference files around? A note: …
I think that in this specific case, you needed the extra argument at the creation of the aggregate JSON file. In general, you can indeed override any specific arguments when instantiating an intake source.
Unfortunately, if you wanted to update an argument that was embedded within another structure, such as one of the components of …
Will this close #70?
@cisaacstern - yes and no. This kind of recipe certainly is one way of generating metadata, as described in the issue. I still think it's a valid discussion, though, unless we can convince ourselves that reference-maker does/can give all the metadata that we might want.
Just a note on some changes I've made: I feel like I don't want to have optional dependencies for pangeo-forge-recipes right now. This introduces unnecessary complexity in our test suite and CI. So here I would just prefer to have fsspec-reference-maker, intake, and h5py become required dependencies. How does that sound?
Intake/intake-xarray is only required for testing here, so it doesn't need to be in the formal requirements. You should be warned that the long-term plan for reference-maker is to support many file types (e.g., grib2), and you probably don't want to add all the dependencies. h5py should be the most common one, though, and we already use that. reference-maker will probably not depend on the various format-specific packages directly, only xarray and zarr.
ci/py3.8.yml (outdated)

```diff
@@ -40,4 +40,4 @@ dependencies:
   - pip
   - pip:
     - pytest-timeout
-    - git+https://github.com/intake/fsspec-reference-maker
+    - sspec-reference-maker
```
missing "f"
There is some weird stuff in the rtd build...
Maybe we should be pinning dependencies more?
Something about the ci changes has made it take forever to build the python 3.8 environment. I'm going to try unpinning some versions and see if it improves. https://github.com/pangeo-forge/pangeo-forge-recipes/runs/3424372828?check_suite_focus=true |
@martindurant - from my point of view this is good to go. Would appreciate your check on the items below.
Also your feedback on the tutorial and docs for the Reference recipe would be great.
@cisaacstern if you're interested, a review would be helpful here. When you approve, we can merge this PR and add our second Recipe class to the package! 🎉 🎊 Having other Recipe classes is very important because it will help us generalize some of the code patterns. For example, I'm very pleased we can reuse both the FilePattern and the Target stuff here.
Very cool.
My main question/suggestion is around the name of the new recipe class. Comment about that provided in-line on the recipe module.
```python
from pangeo_forge_recipes.recipes.base import BaseRecipe
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe
```
Given that the `ReferenceHDFRecipe` test module is called `test_reference.py`, should this module be renamed to `test_xarray_zarr.py`?
Agree.
We can fix this in #167, which also includes test refactors.
I don't see anything remaining to be done here.