Conformant ZarrV3 codecs and fill values (#193)

* Generate chunk manifest backed variable from HDF5 dataset.
* Transfer dataset attrs to variable.
* Get virtual variables dict from HDF5 file.
* Update virtual_vars_from_hdf to use fsspec and drop_variables arg.
* mypy fix to use ChunkKey and empty dimensions list.
* Extract attributes from hdf5 root group.
* Use hdf reader for netcdf4 files.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix ruff complaints.
* First steps for handling HDF5 filters.
* Initial step for hdf5plugin supported codecs.
* Small commit to check compression support in CI environment.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix mypy complaints for hdf_filters.
* Local pre-commit fix for hdf_filters.
* Use fsspec reader_options introduced in #37.
* Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.
* Fix early return from hdf _extract_attrs.
* Test that _extract_attrs correctly handles multiple attributes.
* Initial attempt at scale and offset via numcodecs.
* Tests for cfcodec_from_dataset.
* Temporarily relax integration tests to assert_allclose.
* Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.
* Check for compatibility with netcdf4 engine.
* Use separate fixtures for h5netcdf and netcdf4 compression styles.
* Print libhdf5 and libnetcdf4 versions to confirm compiled environment.
* Skip netcdf4 style compression tests when libhdf5 < 1.14.
* Include imagecodecs.numcodecs to support HDF5 lzf filters.
* Remove test that verifies call to read_kerchunk_references_from_file.
* Add additional codec support structures for imagecodecs and numcodecs.
* Add codec config test for Zstd.
* Include initial cf decoding tests.
* Revert typo for scale_factor retrieval.
* Update reader to use new numpy manifest representation.
* Temporarily skip test until blosc netcdf4 issue is solved.
* Fix Pydantic 2 migration warnings.
* Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.
* Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.
* Mamba attempt with latest imagecodecs release.
* Use correct iter_chunks callback function signature.
* Include pip based imagecodecs-numcodecs until conda-forge availability.
* Handle non-coordinate dims which are serialized to hdf as empty dataset.
* Use reader_options for filetype check and update failing kerchunk call.
* Fix chunkmanifest shaping for chunked datasets.
* Handle scale_factor attribute serialization for compressed files.
* Include chunked roundtrip fixture.
* Standardize xarray integration tests for hdf filters.
* Update reader selection logic for new filetype determination.
* Use decode_times for integration test.
* Standardize fixture names for hdf5 vs netcdf4 file types.
* Handle array add_offset property for compressed data.
* Include h5py shuffle filter.
* Make ScaleAndOffset codec last in filters list.
* Apply ScaleAndOffset codec to _FillValue since its value is now downstream.
* Coerce scale and add_offset values to native float for JSON serialization.
* Conformant ZarrV3 codecs
* Update docs
* Update virtualizarr/zarr.py
  Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
* Update virtualizarr/zarr.py
  Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
* Change default_fill to 0s
* Generate permutation
* Pythonic isinstance check
* Add return type to isconfigurable
  Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
* Changes from pair programming for zarrv3 to kerchunk file reading
* Revert "Merge remote-tracking branch 'upstream/hdf5_reader' into codecs"
  This reverts commit 7a65fbd, reversing changes made to c051f04.
* Fix unit tests
* PR comments
* Remove kwarg in dict default

---------

Co-authored-by: sharkinsspatial <sharkinsgis@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
Co-authored-by: Tria McNeely <triamcnely@microsoft.com>
zarr-developers · Jul 22, 2024 · 10bd53d
1 parent 0ad4de5
Showing 8 changed files with 203 additions and 46 deletions.
diff --git a/docs/releases.rst b/docs/releases.rst
@@ -12,6 +12,9 @@ New Features
 Breaking changes
 ~~~~~~~~~~~~~~~~
 
+- Serialize valid ZarrV3 metadata and require full compressor numcodec config (for :pull:`193`)
+  By `Gustavo Hidalgo <https://github.com/ghidalgo3>`_.
+
 Deprecations
 ~~~~~~~~~~~~
 

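The release note above requires a full numcodecs-style compressor config (a dict with an "id" key) instead of a bare string such as "zlib". As a rough sketch of why the full dict is useful, the hypothetical helper below maps such a config onto the "name"/"configuration" shape that ZarrV3 metadata uses for codec entries. The function name and the exact mapping are assumptions for illustration; they are not taken from virtualizarr/zarr.py.

```python
def numcodec_config_to_configurable(config: dict) -> dict:
    """Hypothetical helper: turn {"id": ..., **params} into a ZarrV3-style entry."""
    # A bare string such as "zlib" carries no parameters, so it is rejected here.
    if not isinstance(config, dict) or "id" not in config:
        raise ValueError("expected a full numcodecs config dict with an 'id' key")
    # Everything except "id" becomes the codec's configuration.
    params = {key: value for key, value in config.items() if key != "id"}
    return {"name": config["id"], "configuration": params}


zlib_entry = numcodec_config_to_configurable({"id": "zlib", "level": 1})
print(zlib_entry)
```

With only the string "zlib" there would be nothing to put under "configuration", which is one plausible reason the tests in this commit switch to `{"id": "zlib", "level": 1}`.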
diff --git a/virtualizarr/kerchunk.py b/virtualizarr/kerchunk.py
@@ -266,7 +266,7 @@ def variable_to_kerchunk_arr_refs(var: xr.Variable, var_name: str) -> KerchunkAr
             for chunk_key, entry in marr.manifest.dict().items()
         }
 
-        zarray = marr.zarray
+        zarray = marr.zarray.replace(zarr_format=2)
 
     else:
         try:

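The hunk above calls `marr.zarray.replace(zarr_format=2)`, presumably so that the kerchunk references carry Zarr v2 style metadata even when the in-memory array is marked as format 3. A minimal sketch of that copy-with-override pattern follows, using a frozen dataclass as a stand-in. `ArrayMeta` and its fields are hypothetical and far smaller than the real `ZArray`, whose `replace` method may be implemented differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArrayMeta:
    """Hypothetical stand-in for ZArray with only a few fields."""

    shape: tuple
    chunks: tuple
    zarr_format: int = 2


v3_meta = ArrayMeta(shape=(2, 3), chunks=(2, 3), zarr_format=3)
# replace() returns a new object; the original metadata is left untouched.
v2_meta = replace(v3_meta, zarr_format=2)
print(v3_meta.zarr_format, v2_meta.zarr_format)
```

Returning a modified copy keeps the `ManifestArray` itself unchanged, so writing kerchunk references does not silently downgrade the array that the caller still holds.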
diff --git a/virtualizarr/tests/__init__.py b/virtualizarr/tests/__init__.py
@@ -48,9 +48,9 @@ def create_manifestarray(
 
     zarray = ZArray(
         chunks=chunks,
-        compressor="zlib",
+        compressor={"id": "blosc", "clevel": 5, "cname": "lz4", "shuffle": 1},
         dtype=np.dtype("float32"),
-        fill_value=0.0,  # TODO change this to NaN?
+        fill_value=0.0,
         filters=None,
         order="C",
         shape=shape,

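The fixture above keeps a plain `0.0` fill value and drops the NaN TODO. One reason a NaN fill value needs care can be shown with the standard library alone: Python's `json` module writes NaN as the bare token `NaN`, which is not valid JSON, so a metadata writer has to encode it explicitly. The ZarrV3 core specification represents non-finite float fill values as the strings "NaN", "Infinity" and "-Infinity". The `encode_fill_value` helper below is a hypothetical illustration of that rule, not the function used in VirtualiZarr.

```python
import json
import math


def encode_fill_value(value):
    """Hypothetical helper: make a float fill value safe for strict JSON."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value


# Default behaviour: json emits the non-standard token NaN.
print(json.dumps({"fill_value": float("nan")}))

# Strict mode refuses to serialize NaN at all.
try:
    json.dumps({"fill_value": float("nan")}, allow_nan=False)
except ValueError as err:
    print("strict JSON rejects NaN:", err)

# After encoding, the value passes strict serialization.
print(json.dumps({"fill_value": encode_fill_value(float("nan"))}, allow_nan=False))
```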
diff --git a/virtualizarr/tests/test_integration.py b/virtualizarr/tests/test_integration.py
@@ -138,7 +138,7 @@ def test_non_dimension_coordinates(self, tmpdir, format):
         # regression test for GH issue #105
 
         # set up example xarray dataset containing non-dimension coordinate variables
-        ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6).reshape(2, 3))})
+        ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6.0).reshape(2, 3))})
 
         # save it to disk as netCDF (in temporary directory)
         ds.to_netcdf(f"{tmpdir}/non_dim_coords.nc")

diff --git a/virtualizarr/tests/test_manifests/test_array.py b/virtualizarr/tests/test_manifests/test_array.py
@@ -19,7 +19,7 @@ def test_create_manifestarray(self):
         shape = (5, 2, 20)
         zarray = ZArray(
             chunks=chunks,
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -74,7 +74,7 @@ def test_equals(self):
         shape = (5, 2, 20)
         zarray = ZArray(
             chunks=chunks,
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -95,7 +95,7 @@ def test_not_equal_chunk_entries(self):
         # both manifest arrays in this example have the same zarray properties
         zarray = ZArray(
             chunks=(5, 1, 10),
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -209,7 +209,7 @@ def test_concat(self):
         # both manifest arrays in this example have the same zarray properties
         zarray = ZArray(
             chunks=(5, 1, 10),
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -254,7 +254,7 @@ def test_stack(self):
         # both manifest arrays in this example have the same zarray properties
         zarray = ZArray(
             chunks=(5, 10),
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -299,7 +299,7 @@ def test_refuse_combine():
 
     zarray_common = {
         "chunks": (5, 1, 10),
-        "compressor": "zlib",
+        "compressor": {"id": "zlib", "level": 1},
         "dtype": np.dtype("int32"),
         "fill_value": 0.0,
         "filters": None,

diff --git a/virtualizarr/tests/test_xarray.py b/virtualizarr/tests/test_xarray.py
@@ -19,7 +19,7 @@ def test_wrapping():
     dtype = np.dtype("int32")
     zarray = ZArray(
         chunks=chunks,
-        compressor="zlib",
+        compressor={"id": "zlib", "level": 1},
         dtype=dtype,
         fill_value=0.0,
         filters=None,
@@ -49,7 +49,7 @@ def test_equals(self):
         shape = (5, 20)
         zarray = ZArray(
             chunks=chunks,
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -86,7 +86,7 @@ def test_concat_along_existing_dim(self):
         # both manifest arrays in this example have the same zarray properties
         zarray = ZArray(
             chunks=(1, 10),
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -133,7 +133,7 @@ def test_concat_along_new_dim(self):
         # both manifest arrays in this example have the same zarray properties
         zarray = ZArray(
             chunks=(5, 10),
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,
@@ -183,7 +183,7 @@ def test_concat_dim_coords_along_existing_dim(self):
         # both manifest arrays in this example have the same zarray properties
         zarray = ZArray(
             chunks=(10,),
-            compressor="zlib",
+            compressor={"id": "zlib", "level": 1},
             dtype=np.dtype("int32"),
             fill_value=0.0,
             filters=None,

diff --git a/virtualizarr/tests/test_zarr.py b/virtualizarr/tests/test_zarr.py
@@ -1,12 +1,17 @@
+import json
+
 import numpy as np
+import pytest
 import xarray as xr
 import xarray.testing as xrt
 
 from virtualizarr import ManifestArray, open_virtual_dataset
 from virtualizarr.manifests.manifest import ChunkManifest
+from virtualizarr.zarr import dataset_to_zarr, metadata_from_zarr_json
 
 
-def test_zarr_v3_roundtrip(tmpdir):
+@pytest.fixture
+def vds_with_manifest_arrays() -> xr.Dataset:
     arr = ManifestArray(
         chunkmanifest=ChunkManifest(
             entries={"0.0": dict(path="test.nc", offset=6144, length=48)}
@@ -15,18 +20,61 @@ def test_zarr_v3_roundtrip(tmpdir):
             shape=(2, 3),
             dtype=np.dtype("<i8"),
             chunks=(2, 3),
-            compressor=None,
+            compressor={"id": "zlib", "level": 1},
             filters=None,
-            fill_value=np.nan,
+            fill_value=0,
             order="C",
             zarr_format=3,
         ),
     )
-    original = xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})
+    return xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})
+
+
+def isconfigurable(value: dict) -> bool:
+    """
+    Several metadata attributes in ZarrV3 use a dictionary with keys "name" : str and "configuration" : dict
+    """
+    return "name" in value and "configuration" in value
 
-    original.virtualize.to_zarr(tmpdir / "store.zarr")
+
+def test_zarr_v3_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
+    vds_with_manifest_arrays.virtualize.to_zarr(tmpdir / "store.zarr")
     roundtrip = open_virtual_dataset(
         tmpdir / "store.zarr", filetype="zarr_v3", indexes={}
     )
 
-    xrt.assert_identical(roundtrip, original)
+    xrt.assert_identical(roundtrip, vds_with_manifest_arrays)
+
+
+def test_metadata_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
+    dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
+    zarray, _, _ = metadata_from_zarr_json(tmpdir / "store.zarr/a/zarr.json")
+    assert zarray == vds_with_manifest_arrays.a.data.zarray
+
+
+def test_zarr_v3_metadata_conformance(tmpdir, vds_with_manifest_arrays: xr.Dataset):
+    """
+    Checks that the output metadata of an array variable conforms to this spec
+    for the required attributes:
+    https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#metadata
+    """
+    dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
+    # read the a variable's metadata
+    with open(tmpdir / "store.zarr/a/zarr.json", mode="r") as f:
+        metadata = json.loads(f.read())
+    assert metadata["zarr_format"] == 3
+    assert metadata["node_type"] == "array"
+    assert isinstance(metadata["shape"], list) and all(
+        isinstance(dim, int) for dim in metadata["shape"]
+    )
+    assert isinstance(metadata["data_type"], str) or isconfigurable(
+        metadata["data_type"]
+    )
+    assert isconfigurable(metadata["chunk_grid"])
+    assert isconfigurable(metadata["chunk_key_encoding"])
+    assert isinstance(metadata["fill_value"], (bool, int, float, str, list))
+    assert (
+        isinstance(metadata["codecs"], list)
+        and len(metadata["codecs"]) > 1
+        and all(isconfigurable(codec) for codec in metadata["codecs"])
+    )
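
To see what the conformance test accepts, the same checks can be run against a hand-written metadata document with no VirtualiZarr dependency. The sample below is an assumption written from the ZarrV3 core specification linked in the test docstring. It is not output captured from `dataset_to_zarr`, and the codec list in particular is illustrative (it reuses the numcodecs "zlib" id from the fixture). Unlike the test, this sketch only requires a non-empty codec list.

```python
import json


def isconfigurable(value) -> bool:
    """True for dicts shaped like {"name": ..., "configuration": ...}."""
    return isinstance(value, dict) and "name" in value and "configuration" in value


# Hypothetical sample document, not captured from VirtualiZarr.
SAMPLE = json.loads("""
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [2, 3],
  "data_type": "int64",
  "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [2, 3]}},
  "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
  "fill_value": 0,
  "codecs": [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "zlib", "configuration": {"level": 1}}
  ]
}
""")


def check_array_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if metadata.get("zarr_format") != 3:
        problems.append("zarr_format must be 3")
    if metadata.get("node_type") != "array":
        problems.append("node_type must be 'array'")
    shape = metadata.get("shape")
    if not (isinstance(shape, list) and all(isinstance(d, int) for d in shape)):
        problems.append("shape must be a list of integers")
    data_type = metadata.get("data_type")
    if not (isinstance(data_type, str) or isconfigurable(data_type)):
        problems.append("data_type must be a string or a configurable dict")
    for key in ("chunk_grid", "chunk_key_encoding"):
        if not isconfigurable(metadata.get(key)):
            problems.append(f"{key} must have 'name' and 'configuration'")
    if not isinstance(metadata.get("fill_value"), (bool, int, float, str, list)):
        problems.append("fill_value has an unsupported JSON type")
    codecs = metadata.get("codecs")
    if not (isinstance(codecs, list) and codecs and all(isconfigurable(c) for c in codecs)):
        problems.append("codecs must be a non-empty list of configurable entries")
    return problems


print(check_array_metadata(SAMPLE))
# A bare codec string, like the old compressor="zlib" style, is flagged.
print(check_array_metadata(dict(SAMPLE, codecs=["zlib"])))
```

Collecting problems in a list instead of asserting one by one reports every violation in a single pass, which can be handy when inspecting a `zarr.json` by hand.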