Conformant ZarrV3 codecs and fill values (#193)
* Generate chunk manifest backed variable from HDF5 dataset.

* Transfer dataset attrs to variable.

* Get virtual variables dict from HDF5 file.

* Update virtual_vars_from_hdf to use fsspec and drop_variables arg.

* mypy fix to use ChunkKey and empty dimensions list.

* Extract attributes from hdf5 root group.

* Use hdf reader for netcdf4 files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix ruff complaints.

* First steps for handling HDF5 filters.

* Initial step for hdf5plugin supported codecs.

* Small commit to check compression support in CI environment.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mypy complaints for hdf_filters.

* Local pre-commit fix for hdf_filters.

* Use fsspec reader_options introduced in #37.

* Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.

* Fix early return from hdf _extract_attrs.

* Test that _extract_attrs correctly handles multiple attributes.

* Initial attempt at scale and offset via numcodecs (see the numcodecs sketch at the end of this message).

* Tests for cfcodec_from_dataset.

* Temporarily relax integration tests to assert_allclose.

* Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.

* Check for compatibility with netcdf4 engine.

* Use separate fixtures for h5netcdf and netcdf4 compression styles.

* Print libhdf5 and libnetcdf versions to confirm compiled environment.

* Skip netcdf4 style compression tests when libhdf5 < 1.14.

* Include imagecodecs.numcodecs to support HDF5 lzf filters.

* Remove test that verifies call to read_kerchunk_references_from_file.

* Add additional codec support structures for imagecodecs and numcodecs.

* Add codec config test for Zstd.

* Include initial cf decoding tests.

* Revert typo for scale_factor retrieval.

* Update reader to use new numpy manifest representation.

* Temporarily skip test until blosc netcdf4 issue is solved.

* Fix Pydantic 2 migration warnings.

* Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.

* Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.

* Mamba attempt with latest imagecodecs release.

* Use correct iter_chunks callback function signature.

* Include pip based imagecodecs-numcodecs until conda-forge availability.

* Handle non-coordinate dims which are serialized to HDF as empty datasets.

* Use reader_options for filetype check and update failing kerchunk call.

* Fix chunkmanifest shaping for chunked datasets.

* Handle scale_factor attribute serialization for compressed files.

* Include chunked roundtrip fixture.

* Standardize xarray integration tests for hdf filters.

* Update reader selection logic for new filetype determination.

* Use decode_times for integration test.

* Standardize fixture names for hdf5 vs netcdf4 file types.

* Handle array add_offset property for compressed data.

* Include h5py shuffle filter.

* Make ScaleAndOffset codec last in filters list.

* Apply ScaleAndOffset codec to _FillValue since its value is now downstream.

* Coerce scale and add_offset values to native float for JSON serialization.

* Conformant ZarrV3 codecs

* Update docs

* Update virtualizarr/zarr.py

Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>

* Update virtualizarr/zarr.py

Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>

* Change default_fill to 0s

* Generate permutation

* Pythonic isinstance check

* Add return type to isconfigurable

Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>

* Changes from pair programming for ZarrV3 to kerchunk file reading

* Revert "Merge remote-tracking branch 'upstream/hdf5_reader' into codecs"

This reverts commit 7a65fbd, reversing
changes made to c051f04.

* Fix unit tests

* PR comments

* Remove kwarg in dict default

---------

Co-authored-by: sharkinsspatial <sharkinsgis@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
Co-authored-by: Tria McNeely <triamcnely@microsoft.com>
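
Several of the commits above fold CF-style scale_factor/add_offset handling into a numcodecs codec. As a rough illustration of the idea, here is a minimal sketch using numcodecs' FixedScaleOffset with made-up attribute values; it is not this PR's actual code path:

    import numcodecs
    import numpy as np

    # Hypothetical CF packing attributes, as commonly found on netCDF variables.
    scale_factor = 0.01
    add_offset = 273.15

    # FixedScaleOffset encodes as round((x - offset) * scale) and decodes as
    # x / scale + offset, so CF decoding (raw * scale_factor + add_offset)
    # corresponds to scale = 1 / scale_factor and offset = add_offset.
    codec = numcodecs.FixedScaleOffset(
        offset=add_offset,
        scale=1 / scale_factor,
        dtype="<f8",   # dtype of the decoded (in-memory) data
        astype="<i2",  # dtype of the packed (on-disk) data
    )

    packed = codec.encode(np.array([273.15, 273.16, 274.0]))
    unpacked = codec.decode(packed)  # back to float64, within rounding error
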
5 people authored Jul 22, 2024
1 parent 0ad4de5 commit 10bd53d
Showing 8 changed files with 203 additions and 46 deletions.
3 changes: 3 additions & 0 deletions docs/releases.rst
@@ -12,6 +12,9 @@ New Features
Breaking changes
~~~~~~~~~~~~~~~~

+- Serialize valid ZarrV3 metadata and require full compressor numcodec config (for :pull:`193`)
+  By `Gustavo Hidalgo <https://github.com/ghidalgo3>`_.

Deprecations
~~~~~~~~~~~~

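In practice, the breaking change means a compressor must now be spelled as a complete numcodecs config dict (the shape that codec.get_config() returns) rather than a bare codec name. A minimal sketch with illustrative values:

    import numcodecs

    # Previously accepted: compressor = "zlib"
    # Now required: the full numcodecs config for the codec.
    compressor = {"id": "zlib", "level": 1}

    codec = numcodecs.get_codec(compressor)  # -> numcodecs.Zlib(level=1)
    assert codec.get_config() == compressor
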
2 changes: 1 addition & 1 deletion virtualizarr/kerchunk.py
@@ -266,7 +266,7 @@ def variable_to_kerchunk_arr_refs(var: xr.Variable, var_name: str) -> KerchunkAr
for chunk_key, entry in marr.manifest.dict().items()
}

-zarray = marr.zarray
+zarray = marr.zarray.replace(zarr_format=2)

else:
try:
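Kerchunk references still carry Zarr v2 array metadata, so the virtual array's ZArray is switched to zarr_format=2 on the way out. A hedged sketch of what a replace method like this typically does, assuming a Pydantic-style model (the real ZArray implementation may differ):

    from pydantic import BaseModel

    class ZArraySketch(BaseModel):
        # Only the field relevant to this diff; the real model has many more.
        zarr_format: int = 3

        def replace(self, **changes) -> "ZArraySketch":
            # Return a new instance with the given fields swapped out,
            # leaving the original metadata untouched.
            return self.model_copy(update=changes)

    v2 = ZArraySketch().replace(zarr_format=2)
    assert v2.zarr_format == 2
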
4 changes: 2 additions & 2 deletions virtualizarr/tests/__init__.py
@@ -48,9 +48,9 @@ def create_manifestarray(

zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "blosc", "clevel": 5, "cname": "lz4", "shuffle": 1},
dtype=np.dtype("float32"),
-fill_value=0.0, # TODO change this to NaN?
+fill_value=0.0,
filters=None,
order="C",
shape=shape,
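The new default compressor config here resolves to a Blosc codec; shuffle=1 is numcodecs' byte-shuffle constant. A quick illustration (not part of the diff):

    import numcodecs

    codec = numcodecs.get_codec(
        {"id": "blosc", "clevel": 5, "cname": "lz4", "shuffle": 1}
    )
    assert isinstance(codec, numcodecs.Blosc)
    assert codec.shuffle == numcodecs.Blosc.SHUFFLE  # 1 == byte-shuffle
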
2 changes: 1 addition & 1 deletion virtualizarr/tests/test_integration.py
@@ -138,7 +138,7 @@ def test_non_dimension_coordinates(self, tmpdir, format):
# regression test for GH issue #105

# set up example xarray dataset containing non-dimension coordinate variables
-ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6).reshape(2, 3))})
+ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6.0).reshape(2, 3))})

# save it to disk as netCDF (in temporary directory)
ds.to_netcdf(f"{tmpdir}/non_dim_coords.nc")
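The only change here is np.arange(6) → np.arange(6.0), which flips the coordinate's dtype from integer to float; presumably this keeps the test data floating-point now that fill values default to 0 rather than NaN:

    import numpy as np

    np.arange(6).dtype    # dtype('int64') on most platforms
    np.arange(6.0).dtype  # dtype('float64')
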
12 changes: 6 additions & 6 deletions virtualizarr/tests/test_manifests/test_array.py
@@ -19,7 +19,7 @@ def test_create_manifestarray(self):
shape = (5, 2, 20)
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -74,7 +74,7 @@ def test_equals(self):
shape = (5, 2, 20)
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -95,7 +95,7 @@ def test_not_equal_chunk_entries(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 1, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -209,7 +209,7 @@ def test_concat(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 1, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -254,7 +254,7 @@ def test_stack(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -299,7 +299,7 @@ def test_refuse_combine():

zarray_common = {
"chunks": (5, 1, 10),
"compressor": "zlib",
"compressor": {"id": "zlib", "level": 1},
"dtype": np.dtype("int32"),
"fill_value": 0.0,
"filters": None,
10 changes: 5 additions & 5 deletions virtualizarr/tests/test_xarray.py
@@ -19,7 +19,7 @@ def test_wrapping():
dtype = np.dtype("int32")
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=dtype,
fill_value=0.0,
filters=None,
@@ -49,7 +49,7 @@ def test_equals(self):
shape = (5, 20)
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -86,7 +86,7 @@ def test_concat_along_existing_dim(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(1, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -133,7 +133,7 @@ def test_concat_along_new_dim(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
@@ -183,7 +183,7 @@ def test_concat_dim_coords_along_existing_dim(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(10,),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
60 changes: 54 additions & 6 deletions virtualizarr/tests/test_zarr.py
@@ -1,12 +1,17 @@
+import json

import numpy as np
+import pytest
import xarray as xr
import xarray.testing as xrt

from virtualizarr import ManifestArray, open_virtual_dataset
from virtualizarr.manifests.manifest import ChunkManifest
+from virtualizarr.zarr import dataset_to_zarr, metadata_from_zarr_json


-def test_zarr_v3_roundtrip(tmpdir):
+@pytest.fixture
+def vds_with_manifest_arrays() -> xr.Dataset:
arr = ManifestArray(
chunkmanifest=ChunkManifest(
entries={"0.0": dict(path="test.nc", offset=6144, length=48)}
@@ -15,18 +20,61 @@ def test_zarr_v3_roundtrip(tmpdir):
shape=(2, 3),
dtype=np.dtype("<i8"),
chunks=(2, 3),
-compressor=None,
+compressor={"id": "zlib", "level": 1},
filters=None,
-fill_value=np.nan,
+fill_value=0,
order="C",
zarr_format=3,
),
)
-original = xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})
+return xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})


+def isconfigurable(value: dict) -> bool:
+    """
+    Several metadata attributes in ZarrV3 use a dictionary with keys "name" : str and "configuration" : dict
+    """
+    return "name" in value and "configuration" in value

-original.virtualize.to_zarr(tmpdir / "store.zarr")

+def test_zarr_v3_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
+    vds_with_manifest_arrays.virtualize.to_zarr(tmpdir / "store.zarr")
roundtrip = open_virtual_dataset(
tmpdir / "store.zarr", filetype="zarr_v3", indexes={}
)

-xrt.assert_identical(roundtrip, original)
+xrt.assert_identical(roundtrip, vds_with_manifest_arrays)


+def test_metadata_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
+    dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
+    zarray, _, _ = metadata_from_zarr_json(tmpdir / "store.zarr/a/zarr.json")
+    assert zarray == vds_with_manifest_arrays.a.data.zarray
+
+
+def test_zarr_v3_metadata_conformance(tmpdir, vds_with_manifest_arrays: xr.Dataset):
+    """
+    Checks that the output metadata of an array variable conforms to this spec
+    for the required attributes:
+    https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#metadata
+    """
+    dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
+    # read the a variable's metadata
+    with open(tmpdir / "store.zarr/a/zarr.json", mode="r") as f:
+        metadata = json.loads(f.read())
+    assert metadata["zarr_format"] == 3
+    assert metadata["node_type"] == "array"
+    assert isinstance(metadata["shape"], list) and all(
+        isinstance(dim, int) for dim in metadata["shape"]
+    )
+    assert isinstance(metadata["data_type"], str) or isconfigurable(
+        metadata["data_type"]
+    )
+    assert isconfigurable(metadata["chunk_grid"])
+    assert isconfigurable(metadata["chunk_key_encoding"])
+    assert isinstance(metadata["fill_value"], (bool, int, float, str, list))
+    assert (
+        isinstance(metadata["codecs"], list)
+        and len(metadata["codecs"]) > 1
+        and all(isconfigurable(codec) for codec in metadata["codecs"])
+    )
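
For reference, a hand-written sketch of the ZarrV3 array metadata shape these assertions expect; the field values are illustrative, not the library's actual output:

    import json

    # Illustrative zarr.json contents for an array node; every "configurable"
    # attribute uses the {"name": ..., "configuration": ...} pattern that
    # isconfigurable() checks for.
    metadata = json.loads("""
    {
      "zarr_format": 3,
      "node_type": "array",
      "shape": [2, 3],
      "data_type": "int64",
      "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [2, 3]}},
      "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
      "fill_value": 0,
      "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "numcodecs.zlib", "configuration": {"level": 1}}
      ]
    }
    """)

    assert metadata["zarr_format"] == 3
    assert all("name" in c and "configuration" in c for c in metadata["codecs"])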