Split kerchunk reader up #261

TomNicholas · 2024-10-17T23:57:27Z

Instead of treating all uses of kerchunk references as one "reader", this PR splits them up so that each filetype gets its own "reader", even if many of those readers call kerchunk code.

This means every "reader" is now a separate definition of a function open_virtual_dataset, which the top-level open_virtual_dataset picks between with one big match case statement. Generalizing that match to be pluggable by third-party libraries would close #245.

This also makes the structure of our dependence on kerchunk much clearer - @mpiannucci it should now be a lot easier for you to try out the kerchunk-to-icechunk use case mentioned in #258 (comment).

Closes #257 (Ensure every reader uses dataset_from_kerchunk_refs before returning a virtual dataset), addresses #258 (comment) (Make kerchunk dependency entirely optional), and is a big step towards #245 (Make readers pluggable via entrypoint system).
~~Tests added~~
Tests passing
Full type hint coverage
Changes are documented in docs/releases.rst
~~New functions/methods are listed in api.rst~~
~~New functionality has documentation~~

cc @norlandrhagen @sharkinsspatial @ghidalgo3

TomNicholas · 2024-10-18T01:43:05Z

Note to self: I should just use an ABC with an open_virtual_dataset method defined at this point. That would standardize the interface, which would help shut mypy up, and it would bring us closer to what we would need for an entrypoint system.

EDIT: done

TomNicholas · 2024-10-18T03:29:26Z

virtualizarr/readers/kerchunk.py

-from pathlib import Path
-from typing import Any, MutableMapping, Optional, cast
+from typing import Iterable, Mapping, Optional

+import ujson
 from xarray import Dataset
 from xarray.core.indexes import Index
-from xarray.core.variable import Variable

-from virtualizarr.backend import FileType, separate_coords
-from virtualizarr.manifests import ChunkManifest, ManifestArray
+from virtualizarr.readers.common import VirtualBackend
+from virtualizarr.translators.kerchunk import dataset_from_kerchunk_refs
 from virtualizarr.types.kerchunk import (
-    KerchunkArrRefs,
    KerchunkStoreRefs,
 )
 from virtualizarr.utils import _FsspecFSFromFilepath


Notice the kerchunk reader never imports kerchunk @mpiannucci
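
To illustrate why no kerchunk import is needed: kerchunk references in JSON form can be read with any JSON parser. The sketch below uses the standard-library `json` module rather than `ujson`, and the reference contents and the helper name `chunk_entries` are made up for illustration.

```python
import json

# A tiny, made-up set of kerchunk-style references serialized as JSON.
refs_json = json.dumps(
    {
        "version": 1,
        "refs": {
            ".zgroup": '{"zarr_format": 2}',
            "air/.zarray": '{"shape": [4], "chunks": [2], "dtype": "<f8"}',
            "air/0": ["s3://bucket/air.nc", 1000, 16],
            "air/1": ["s3://bucket/air.nc", 1016, 16],
        },
    }
)


def chunk_entries(refs_text: str) -> dict:
    """Return {chunk key: (path, offset, length)} for every byte-range reference."""
    refs = json.loads(refs_text)["refs"]
    return {
        key: tuple(value)
        for key, value in refs.items()
        # metadata entries are strings; chunk references are [path, offset, length]
        if isinstance(value, list) and len(value) == 3
    }


entries = chunk_entries(refs_json)
```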

TomNicholas · 2024-10-18T03:30:48Z

virtualizarr/readers/common.py

+class VirtualBackend(ABC):
+    @staticmethod
+    def open_virtual_dataset(


I'm not sure that a @staticmethod is the best way to do this but for now it's just an internal implementation detail. The important thing is that the open_virtual_dataset function signature gets standardized by this approach.
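
A minimal sketch of that pattern follows, assuming a base class whose static method raises `NotImplementedError` and a subclass that overrides it with the same signature. The parameter list, the `dict` return type, and `ToyVirtualBackend` are hypothetical simplifications; the real method returns an xarray Dataset and takes more arguments.

```python
from abc import ABC
from typing import Iterable, Optional


class VirtualBackend(ABC):
    """Base class whose only job is to pin down the shared signature."""

    @staticmethod
    def open_virtual_dataset(
        filepath: str,
        drop_variables: Optional[Iterable[str]] = None,
    ) -> dict:
        raise NotImplementedError()


class ToyVirtualBackend(VirtualBackend):
    """Hypothetical reader; a real one would parse the file at filepath."""

    @staticmethod
    def open_virtual_dataset(
        filepath: str,
        drop_variables: Optional[Iterable[str]] = None,
    ) -> dict:
        dropped = sorted(drop_variables or [])
        return {"filepath": filepath, "dropped": dropped}


# A static method is called on the class itself; no instance is created.
result = ToyVirtualBackend.open_virtual_dataset("air.nc", drop_variables=["time"])
```

Because every subclass overrides the same signature, a type checker such as mypy can flag a reader whose arguments drift from the base class.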

TomNicholas · 2024-10-18T13:57:07Z

@keewis you might have opinions on this PR - I'm trying to move towards a system of virtual "backend readers" like xarray has backends. My proposal (see #245) is to eventually have one actual xarray backend that calls virtualizarr, and virtualizarr calls one of a number of registered virtual backends depending on the filetype.

TomNicholas · 2024-10-18T18:31:04Z

Note: ensure warning is raised on decode_cf

virtualizarr/backend.py

keewis · 2024-10-18T20:48:22Z

> you might have opinions on this PR

I'm still somewhat split on the idea to create a virtualizarr backend for xr.open_dataset: I think of xr.open_dataset as a way to open a dataset (i.e. metadata and data), while with virtualizarr we're not actually opening the dataset, we're just reading the metadata and the chunk locations. I'm aware that this distinction is somewhat blurred if you can load the actual data using the virtual dataset, but still I feel somewhat uncomfortable with that idea (no strong opposition, though).

However, I totally support the creation of a plugin system for open_virtual_dataset, as that means that the code for other packages does not have to live in the base package.

For an actual code review I'd have to read the changes you're proposing here, which I will try to do at some point during the weekend (but the PR is huge, so it might take me some time).

TomNicholas · 2024-10-18T20:52:29Z

> I'm still somewhat split on the idea to create a virtualizarr backend for xr.open_dataset: I think of xr.open_dataset as a way to open a dataset (i.e. metadata and data), while with virtualizarr we're not actually opening the dataset, we're just reading the metadata and the chunk locations. I'm aware that this distinction is somewhat blurred if you can load the actual data using the virtual dataset, but still I feel somewhat uncomfortable with that idea (no strong opposition, though).

I agree actually. I'm still not sure that it's a good idea. Maybe open_virtual_dataset is sufficient.

> However, I totally support the creation of a plugin system for open_virtual_dataset, as that means that the code for other packages does not have to live in the base package.

👍

> For an actual code review I'd have to read the changes you're proposing here, which I will try to do at some point during the weekend (but the PR is huge, so it might take me some time).

No worries! I'm pretty sure this is fine, it can also always be refactored further.

keewis · 2024-10-18T20:53:26Z

(the failing tests are caused by a bad merge, as the changes to variable_from_kerchunk_refs from #260 disappeared)

TomNicholas · 2024-10-19T00:12:45Z

> the failing tests are caused by a bad merge

Thanks for that heads up!

> Note: ensure warning is raised on decode_cf

Actually the warning should be on cf_variables. decode_times should be passed through, which it now is. But I'm confused why that didn't cause failures...

TomNicholas · 2024-10-19T00:38:00Z

virtualizarr/backend.py

+# TODO add entrypoint to allow external libraries to add to this mapping
+VIRTUAL_BACKENDS = {
+    "kerchunk": KerchunkVirtualBackend,
+    "zarr_v3": ZarrV3VirtualBackend,
+    "dmrpp": DMRPPVirtualBackend,
+    # all the below call one of the kerchunk backends internally (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
+    "netcdf3": NetCDF3VirtualBackend,
+    "hdf5": HDF5VirtualBackend,
+    "netcdf4": HDF5VirtualBackend,  # note this is the same as for hdf5
+    "tiff": TIFFVirtualBackend,
+    "fits": FITSVirtualBackend,
+}


@maxrjones I think this is what you were suggesting.
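
A hedged sketch of how a mapping like this can replace the `match` statement: look the filetype up, then call the shared static method. The backend classes below are stripped-down stand-ins with a one-argument signature and string return values, not the real VirtualiZarr classes.

```python
class VirtualBackend:
    @staticmethod
    def open_virtual_dataset(filepath: str) -> str:
        raise NotImplementedError()


class NetCDF3VirtualBackend(VirtualBackend):
    @staticmethod
    def open_virtual_dataset(filepath: str) -> str:
        return f"netcdf3 backend opened {filepath}"


class HDF5VirtualBackend(VirtualBackend):
    @staticmethod
    def open_virtual_dataset(filepath: str) -> str:
        return f"hdf5 backend opened {filepath}"


# Filetype name -> backend class; several names may share one backend.
VIRTUAL_BACKENDS = {
    "netcdf3": NetCDF3VirtualBackend,
    "hdf5": HDF5VirtualBackend,
    "netcdf4": HDF5VirtualBackend,
}


def open_virtual_dataset(filepath: str, filetype: str) -> str:
    """Dispatch by dictionary lookup instead of a match statement."""
    backend_cls = VIRTUAL_BACKENDS.get(filetype.lower())
    if backend_cls is None:
        raise NotImplementedError(f"Unsupported file type: {filetype}")
    return backend_cls.open_virtual_dataset(filepath)
```

With this shape, adding a backend only means adding a dictionary entry, which is the kind of hook an entrypoint system could fill in from external libraries.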

TomNicholas · 2024-10-19T00:42:01Z

I'm going to merge this now because I know @sharkinsspatial and @ayushnag would like to work off of it, but @keewis feel free to add any comments / thoughts here and I will address them in follow-up PRs.

TomNicholas added 5 commits October 17, 2024 17:16

standardize zarr v3 and dmrpp readers behind dedicated open_virtual_dataset functions

89ff49a

refactor hdf5 reader behind open_virtual_dataset function

8d6c42a

refactor netcdf3

eb7444e

refactor tiff

bb39907

refactor fits

97fc588

TomNicholas added the labels Kerchunk (Relating to the kerchunk library / specification itself), references generation (Reading byte ranges from archival files), and internals on Oct 17, 2024

TomNicholas temporarily deployed to test-release October 17, 2024 23:58 — with GitHub Actions Inactive

TomNicholas added 2 commits October 17, 2024 22:33

refactored so create VirtualBackends

2e197e2

restore backend.py, but keep readers/common.py

f29d2ff

TomNicholas had a problem deploying to test-release October 18, 2024 02:50 — with GitHub Actions Failure

oops I deleted a file

5a8b18e

TomNicholas temporarily deployed to test-release October 18, 2024 02:51 — with GitHub Actions Inactive

[pre-commit.ci] auto fixes from pre-commit.com hooks

84330f0

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release October 18, 2024 02:52 Inactive

TomNicholas added 4 commits October 17, 2024 22:59

standardize open_virtual_dataset method signature, and raise NotImplemented

0e2fa71

fix bug with zarr reader

996d81a

remove todo

bf71ae3

Merge branch 'split_kerchunk_reader' of https://github.com/TomNicholas/VirtualiZarr into split_kerchunk_reader

79477a9

TomNicholas temporarily deployed to test-release October 18, 2024 03:11 — with GitHub Actions Inactive

make open_virtual_dataset a staticmethod

fc2f3bc

TomNicholas temporarily deployed to test-release October 18, 2024 03:28 — with GitHub Actions Inactive

TomNicholas added 4 commits October 17, 2024 23:40

try to fix mypy error about importing DataTree from versions of xarray where it doesn't exist

d955e1a

mypy

4c5a2bb

sanitize drop_variables and loadable_variables

e592933

implement drop_variables for kerchunk reader

6a2179e

TomNicholas temporarily deployed to test-release October 18, 2024 04:42 — with GitHub Actions Inactive

TomNicholas added 2 commits October 18, 2024 00:47

pass all arguments to kerchunk reader

6bafd5b

coerce kerchunk refs to our types

b41e5d8

TomNicholas temporarily deployed to test-release October 18, 2024 04:48 — with GitHub Actions Inactive

TomNicholas mentioned this pull request Oct 18, 2024

Make kerchunk dependency entirely optional #258

Closed

TomNicholas added 2 commits October 18, 2024 16:42

make sure all readers are passed the same set of args

bf78b84

Merge branch 'main' into split_kerchunk_reader

0ae7437

TomNicholas temporarily deployed to test-release October 18, 2024 20:45 — with GitHub Actions Inactive

TomNicholas added 2 commits October 18, 2024 19:29

fix bad merge, and refactor determine_chunk_grid_shape a bit

180a0fd

Merge branch 'split_kerchunk_reader' of https://github.com/TomNicholas/VirtualiZarr into split_kerchunk_reader

8b987c6

TomNicholas temporarily deployed to test-release October 18, 2024 23:33 — with GitHub Actions Inactive

ensure decode_times is passed to each reader

edf0372

TomNicholas temporarily deployed to test-release October 19, 2024 00:11 — with GitHub Actions Inactive

TomNicholas added 2 commits October 18, 2024 20:30

remove match case statement in favour of mapping

55d152d

ensure optional dependencies aren't imported

f6c75da

TomNicholas temporarily deployed to test-release October 19, 2024 00:37 — with GitHub Actions Inactive

release note

7995b1c

TomNicholas temporarily deployed to test-release October 19, 2024 00:40 — with GitHub Actions Inactive

TomNicholas merged commit 29ca4ac into main Oct 19, 2024
10 checks passed

TomNicholas deleted the split_kerchunk_reader branch October 19, 2024 00:42

sharkinsspatial mentioned this pull request Oct 21, 2024

[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87

Open
