Issue6: support for loading only chunks needed for a selection #58
Conversation
…ven't got any tests around this yet.
Ok, well, there's an embarrassing error in that code, insofar as the new chunking code isn't used at all due to this line:
which now should be
but isn't in this pull request. When it is corrected, some unit tests start to fail, which I'll need to address before updating this pull request. Apologies. (I guess the good news is that my expectation that the existing unit tests ought to test this code is at least partially correct.)
…changes to actually using the filter pipeline. At this point it is failing test_reference.
(Just one test still failing, which I need to investigate.)
I don't understand what HDF references are, so I have bypassed the "chunk by chunk" code when the dtype is a tuple. Hopefully someone who looks at this pull request can decide whether that is the right thing to do, or can improve on what is done here.
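A minimal sketch of that bypass, assuming hypothetical helper names (getitem, read_selected_chunks); it only illustrates the condition being tested, since pyfive reports reference-like dtypes as tuples rather than numpy dtypes:

# Hypothetical sketch of the fallback described above, not the actual code in
# this branch: if pyfive reports the dtype as a tuple (as it does for
# reference-like types), skip the chunk-by-chunk path and read everything.
def getitem(dataobjects, selection):
    if isinstance(dataobjects.dtype, tuple):
        # Reference-like dtype: read the whole dataset as before, then subset.
        return dataobjects.get_data()[selection]
    # Regular dtype: read only the chunks that intersect the selection.
    return read_selected_chunks(dataobjects, selection)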
There are other dtypes that get read into a tuple in pyfive, like VLEN_STRING and VLEN_SEQUENCE (see https://github.com/jjhelmus/pyfive/blob/master/pyfive/dataobjects.py#L199), but they are not currently handled when reading Datasets, only Attributes. These are also pointer addresses, but the dereferencing algorithm is slightly different from that for Reference objects, I think (see the dereferencing code inline in the attribute reader mentioned above).
…lso remove list definition which breaks references.
@bmaranville Ta. I am not sure whether it is worth persevering with that, insofar as the current implementation in this branch simply does what it did before, and it's not obvious to me that you can avoid loading the entire set of references anyway, so there may be no benefit to chunk-by-chunk reading in that case?
I agree: you can read reference datasets chunk by chunk if you want, but since each element is a pointer to another group or dataset, there isn't usually much performance benefit to having them in a chunked array (the data pointed to is probably much bigger than the pointers?).
I'm going to refactor this to respect as much of the new API that h5py introduced in v3 as possible. Details are here: NCAS-CMS#4. It'd be good to know how, when that's done, we can get this into the main branch!
Meanwhile, I'll close this pull request and generate a new one. |
This code supports data access of the form
dataset[4:6,12:15]
to read only the chunks of a chunked dataset needed to satisfy the request. It does not make any changes to behaviour for contiguous data access. It makes use of a copy of the Zarr Indexer, which is pulled out of Zarr to avoid a dependency on a large and active package, together with a cached copy of the b-tree index (cached per dataset). There is a new subclass of
Dataobjects
to facilitate the data access. It is not clear whether new unit tests are necessary to test this, insofar as the existing unit tests do a getitem on chunked datasets anyway. It provides a public API to the chunk addresses required by an indexed selection, to support the requirements of another package where we are also trying to avoid dependencies on Zarr.
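As an illustration of what the indexing step does (a simplified sketch, not the Zarr Indexer code actually vendored in this branch): map a slice selection onto the chunk grid and take the Cartesian product of the per-dimension chunk ranges, so only those chunks need to be looked up in the b-tree and decompressed.

import itertools

# Simplified sketch: which chunk coordinates does a slice selection touch?
def chunks_for_selection(selection, shape, chunk_shape):
    per_dim = []
    for sl, size, csize in zip(selection, shape, chunk_shape):
        start, stop, _ = sl.indices(size)
        if stop <= start:
            return []  # empty selection in this dimension
        per_dim.append(range(start // csize, (stop - 1) // csize + 1))
    # Cartesian product of per-dimension ranges gives the chunk grid indices.
    return list(itertools.product(*per_dim))

# dataset[4:6, 12:15] on a (100, 100) dataset with (10, 10) chunks touches one chunk:
print(chunks_for_selection((slice(4, 6), slice(12, 15)), (100, 100), (10, 10)))
# -> [(0, 1)]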
It addresses most of the requirements of #6, but I didn't dare address the contiguous data memory mapper ...
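To make the public chunk-address API concrete, here is a hedged sketch of the kind of interface described above; the names (get_chunk_addresses, ChunkInfo, the private helpers) are assumptions for illustration, not the actual API added in this pull request.

from collections import namedtuple

# Hypothetical illustration only: field and method names are assumptions.
ChunkInfo = namedtuple("ChunkInfo", ["chunk_index", "offset", "size", "filter_mask"])

def get_chunk_addresses(dataset, selection):
    """Yield the file offset and size of every chunk needed to satisfy
    ``selection``, using the per-dataset cached b-tree index, so a caller
    can fetch those byte ranges itself without depending on Zarr."""
    btree_index = dataset._cached_btree_index()          # assumed per-dataset cache
    for chunk_index in dataset._chunks_for_selection(selection):
        offset, size, filter_mask = btree_index[chunk_index]
        yield ChunkInfo(chunk_index, offset, size, filter_mask)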