Issue6: support for loading only chunks needed for a selection #58
Conversation
…ven't got any tests around this yet.
Ok, well, there's an embarrassing error in that code, insofar as the new chunking code isn't used at all due to this line:
which now should be
but isn't in this pull request. When it is corrected, some unit tests start to fail, which I'll need to address before updating this pull request. Apologies. (I guess the good news is that my expectation that the existing unit tests ought to test this code is at least partially correct.)
…changes to actually using the filter pipeline. At this point it is failing test_reference.
(Just one test still failing, which I need to investigate.)
I don't understand what HDF references are, so I have bypassed the "chunk by chunk" code when the dtype is a tuple. Hopefully someone who looks at this pull request can decide whether that is the right thing to do, or can improve on what is done here.
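A minimal sketch of that bypass, assuming hypothetical helper names (getitem, read_selected_chunks); it only illustrates the condition being tested, since pyfive reports reference-like dtypes as tuples rather than numpy dtypes:

# Hypothetical sketch of the fallback described above, not the actual code in
# this branch: if pyfive reports the dtype as a tuple (as it does for
# reference-like types), skip the chunk-by-chunk path and read everything.
def getitem(dataobjects, selection):
    if isinstance(dataobjects.dtype, tuple):
        # Reference-like dtype: read the whole dataset as before, then subset.
        return dataobjects.get_data()[selection]
    # Regular dtype: read only the chunks that intersect the selection.
    return read_selected_chunks(dataobjects, selection)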
There are other dtypes that get read into a tuple in pyfive, like VLEN_STRING and VLEN_SEQUENCE (see https://github.com/jjhelmus/pyfive/blob/master/pyfive/dataobjects.py#L199), but they are not currently handled when reading Datasets, only Attributes. These are also pointer addresses, but the dereferencing algorithm is slightly different from that for Reference objects, I think (see the dereferencing code inline in the attribute reader mentioned above).
…lso remove list definition which breaks references.
@bmaranville Ta. I am not sure whether it is worth persevering with that, insofar as the current implementation in this branch simply does what it did before, and it's not obvious to me that you can avoid loading the entire set of references anyway, so there may be no benefit to chunk-by-chunk reading in that case?
I agree: you can read reference datasets chunk by chunk if you want, but since each element is a pointer to another group or dataset, there isn't usually much performance benefit to having them in a chunked array (the data pointed to is probably much bigger than the pointers?).
I'm going to refactor this to respect as much of the new API that h5py introduced in v3 as possible. Details are here: NCAS-CMS#4. It'd be good to know how, when that's done, we can get this into the main branch!
Meanwhile, I'll close this pull request and generate a new one. |
This code supports data access of the form
dataset[4:6,12:15]
to read only the chunks of a chunked dataset needed to satisfy the request. It does not make any changes to behaviour for contiguous data access. It makes use of a copy of the Zarr Indexer, which is pulled out of Zarr to avoid a dependency on a large and active package, together with a cached copy of the b-tree index (cached per dataset). There is a new subclass of
Dataobjects
to facilitate the data access. It is not clear whether new unit tests are necessary to test this, insofar as the existing unit tests do a getitem on chunked datasets anyway. It provides a public API to the chunk addresses required by an indexed selection, to support the requirements of another package where we are also trying to avoid dependencies on Zarr.
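As an illustration of what the indexing step does (a simplified sketch, not the Zarr Indexer code actually vendored in this branch): map a slice selection onto the chunk grid and take the Cartesian product of the per-dimension chunk ranges, so only those chunks need to be looked up in the b-tree and decompressed.

import itertools

# Simplified sketch: which chunk coordinates does a slice selection touch?
def chunks_for_selection(selection, shape, chunk_shape):
    per_dim = []
    for sl, size, csize in zip(selection, shape, chunk_shape):
        start, stop, _ = sl.indices(size)
        if stop <= start:
            return []  # empty selection in this dimension
        per_dim.append(range(start // csize, (stop - 1) // csize + 1))
    # Cartesian product of per-dimension ranges gives the chunk grid indices.
    return list(itertools.product(*per_dim))

# dataset[4:6, 12:15] on a (100, 100) dataset with (10, 10) chunks touches one chunk:
print(chunks_for_selection((slice(4, 6), slice(12, 15)), (100, 100), (10, 10)))
# -> [(0, 1)]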
It addresses most of the requirements of #6, but I didn't dare address the contiguous data memory mapper ...
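To make the public chunk-address API concrete, here is a hedged sketch of the kind of interface described above; the names (get_chunk_addresses, ChunkInfo, the private helpers) are assumptions for illustration, not the actual API added in this pull request.

from collections import namedtuple

# Hypothetical illustration only: field and method names are assumptions.
ChunkInfo = namedtuple("ChunkInfo", ["chunk_index", "offset", "size", "filter_mask"])

def get_chunk_addresses(dataset, selection):
    """Yield the file offset and size of every chunk needed to satisfy
    ``selection``, using the per-dataset cached b-tree index, so a caller
    can fetch those byte ranges itself without depending on Zarr."""
    btree_index = dataset._cached_btree_index()          # assumed per-dataset cache
    for chunk_index in dataset._chunks_for_selection(selection):
        offset, size, filter_mask = btree_index[chunk_index]
        yield ChunkInfo(chunk_index, offset, size, filter_mask)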