
Inefficient reading of slices of a Dataset #6

Open
jjhelmus opened this issue Sep 7, 2016 · 2 comments

Comments

@jjhelmus
Owner

jjhelmus commented Sep 7, 2016

When reading data from a Dataset, pyfive currently loads all chunks into memory before slicing the requested data. This behavior is inefficient when only a small region of the data is required, since that region could be extracted from a small number of chunks, or even a single one.

The code used for slicing dask arrays may be helpful for determining which chunks need to be read for the given slice.
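For illustration, a minimal sketch of the chunk-intersection arithmetic involved (the function name and signature here are hypothetical, not pyfive or dask API): given the requested slices, the dataset shape and the chunk shape, it yields only the chunk indices that overlap the selection.

```python
import itertools

def chunks_touched(slices, shape, chunk_shape):
    """Return the index of every chunk that intersects the requested slices."""
    ranges = []
    for sl, dim, chunk in zip(slices, shape, chunk_shape):
        start, stop, _ = sl.indices(dim)
        if stop <= start:  # empty selection along this axis
            return []
        ranges.append(range(start // chunk, (stop - 1) // chunk + 1))
    return list(itertools.product(*ranges))

# A 100x100 dataset stored in 10x10 chunks: reading [5:12, 0:3] only
# touches chunks (0, 0) and (1, 0), so just two chunks need decoding.
print(chunks_touched((slice(5, 12), slice(0, 3)), (100, 100), (10, 10)))
```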

@mrocklin

I was looking into pyfive to read cloud-hosted data (we have Python file objects for S3, GCS, Azure) and was sad to learn that slicing doesn't happen cleverly.
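The use case being described looks roughly like the following sketch (the bucket, key and dataset names are made up, and it assumes pyfive.File accepts any file-like object with read()/seek()):

```python
import s3fs
import pyfive

# Open an S3 object as a Python file object and hand it to pyfive.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("some-bucket/data.h5", "rb") as f:
    dset = pyfive.File(f)["temperature"]
    # With the current behaviour this still pulls every chunk over the
    # network before slicing; clever chunk selection would fetch only
    # the chunks that [5:12, 0:3] actually touches.
    subset = dset[5:12, 0:3]
```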

@jjhelmus
Owner Author

jjhelmus commented Jan 2, 2018

I believe the indexing code used in zarr could be adapted for use in pyfive to provide more efficient slicing.
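A rough sketch of what an adapted version might look like (illustrative only; `read_chunk` is a placeholder for whatever routine locates a chunk via the file's B-tree and decompresses it, not an existing pyfive function):

```python
import itertools
import numpy as np

def read_selection(read_chunk, slices, chunk_shape, dtype):
    """Assemble a (start:stop, step 1) selection, reading only the chunks it touches."""
    per_dim = []
    for sl, chunk in zip(slices, chunk_shape):
        start, stop = sl.start, sl.stop
        parts = []
        for c in range(start // chunk, (stop - 1) // chunk + 1):
            lo, hi = max(start, c * chunk), min(stop, (c + 1) * chunk)
            parts.append((c,
                          slice(lo - c * chunk, hi - c * chunk),  # within the chunk
                          slice(lo - start, hi - start)))         # within the output
        per_dim.append(parts)

    out = np.empty(tuple(sl.stop - sl.start for sl in slices), dtype=dtype)
    for combo in itertools.product(*per_dim):
        chunk_index = tuple(p[0] for p in combo)
        within_chunk = tuple(p[1] for p in combo)
        within_out = tuple(p[2] for p in combo)
        out[within_out] = read_chunk(chunk_index)[within_chunk]
    return out
```

This mirrors the per-dimension bookkeeping zarr performs: each touched chunk contributes a sub-slice of itself to a sub-slice of the output, so nothing outside the selection is ever read or decompressed.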

bnlawrence pushed a commit to NCAS-CMS/pyfive that referenced this issue Feb 22, 2024
eventually, hopefully, address both the needs of pyactivestorage (which needs access to the b-tree chunk index) and jjhelmus#6