
Inefficient reading of slices of a Dataset #6

Open
jjhelmus opened this issue Sep 7, 2016 · 2 comments

Comments

@jjhelmus
Owner

jjhelmus commented Sep 7, 2016

When reading data from a Dataset, pyfive currently loads all chunks into memory before slicing the requested data. This behavior is inefficient when only a small region of the data is required, since that region could be extracted from a small number of chunks, or even a single one.

The code used for slicing dask arrays may be helpful for determining which chunks need to be read for the given slice.
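For illustration, a minimal sketch of the chunk-intersection arithmetic involved (the function name and signature here are hypothetical, not pyfive or dask API): given the requested slices, the dataset shape and the chunk shape, it yields only the chunk indices that overlap the selection.

```python
import itertools

def chunks_touched(slices, shape, chunk_shape):
    """Return the index of every chunk that intersects the requested slices."""
    ranges = []
    for sl, dim, chunk in zip(slices, shape, chunk_shape):
        start, stop, _ = sl.indices(dim)
        if stop <= start:  # empty selection along this axis
            return []
        ranges.append(range(start // chunk, (stop - 1) // chunk + 1))
    return list(itertools.product(*ranges))

# A 100x100 dataset stored in 10x10 chunks: reading [5:12, 0:3] only
# touches chunks (0, 0) and (1, 0), so just two chunks need decoding.
print(chunks_touched((slice(5, 12), slice(0, 3)), (100, 100), (10, 10)))
```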

@mrocklin

I was looking into pyfive to read cloud-hosted data (we have Python file objects for S3, GCS, Azure) and was sad to learn that slicing doesn't happen cleverly.
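The use case being described looks roughly like the following sketch (the bucket, key and dataset names are made up, and it assumes pyfive.File accepts any file-like object with read()/seek()):

```python
import s3fs
import pyfive

# Open an S3 object as a Python file object and hand it to pyfive.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("some-bucket/data.h5", "rb") as f:
    dset = pyfive.File(f)["temperature"]
    # With the current behaviour this still pulls every chunk over the
    # network before slicing; clever chunk selection would fetch only
    # the chunks that [5:12, 0:3] actually touches.
    subset = dset[5:12, 0:3]
```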

@jjhelmus
Owner Author

jjhelmus commented Jan 2, 2018

I believe the indexing code used in zarr could be adapted for use in pyfive to provide more efficient slicing.
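A rough sketch of what an adapted version might look like (illustrative only; `read_chunk` is a placeholder for whatever routine locates a chunk via the file's B-tree and decompresses it, not an existing pyfive function):

```python
import itertools
import numpy as np

def read_selection(read_chunk, slices, chunk_shape, dtype):
    """Assemble a (start:stop, step 1) selection, reading only the chunks it touches."""
    per_dim = []
    for sl, chunk in zip(slices, chunk_shape):
        start, stop = sl.start, sl.stop
        parts = []
        for c in range(start // chunk, (stop - 1) // chunk + 1):
            lo, hi = max(start, c * chunk), min(stop, (c + 1) * chunk)
            parts.append((c,
                          slice(lo - c * chunk, hi - c * chunk),  # within the chunk
                          slice(lo - start, hi - start)))         # within the output
        per_dim.append(parts)

    out = np.empty(tuple(sl.stop - sl.start for sl in slices), dtype=dtype)
    for combo in itertools.product(*per_dim):
        chunk_index = tuple(p[0] for p in combo)
        within_chunk = tuple(p[1] for p in combo)
        within_out = tuple(p[2] for p in combo)
        out[within_out] = read_chunk(chunk_index)[within_chunk]
    return out
```

This mirrors the per-dimension bookkeeping zarr performs: each touched chunk contributes a sub-slice of itself to a sub-slice of the output, so nothing outside the selection is ever read or decompressed.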

bnlawrence pushed a commit to NCAS-CMS/pyfive that referenced this issue Feb 22, 2024
eventually, hopefully, address both the needs of pyactivestorage (which needs access to the b-tree chunk index) and jjhelmus#6