One of the main benefits of using data in HDF5 or NetCDF is that these formats are hierarchical, so the data can exist in a directory-like tree structure. This poses some challenges for our API, which relies on datasets being top-level objects in the database, when many use cases would benefit from lumping, say, imagery, hydrodynamic, or topographic data together.
Supporting groups of datasets this way will take some re-thinking of how we provide access to the data, and of what is, or is not, a variable. For example, at the moment, for NetCDF files, the io.keys() method returns the list of relevant variables exposed by the .variables attribute in netCDF. That list is useful, but HDF5 is not optimized for any particular data structure, and it does not provide a simple built-in way to walk the file and find all the datasets. So something like io.keys() on a hierarchical dataset, as written, would just return the top-level keys, be they datasets or groups of datasets.
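As a rough illustration of the difference (the file name and layout here are made up, and this is only a sketch, not how the io module currently works): h5py's keys() only lists the top level, but visititems() can walk the whole tree.
import h5py

# hypothetical file: 'eta' at the top level, 'ndvi' nested under 'imagery'
with h5py.File('some_path.h5', 'r') as f:
    print(list(f.keys()))  # top level only, e.g. ['eta', 'imagery']

    # recursively collect every dataset, regardless of nesting depth
    found = []
    f.visititems(lambda name, obj: found.append(name) if isinstance(obj, h5py.Dataset) else None)
    print(found)  # e.g. ['eta', 'imagery/ndvi']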
Additionally, if we want to support the idea of giving scales to datasets, which I think we should, we need to wrap both the way netCDF does this and the way HDF5 does this. The two are quite similar, but netCDF makes it a little more high-level, while HDF5 just has you supply a 1D array that lives as a separate dataset and is used as the dimension scale. If you looked for all keys() in such a file, you would also get a key for each scale that was supplied.
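For reference, a minimal sketch of the HDF5 side (file and variable names are made up): the scale is just another dataset, so it shows up in keys() alongside the data.
import h5py
import numpy as np

with h5py.File('scales_demo.h5', 'w') as f:
    eta = f.create_dataset('eta', data=np.zeros((10, 20)))
    x = f.create_dataset('x', data=np.arange(20.0))  # the 1D array that becomes the scale
    x.make_scale('x')              # mark it as a dimension scale
    eta.dims[1].attach_scale(x)    # attach it to the second axis of 'eta'
    print(list(f.keys()))          # ['eta', 'x'] -- the scale is a key too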
I have a short list of possibilities that would facilitate this, and that may also help us with slicing along scales and mapping from data world back to model world or experiment world.
We could make sure to allow the user to specify the full path in the directory tree to the dataset of interest. For example:
somecube = cube.DataCube('some_path.h5')
slc1 = somecube['eta']            # normal slice to top-level dataset
slc2 = somecube['imagery/ndvi']   # slice to second-level dataset
This is how h5py does it, and I like their syntax.
We could, potentially, allow users to 'register variables' similarly to how sections are registered.
Or we could just enforce that to be a valid cube dataset, all datasets have to be top level.
I dunno, these are all just ideas I'm having, and I welcome discussion on all of it.
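Regarding the 'register variables' idea above, a purely hypothetical sketch (none of these names exist in the package, and the io object is stubbed with a plain dict for illustration):
class DataCube:
    def __init__(self, io):
        self._io = io          # in practice, the open hdf5/netcdf connection
        self._registered = {}

    def register_variable(self, name, path):
        # remember where a nested dataset lives, so it can be sliced by a short name
        self._registered[name] = path

    def __getitem__(self, name):
        return self._io[self._registered.get(name, name)]

somecube = DataCube({'eta': [1, 2, 3], 'imagery/ndvi': [4, 5, 6]})
somecube.register_variable('ndvi', 'imagery/ndvi')
print(somecube['ndvi'])  # [4, 5, 6]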
That list is useful, but HDF5 is not optimized for any particular data structure, and it does not provide a simple built-in way to walk the file and find all the datasets.
Yeah, no, feel free to change anything at all with the API. This is why I keep saying we need to settle on a standard data setup, though. I still think something like the following is a good standard:
dataset
| variable1
| | x
| | y
| | t
| | data
| variable2
| | x
| | y
| | t
| | data
| metadata
| | yaml_config
| | boundary_condition1_series
We need some kind of standard to be able to write code against, though. I implemented the keys() listing because that was what worked for the data I had; it's trivial to change it to work for a different data structure, but we need a standard.
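As a rough illustration only (all names are placeholders from the tree above), that layout could be written with h5py like this:
import h5py
import numpy as np

with h5py.File('standard_layout.h5', 'w') as f:
    for var in ('variable1', 'variable2'):
        grp = f.create_group(var)
        grp.create_dataset('x', data=np.linspace(0., 1., 20))
        grp.create_dataset('y', data=np.linspace(0., 1., 10))
        grp.create_dataset('t', data=np.arange(5.))
        grp.create_dataset('data', data=np.zeros((5, 10, 20)))
    meta = f.create_group('metadata')
    meta.create_dataset('yaml_config', data='dt: 0.1\n')  # stored as a string
    meta.create_dataset('boundary_condition1_series', data=np.zeros(5))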
if we want to support the idea of giving scales to datasets
Yeah, I think we should look closely at the xarray package (based on netCDF conventions) and potentially use it as our base ndarray instead of numpy. This will help with a lot of the dimension handling, I think.
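A minimal sketch of what that buys us (file and variable names are made up): xarray carries named dimensions and coordinates with the array, so slicing can be done by label instead of by bare index.
import xarray as xr

ds = xr.open_dataset('some_output.nc')
eta = ds['eta']                             # DataArray with named dims, e.g. ('time', 'y', 'x')
last = eta.isel(time=-1)                    # last timestep, by position
near = eta.sel(x=250.0, method='nearest')   # slice by coordinate value along the 'x' scale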
slc2=somecube['imagery/ndvi']
This should be supported "automatically" by implementing the __getitem__ method for the hdf5 io class (see netcdf example here)
We would have to write a way to provide the same access for data from the netCDF file, though. Above all, the file IO should be exactly consistent for the user, whether they are connected to HDF5 or netCDF.
In general, though, I think somecube['imagery/ndvi'] is worse than somecube['imagery']['ndvi'], because to loop over everything under 'imagery' programmatically, I would need to join strings with a slash, pass the result, and then have hdf5 parse the string just to access the data. Moreover, I think the latter is much more intuitive and readable. I think this "slash" indexing could be moved to another, lower-priority issue.
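A hedged sketch of keeping chained indexing identical across backends (class names are hypothetical): each __getitem__ returns either another wrapped group or the underlying variable, so somecube['imagery']['ndvi'] behaves the same whether the file is HDF5 or netCDF.
import h5py

class HDF5IO:
    def __init__(self, handle):
        self._handle = handle  # an open h5py File or Group

    def __getitem__(self, name):
        obj = self._handle[name]
        # h5py resolves groups and datasets through the same interface
        return HDF5IO(obj) if isinstance(obj, h5py.Group) else obj

class NetCDFIO:
    def __init__(self, handle):
        self._handle = handle  # an open netCDF4 Dataset or Group

    def __getitem__(self, name):
        # netCDF4 keeps groups and variables in separate mappings
        if name in self._handle.groups:
            return NetCDFIO(self._handle.groups[name])
        return self._handle.variables[name]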