
Add checksums to hdf5 datasets #4831

Merged

Conversation

@GarethCabournDavies (Contributor) commented Jul 30, 2024

Add fletcher32 checksums by default to hdf5 datasets

Standard information about the request

This is a new feature
This change affects all places where hdf5 file objects are used
This change follows style guidelines (See e.g. PEP8), has been proposed using the contribution guidelines

Motivation

This will check whether data has been corrupted in any way during the interim period from writing to loading.
From the h5py documentation:

Adds a checksum to each chunk to detect data corruption. Attempts to read corrupted chunks will fail with an error. No significant speed penalty. Obviously shouldn’t be used with lossy compression filters.
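For reference, enabling this on a single dataset in plain h5py looks like the following (a minimal sketch; the file and dataset names are illustrative):

```python
import numpy as np
import h5py

# fletcher32=True stores a Fletcher-32 checksum with each chunk;
# reading a corrupted chunk then raises an error instead of
# silently returning bad data.
with h5py.File("example.hdf", "w") as f:
    f.create_dataset("data", data=np.arange(1000.0), fletcher32=True)
```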

Implementation

Add a wrapper around the h5py Group object which does two things:

  • Reimplements the create_group function so that it points to this wrapper
  • Adds the fletcher32 keyword to the create_dataset method

Then make the HFile class inherit from this wrapper, so that these methods are used throughout the file; a sketch of the idea follows.
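A minimal sketch (simplified relative to the actual change: here create_group re-wraps the returned group via its low-level id, rather than copying the h5py source as the PR does):

```python
import h5py

class ChecksummedGroup(h5py.Group):
    """Group whose new datasets get fletcher32 checksums."""

    def create_dataset(self, name, *args, **kwargs):
        # Turn the checksum filter on unless the caller overrides it
        kwargs.setdefault('fletcher32', True)
        return super().create_dataset(name, *args, **kwargs)

    def create_group(self, name):
        # Re-wrap the new group so nested groups are also checksummed
        grp = super().create_group(name)
        return ChecksummedGroup(grp.id)

class HFile(ChecksummedGroup, h5py.File):
    """File class that picks up the checksummed creation methods."""
```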

Links to any issues or associated PRs

#1525 - (I was sorting issues in order, oldest first)

Testing performed

Ran fit_sngls_over_multiparam - which uses both direct assignment on the file object as well as create_group() and assignment to the resulting group - and checked whether the output files are identical.

  • File hashes did not match (as expected)
  • All attributes and datasets were identical (as expected)
  • Running h5ls -rv on the output shows that every dataset has the line Filter-0: fletcher32-3 {}
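A comparison of that kind can be done with something like the following (a sketch, not the exact script used; the helper name is made up):

```python
import h5py
import numpy as np

def datasets_match(path_a, path_b):
    """Check that two HDF5 files hold the same datasets and values,
    ignoring storage details (filters, chunking) that change the
    byte-level file hash."""
    with h5py.File(path_a, 'r') as fa, h5py.File(path_b, 'r') as fb:
        names_a, names_b = set(), set()
        fa.visit(names_a.add)
        fb.visit(names_b.add)
        if names_a != names_b:
            return False
        for name in names_a:
            obj = fa[name]
            if isinstance(obj, h5py.Dataset):
                if not np.array_equal(obj[()], fb[name][()]):
                    return False
        return True
```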

The same test on pycbc_add_statmap gave the same result.

Additional notes

Some notes about the implementation:

  • I don't really like how much of the create_group method's code is copied from the h5py source code, but I couldn't see a nicer implementation.
  • The fact that h5py.File just inherits from h5py.Group made this much easier!

If wanted, I can also add compression to the create_dataset wrapper, and then we could replace the corresponding code in pycbc_coinc_statmap, though I'm not sure of the I/O penalty for that.
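Building on the ChecksummedGroup sketch above, such an extension could look like this (hypothetical; the gzip level is illustrative, and gzip is safe to combine with fletcher32 because it is lossless):

```python
class CompressedChecksummedGroup(ChecksummedGroup):
    """Hypothetical variant that also compresses new datasets."""

    def create_dataset(self, name, *args, **kwargs):
        # Lossless gzip compression alongside the checksum filter
        kwargs.setdefault('compression', 'gzip')
        kwargs.setdefault('compression_opts', 4)  # moderate gzip level
        return super().create_dataset(name, *args, **kwargs)
```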

  • The author of this pull request confirms they will adhere to the code of conduct

@GarethCabournDavies (Contributor, Author)

The codeclimate issue is basically because I wanted to match the source code as closely as possible.

However, this just looks like the HDF5 file locking mechanism, and if we just want to bypass that, I can remove it.

@spxiwh (Contributor) left a comment


My concern with this is that we've copied a block of code from h5py's Group.create_group into here. What happens if h5py changes their own code? However, because h5py uses explicit class names, I don't see a better way to do this.

@GarethCabournDavies (Contributor, Author)

That's my concern as well - I will try to keep on top of any changes to that code (I've set up a daily checker for changes to the GitHub commit history).
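Something along these lines (hypothetical: the URL reflects the current layout of the h5py repository, and the recorded hash is a placeholder):

```python
import hashlib
import urllib.request

# Hash of h5py's group.py recorded when the block was copied
# (placeholder value)
KNOWN_SHA256 = "..."
UPSTREAM_URL = ("https://raw.githubusercontent.com/h5py/h5py/"
                "master/h5py/_hl/group.py")

def upstream_changed():
    """Return True if the upstream h5py group source has changed
    since the copied block was last reviewed."""
    with urllib.request.urlopen(UPSTREAM_URL) as resp:
        current = hashlib.sha256(resp.read()).hexdigest()
    return current != KNOWN_SHA256
```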

@GarethCabournDavies GarethCabournDavies merged commit 235c03a into gwastro:master Aug 2, 2024
29 of 30 checks passed
@GarethCabournDavies GarethCabournDavies deleted the hfile_checksums branch August 2, 2024 14:57
@titodalcanton (Contributor)

> This will check whether data has been corrupted in any way during

…during what?

@GarethCabournDavies (Contributor, Author)

I missed that I hadn't finished the sentence!

... during the time between write and read
