
Add checksums to hdf5 datasets #4831

Merged

Conversation

@GarethCabournDavies (Contributor) commented Jul 30, 2024

Add fletcher32 checksums by default to hdf5 datasets

Standard information about the request

This is a new feature
This change affects all places where hdf5 file objects are used
This change follows style guidelines (See e.g. PEP8), has been proposed using the contribution guidelines

Motivation

This will check whether data has been corrupted in any way during the interim period from writing to loading.
From the h5py documentation:

Adds a checksum to each chunk to detect data corruption. Attempts to read corrupted chunks will fail with an error. No significant speed penalty. Obviously shouldn’t be used with lossy compression filters.
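For reference, enabling this on a single dataset in plain h5py looks like the following (a minimal sketch; the file and dataset names are illustrative):

```python
import numpy as np
import h5py

# fletcher32=True stores a Fletcher-32 checksum with each chunk;
# reading a corrupted chunk then raises an error instead of
# silently returning bad data.
with h5py.File("example.hdf", "w") as f:
    f.create_dataset("data", data=np.arange(1000.0), fletcher32=True)
```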

Implementation

Add a wrapper around the h5py Group object which does two things:

  • Reimplements the create_group function so that it points to this wrapper
  • Adds the fletcher32 keyword to the create_dataset method

Then make the HFile class inherit from this wrapper, so that these methods are used throughout the file; a sketch of the idea follows.
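A minimal sketch (simplified relative to the actual change: here create_group re-wraps the returned group via its low-level id, rather than copying the h5py source as the PR does):

```python
import h5py

class ChecksummedGroup(h5py.Group):
    """Group whose new datasets get fletcher32 checksums."""

    def create_dataset(self, name, *args, **kwargs):
        # Turn the checksum filter on unless the caller overrides it
        kwargs.setdefault('fletcher32', True)
        return super().create_dataset(name, *args, **kwargs)

    def create_group(self, name):
        # Re-wrap the new group so nested groups are also checksummed
        grp = super().create_group(name)
        return ChecksummedGroup(grp.id)

class HFile(ChecksummedGroup, h5py.File):
    """File class that picks up the checksummed creation methods."""
```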

Links to any issues or associated PRs

#1525 - (I was sorting issues in order, oldest first)

Testing performed

Ran fit_sngls_over_multiparam - which uses both direct assignment on the file object as well as create_group() and assignment to the resulting group - and checked whether the output files are identical.

  • File hashes did not match (as expected)
  • All attributes and datasets were identical (as expected)
  • Running h5ls -rv on the output shows that every dataset has the line Filter-0: fletcher32-3 {}
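A comparison of that kind can be done with something like the following (a sketch, not the exact script used; the helper name is made up):

```python
import h5py
import numpy as np

def datasets_match(path_a, path_b):
    """Check that two HDF5 files hold the same datasets and values,
    ignoring storage details (filters, chunking) that change the
    byte-level file hash."""
    with h5py.File(path_a, 'r') as fa, h5py.File(path_b, 'r') as fb:
        names_a, names_b = set(), set()
        fa.visit(names_a.add)
        fb.visit(names_b.add)
        if names_a != names_b:
            return False
        for name in names_a:
            obj = fa[name]
            if isinstance(obj, h5py.Dataset):
                if not np.array_equal(obj[()], fb[name][()]):
                    return False
        return True
```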

The same test on pycbc_add_statmap gave the same result.

Additional notes

Some notes about the implementation:

  • I don't really like how much of the create_group method's code is copied from the h5py source code, but I couldn't see a nicer implementation.
  • The fact that h5py.File just inherits from h5py.Group made this much easier!

If wanted, I can also add compression to the create_dataset wrapper, and then we could replace the corresponding code in pycbc_coinc_statmap, though I'm not sure of the I/O penalty for that.
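Building on the ChecksummedGroup sketch above, such an extension could look like this (hypothetical; the gzip level is illustrative, and gzip is safe to combine with fletcher32 because it is lossless):

```python
class CompressedChecksummedGroup(ChecksummedGroup):
    """Hypothetical variant that also compresses new datasets."""

    def create_dataset(self, name, *args, **kwargs):
        # Lossless gzip compression alongside the checksum filter
        kwargs.setdefault('compression', 'gzip')
        kwargs.setdefault('compression_opts', 4)  # moderate gzip level
        return super().create_dataset(name, *args, **kwargs)
```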

  • The author of this pull request confirms they will adhere to the code of conduct

@GarethCabournDavies (Contributor, Author)

The codeclimate issue is basically because I wanted to match the source code as closely as possible.

However, this just looks like the HDF5 file locking mechanism, and if we just want to bypass that, I can remove it.

@spxiwh (Contributor) left a comment


My concern with this is that we've copied a block of code from h5py's Group.create_group into here. What happens if h5py changes their own code? However, because h5py uses explicit class names, I don't see a better way to do this.

@GarethCabournDavies (Contributor, Author)

That's my concern as well - I will try to keep on top of any changes to that code (I've set up a daily checker for changes to the GitHub commit history).
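Something along these lines (hypothetical: the URL reflects the current layout of the h5py repository, and the recorded hash is a placeholder):

```python
import hashlib
import urllib.request

# Hash of h5py's group.py recorded when the block was copied
# (placeholder value)
KNOWN_SHA256 = "..."
UPSTREAM_URL = ("https://raw.githubusercontent.com/h5py/h5py/"
                "master/h5py/_hl/group.py")

def upstream_changed():
    """Return True if the upstream h5py group source has changed
    since the copied block was last reviewed."""
    with urllib.request.urlopen(UPSTREAM_URL) as resp:
        current = hashlib.sha256(resp.read()).hexdigest()
    return current != KNOWN_SHA256
```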

@GarethCabournDavies GarethCabournDavies merged commit 235c03a into gwastro:master Aug 2, 2024
29 of 30 checks passed
@GarethCabournDavies GarethCabournDavies deleted the hfile_checksums branch August 2, 2024 14:57
@titodalcanton (Contributor)

> This will check whether data has been corrupted in any way during

…during what?

@GarethCabournDavies (Contributor, Author)

I missed that I hadn't finished the sentence!

... during the time between write and read
