add test for truncated grib file #418
Conversation
Actually, the errors reported are not quite what you say. But I am a little confused: what's the real reason for the failure?
Here is the error in the test runner logs
This one: there are corrupted or truncated files in the NODD GCS buckets. I need to build a robust ML platform on top of this data, so I need to tolerate bad values in my model features. Not either of the others: issues in the tool chain should still cause immediate failure.
With the hack to zarr core you can see the bad chunks in the ~3 year GFS aggregation I am building. So yes, ideally I will collect the bad URLs and send them back to NODD to ask for a fix, but I can't let one bad file take down my near-real-time ML platform.
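The "tolerate bad chunks, collect bad URLs" idea above can be sketched roughly as follows. This is not kerchunk's or zarr's actual API; `read_chunk` is a hypothetical stand-in for whatever fetches and decodes one GRIB2 chunk, and the point is only the pattern: substitute the fill value on failure and record the URL and reason for later reporting to NODD.

```python
import numpy as np

def read_chunk(url):
    # Hypothetical reader; a real one would fetch bytes from GCS and
    # decode them with eccodes/cfgrib. Here we fail on a marker URL to
    # simulate a truncated GRIB2 message.
    if "bad" in url:
        raise ValueError("No final 7777 in message!")
    return np.ones((2, 2))

def tolerant_read(urls, shape=(2, 2), fill_value=np.nan):
    """Read all chunks; replace failures with fill_value and report them."""
    chunks, bad_urls = [], []
    for url in urls:
        try:
            chunks.append(read_chunk(url))
        except Exception as exc:
            bad_urls.append((url, str(exc)))
            chunks.append(np.full(shape, fill_value))
    return chunks, bad_urls

chunks, bad = tolerant_read(["gs://ok/file1", "gs://bad/file2"])
```

The `bad` list is the artifact you'd hand back to NODD, while the model features only ever see fill values where the archive was broken.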
OK, I see. Skipping bad IO and returning a list of inputs for which it failed (with reason) is actually something we do in dask-awkward, but there you can return an empty array.
Reads docs... But yes, logging is a reasonable start. I would love to have more access to the context so the error handler could live in the codec (as you suggested out of band).
Actually no, regular arrays in awkward are just numpy. The chunking is only over the outermost dimension, however, so it probably doesn't fit with what you are doing. I was mildly suggesting that the "skip and report" technique (not really documented, only used by the downstream package coffea) could be reused, but I think that's too much effort for what you are doing.
So probably we want to close this? The codec is probably the wrong place to decide to ignore a read/decode failure, since it's hard to pass in the details of the array metadata or user runtime config. |
Sure - I will reopen it if and when a solution gets merged to the main line in zarr. |
I keep hitting truncated or corrupted grib2 files in the NODD cloud archive.
For example: gs://high-resolution-rapid-refresh/hrrr.20230914/conus/hrrr.t11z.wrfsfcf12.grib2
The bucket doesn't have version history, but I was able to generate a zarr file with kerchunk that contained the references in the test code.
I keep getting error messages like
ECCODES ERROR : grib_handle_new_from_message: No final 7777 in message!
Followed by a traceback ending in:
I added that grib file in this PR along with a test that reproduces these errors. I can't see how to return the fill value from the kerchunk codec, though. I found a hatchet-job solution in zarr core. Any suggestions on how to resolve these issues when working with large external archives?
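For context on the eccodes message above: every GRIB message must end with the four ASCII bytes "7777", so a truncated download usually loses that terminator, which is exactly what "No final 7777 in message!" is complaining about. A minimal sketch of a pre-flight integrity check (the file contents below are synthetic stand-ins, not real GRIB2 data):

```python
import os
import tempfile

def ends_with_7777(path):
    """Return True if the file's last four bytes are the GRIB terminator."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return f.read(4) == b"7777"

# Demo with synthetic files standing in for real GRIB2 downloads.
with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "good.grib2")
    bad = os.path.join(d, "bad.grib2")
    with open(good, "wb") as f:
        f.write(b"GRIB" + b"\x00" * 32 + b"7777")   # terminator present
    with open(bad, "wb") as f:
        f.write(b"GRIB" + b"\x00" * 32)             # terminator missing
    good_ok = ends_with_7777(good)
    bad_ok = ends_with_7777(bad)
```

Note this only catches truncation of the final message in a file of concatenated messages; corruption mid-file would still need eccodes itself to flag it.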