
add test for truncated grib file #418

Closed

Conversation

emfdavid
Contributor

I keep hitting truncated or corrupted grib2 files in the NODD cloud archive.
For example: gs://high-resolution-rapid-refresh/hrrr.20230914/conus/hrrr.t11z.wrfsfcf12.grib2

The bucket doesn't have version history, but I was able to generate a zarr file with kerchunk that contained the references in the test code.

I keep getting error messages like ECCODES ERROR : grib_handle_new_from_message: No final 7777 in message!
Followed by a traceback ending in:

"/app/forecasting/execution/training_server.runfiles/ritta/forecasting/time_series_models/data_fetchers/nodd_fetcher.py", line 722, in _load_from_source
    return load_nodd_data(
  File "/app/forecasting/execution/training_server.runfiles/common_deps_joblib/site-packages/joblib/memory.py", line 655, in __call__
    return self._cached_call(args, kwargs)[0]
  File "/app/forecasting/execution/training_server.runfiles/common_deps_joblib/site-packages/joblib/memory.py", line 598, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_joblib/site-packages/joblib/memory.py", line 856, in call
    output = self.func(*args, **kwargs)
  File "/app/forecasting/execution/training_server.runfiles/ritta/forecasting/time_series_models/data_fetchers/nodd_fetcher.py", line 142, in load_nodd_data
    interped = ds.interp(**interp_kwargs)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/dataset.py", line 3995, in interp
    if is_duck_dask_array(var.data):
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/variable.py", line 416, in data
    return self._data.get_duck_array()
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/indexing.py", line 699, in get_duck_array
    self._ensure_cached()
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/indexing.py", line 693, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/indexing.py", line 667, in get_duck_array
    return self.array.get_duck_array()
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/indexing.py", line 554, in get_duck_array
    array = self.array[self.key]
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/backends/zarr.py", line 99, in __getitem__
    return indexing.explicit_indexing_adapter(
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/core/indexing.py", line 861, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_xarray/site-packages/xarray/backends/zarr.py", line 87, in _oindex
    return self._array.oindex[key]
  File "/app/forecasting/execution/training_server.runfiles/common_deps_zarr/site-packages/zarr/indexing.py", line 664, in __getitem__
    return self.array.get_orthogonal_selection(selection, fields=fields)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_zarr/site-packages/zarr/core.py", line 1000, in get_orthogonal_selection
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_zarr/site-packages/zarr/core.py", line 1179, in _get_selection
    self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
  File "/app/forecasting/execution/training_server.runfiles/common_deps_zarr/site-packages/zarr/core.py", line 1883, in _chunk_getitem
    self._process_chunk(out, cdata, chunk_selection, drop_axes,
  File "/app/forecasting/execution/training_server.runfiles/common_deps_zarr/site-packages/zarr/core.py", line 1826, in _process_chunk
    chunk = self._decode_chunk(cdata)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_zarr/site-packages/zarr/core.py", line 2083, in _decode_chunk
    chunk = f.decode(chunk)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_kerchunk/site-packages/kerchunk/codecs.py", line 101, in decode
    data = eccodes.codes_get_array(mid, var)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_eccodes/site-packages/gribapi/gribapi.py", line 2020, in grib_get_array
    ktype = grib_get_native_type(msgid, key)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_eccodes/site-packages/gribapi/gribapi.py", line 1956, in grib_get_native_type
    GRIB_CHECK(err)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_eccodes/site-packages/gribapi/gribapi.py", line 226, in GRIB_CHECK
    errors.raise_grib_error(errid)
  File "/app/forecasting/execution/training_server.runfiles/common_deps_eccodes/site-packages/gribapi/errors.py", line 381, in raise_grib_error
    raise ERROR_MAP[errid](errid)
gribapi.errors.KeyValueNotFoundError: Key/value not found

I added that grib file in this PR, along with a test that reproduces these errors. I can't see how to return the fill value from the kerchunk codec, though. I found a hatchet-job solution in zarr core. Any suggestions on how to resolve these issues when working with large external archives?
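For illustration, a minimal sketch of the kind of fallback being asked about here: a wrapper codec that logs a decode failure and returns a fill-value chunk instead. The class name, the explicit shape/dtype arguments, and the codec_id are assumptions made for this sketch; the real difficulty, discussed later in the thread, is that a codec normally has no access to the array metadata it would need to build that fill chunk.

```python
# Hypothetical sketch only: wrap any numcodecs codec (e.g. kerchunk's GRIB codec)
# so that a failed decode is logged and replaced by a fill-value chunk.
# FillOnErrorCodec, and passing shape/dtype in by hand, are illustrative - the
# codec is not given the zarr array metadata at decode time.
import logging

import numpy as np
from numcodecs.abc import Codec

logger = logging.getLogger(__name__)


class FillOnErrorCodec(Codec):
    codec_id = "fill_on_error"  # hypothetical id

    def __init__(self, inner, shape, dtype, fill_value=np.nan):
        self.inner = inner            # e.g. kerchunk.codecs.GRIBCodec(...)
        self.shape = tuple(shape)     # chunk shape, taken from the array metadata
        self.dtype = np.dtype(dtype)  # must be a float dtype if fill_value is NaN
        self.fill_value = fill_value

    def encode(self, buf):
        return self.inner.encode(buf)

    def decode(self, buf, out=None):
        try:
            return self.inner.decode(buf, out=out)
        except Exception:
            # e.g. eccodes raising KeyValueNotFoundError on a truncated message
            logger.exception("decode failed; filling chunk with %r", self.fill_value)
            chunk = np.full(self.shape, self.fill_value, dtype=self.dtype)
            if out is not None:
                out[...] = chunk
                return out
            return chunk
```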

@martindurant
Member

Actually the errors reported are not quite what you say..

But I am a little confused: what's the real reason for the failure:

  • the target file is really corrupted/truncated, but we are happy with all messages previous to the affected one, or
  • there is something intermittent in the chain, or
  • somehow kerchunk is causing the corrupted read?

@emfdavid
Contributor Author

emfdavid commented Feb 12, 2024

Actually the errors reported are not quite what you say..

Here is the error in the test runner logs: gribapi.errors.KeyValueNotFoundError: Key/value not found
Depending on what is happening with stdout/stderr, you should see the ECCODES error as well. The exact behavior also depends on the file system - reading past the end of a GCS file behaves differently from reading past the end of a local file.
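For illustration, a minimal preflight check for this kind of truncation, assuming fsspec and the (url, offset, length) triples that kerchunk chunk references use; the helper name is made up for the sketch.

```python
# Hedged sketch: verify that a referenced GRIB2 message actually ends with the
# final "7777" section marker before handing it to eccodes. The (url, offset,
# length) arguments mirror kerchunk chunk references; everything else is
# illustrative.
import fsspec


def message_is_complete(url: str, offset: int, length: int) -> bool:
    fs, path = fsspec.core.url_to_fs(url)
    if offset + length > fs.size(path):
        # The file is shorter than the reference expects - clearly truncated.
        return False
    # Every well-formed GRIB message ends with the ASCII bytes "7777".
    tail = fs.read_block(path, offset + length - 4, 4)
    return tail == b"7777"


# Example (offset/length would come from the kerchunk reference for the bad chunk):
# message_is_complete(
#     "gs://high-resolution-rapid-refresh/hrrr.20230914/conus/hrrr.t11z.wrfsfcf12.grib2",
#     offset, length,
# )
```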

But I am a little confused: what's the real reason for the failure:

  • the target file is really corrupted/truncated, but we are happy with all messages previous to the affected one, or

This - there are corrupted or truncated files in the NODD GCS buckets.

[Screenshot: 2024-02-12 10:15 AM]

I need to build a robust ML platform on top of this data and I need to tolerate bad values in my model features.

  • there is something intermittent in the chain, or
  • somehow kerchunk is causing the corrupted read?

Neither of these - I would like issues in the tool chain to cause an immediate failure.

With the hack to zarr core, you can see the bad chunks in the ~3-year GFS aggregation I am building:

[Screenshot: 2024-02-12 10:09 AM]

So yes - ideally I will collect the bad URLs and send them back to NODD to ask for a fix, but I can't let one bad file take down my near-real-time ML platform.
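For illustration, a rough sketch of that collection step, assuming the message_is_complete helper sketched above and the "refs" mapping layout of a kerchunk reference JSON; the function name is made up.

```python
# Rough sketch, not part of this PR: scan a kerchunk reference set and collect
# the chunk references whose GRIB messages fail the completeness check, so the
# bad object URLs can be logged and reported back to NODD.
import json


def find_bad_references(ref_json_path):
    with open(ref_json_path) as f:
        refs = json.load(f)["refs"]
    bad = []
    for key, ref in refs.items():
        # Chunk references are [url, offset, length]; inline data and metadata
        # entries are plain strings and can be skipped.
        if isinstance(ref, list) and len(ref) == 3:
            url, offset, length = ref
            if not message_is_complete(url, offset, length):
                bad.append((key, url, offset, length))
    return bad
```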

@martindurant
Member

OK, I see. Skipping bad IO and returning a list of inputs for which it failed (with the reason) is actually something we do in dask-awkward, but there you can return an empty array [] rather than having to fill NaNs. Since logging solves it for you, there's no need for that extra complexity.

@emfdavid
Contributor Author

dask-awkward

Reads docs....
Okay - yes - that is amazing. I don't want awkward arrays - I assume there would be a pretty big performance penalty for regular arrays?

But yes, logging is a reasonable start - I would love to have more access to the context to put the error handler in the codec (as you suggested out of band).

@martindurant
Member

I assume there would be pretty big performance penalty for regular arrays?

Actually no, regular arrays in awkward are just numpy. The chunking is only over the outermost dimension, however, which probably doesn't fit with what you are doing. I was mildly suggesting that the "skip and report" technique (not really documented, only used by the downstream package coffea) could be reused, but I think that's too much effort for what you are doing.

@martindurant
Member

So probably we want to close this? The codec is probably the wrong place to decide to ignore a read/decode failure, since it's hard to pass in the details of the array metadata or user runtime config.

@emfdavid
Contributor Author

Sure - I will reopen it if and when a solution gets merged to the main line in zarr.
The artifact is here for anyone else who hits the issue.

@emfdavid emfdavid closed this Feb 12, 2024