
Expressive handler for fsspec block_size ValueError #142

Merged
merged 17 commits on Jun 10, 2021

Conversation

cisaacstern
Member

Here's my attempt at a PR to address the suggestion raised in pangeo-forge/staged-recipes#31 (comment).

I am not certain whether the existing tests will cover this case, or if an additional test is required. If the latter, I'd appreciate guidance from others on how to implement an appropriate test.

Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
Contributor

@rabernat rabernat left a comment


Thanks so much for getting this started @cisaacstern!

One comment below.

@cisaacstern
Member Author

Big thanks to @martindurant for merging fsspec/filesystem_spec#646 🙏

I've now updated this PR to reflect the presence of the newly added upstream error. (Our checks are not passing because the latest release of fsspec does not yet include the new BlockSizeError.)

A brief overview of this PR's user experience follows. All outputs below reflect behavior with the unreleased fsspec installed from GitHub and pangeo-forge-recipes installed from this PR.

  1. We know that the source file server for the NASA SMAP recipe which motivated this PR exhibits the issue in question, so we can use it as an example:
from smap_recipe import recipes  # `smap_recipe` is the above-linked recipe module

rec = recipes['NASA-SMAP-SSS/JPL/8day']  # select an arbitrary recipe from the `recipes` dict

rec.fsspec_open_kwargs = {}  # clear its `fsspec_open_kwargs` to provoke the desired error
  2. Set metadata and caching targets. h/t @TomAugspurger for demonstrating a simple way to do this in Serialize file patterns #117 (comment):
import fsspec
from pangeo_forge_recipes.storage import MetadataTarget, CacheFSSpecTarget, FSSpecTarget

# in-memory filesystems stand in for GCS and OSN for this demonstration
fs_gcs = fsspec.get_filesystem_class("memory")("gcs")
fs_osn = fsspec.get_filesystem_class("memory")("osn")

target_base = 's3://Pangeo/pangeo-forge'
cache_base = 'gs://pangeo-forge-us-central1/pangeo-forge-cache'
metadata_base = 'gs://pangeo-forge-us-central1/pangeo-forge-metadata'
endpoint_url = 'https://ncsa.osn.xsede.org'

feedstock_name = 'block_error_test'
fmt = 'zarr'

for recipe_key, r in recipes.items():
    recipe_name = f'{feedstock_name}/{recipe_key}'
    r.input_cache = CacheFSSpecTarget(fs_gcs, f"{cache_base}/{recipe_name}")
    r.metadata_cache = MetadataTarget(fs_gcs, f"{metadata_base}/{recipe_name}")
    r.target = FSSpecTarget(fs_osn, f"{target_base}/{recipe_name}.{fmt}")
  3. Now when we attempt to cache the first input, we get a descriptive FSSpecOpenKwargsError including the suggested workaround:
rec.cache_input((0,))
Traceback (TL;DR: the new errors work as expected!)
---------------------------------------------------------------------------
BlockSizeError                            Traceback (most recent call last)
~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/storage.py in _copy_btw_filesystems(input_opener, output_opener, BLOCK_SIZE)
     33                     logger.debug("_copy_btw_filesystems reading data")
---> 34                     data = source.read(BLOCK_SIZE)
     35                     if not data:

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    519             length = min(self.size - self.loc, length)
--> 520         return super().read(length)
    521 

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/spec.py in read(self, length)
   1468             return b""
-> 1469         out = self.cache._fetch(self.loc, self.loc + length)
   1470         self.loc += len(out)

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/caching.py in _fetch(self, start, end)
    375             # First read, or extending both before and after
--> 376             self.cache = self.fetcher(start, bend)
    377             self.start = start

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     84         self = obj or args[0]
---> 85         return sync(self.loop, func, *args, **kwargs)
     86 

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     65     if isinstance(result[0], BaseException):
---> 66         raise result[0]
     67     return result[0]

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     21     try:
---> 22         result[0] = await coro
     23     except Exception as ex:

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in async_fetch_range(self, start, end)
    582                         if cl > end - start:
--> 583                             raise BlockSizeError(
    584                                 "Got more bytes so far (>%i) than requested (%i)"

BlockSizeError: Got more bytes so far (>15244197) than requested (15242880)

The above exception was the direct cause of the following exception:

FSSpecOpenKwargsError                     Traceback (most recent call last)
<ipython-input-3-64bbec8bba7c> in <module>
----> 1 rec.cache_input((0,))

~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py in cache_input(self, input_key)
    291             logger.info(f"Caching input '{input_key}'")
    292             fname = self.file_pattern[input_key]
--> 293             self.input_cache.cache_file(fname, **self.fsspec_open_kwargs)
    294 
    295         if self._cache_metadata:

~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/storage.py in cache_file(self, fname, **open_kwargs)
    151         target_opener = self.open(fname, mode="wb")
    152         logger.info(f"Coping remote file '{fname}' to cache")
--> 153         _copy_btw_filesystems(input_opener, target_opener)
    154 
    155 

~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/storage.py in _copy_btw_filesystems(input_opener, output_opener, BLOCK_SIZE)
     38                     target.write(data)
     39                 except BlockSizeError as e:
---> 40                     raise FSSpecOpenKwargsError(
     41                         'Server does not permit random access to this file via Range requests. '
     42                         'Try re-instantiating recipe with fsspec_open_kwargs = {"block_size": 0}'

FSSpecOpenKwargsError: Server does not permit random access to this file via Range requests. Try re-instantiating recipe with fsspec_open_kwargs = {"block_size": 0}
  4. Finally, following the message's recommended workaround resolves the issue, as expected:
rec.fsspec_open_kwargs = {"block_size": 0}
rec.cache_input((0,))
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Caching input '(0,)'
pangeo_forge_recipes.storage - INFO - Caching file 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/smap/L3/JPL/V5.0/8day_running/2015/120/SMAP_L3_SSS_20150504_8DAYS_V5.0.nc'
pangeo_forge_recipes.storage - INFO - File 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/smap/L3/JPL/V5.0/8day_running/2015/120/SMAP_L3_SSS_20150504_8DAYS_V5.0.nc' is already cached
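For anyone curious what the workaround actually does: `block_size=0` tells fsspec's HTTP implementation to stream the file sequentially rather than fetch it with ranged reads, which is why it sidesteps servers that mishandle Range requests. A minimal runnable sketch, using fsspec's in-memory filesystem so no network access is needed (the path `example.nc` and its contents are illustrative only):

```python
import fsspec

# The kwarg is passed straight through `fsspec.open`, so against a real
# HTTP server the workaround looks like:
#
#   with fsspec.open(url, "rb", block_size=0) as f:
#       data = f.read()
#
# Runnable illustration with the in-memory filesystem:
fs = fsspec.filesystem("memory")
with fs.open("/example.nc", "wb") as f:
    f.write(b"not really netCDF")  # stand-in for real file contents

with fsspec.open("memory://example.nc", "rb", block_size=0) as f:
    data = f.read()
```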

IMHO, this is ready to merge following the next release of fsspec.

Contributor

@rabernat rabernat left a comment


This looks great @cisaacstern! Just a few suggestions.

We could also install fsspec from master in our CI environments to get the tests to pass. Or, as you suggest, wait for an fsspec release before merging. @martindurant - any idea when the next release might be?

cisaacstern and others added 4 commits June 3, 2021 15:04
Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
@cisaacstern
Member Author

We could also install fsspec from master in our CI environments to get the tests to pass. Or, as you suggest, wait for an fsspec release before merging. @martindurant - any idea when the next release might be?

Another good option. We can decide based on Martin's estimate of when the next release will be.

And for the sake of completeness, here's the updated Traceback, following incorporation of @rabernat's suggestions to: (a) wrap only a single line in the try/except block; (b) raise a ValueError instead of a custom error; and (c) add backticks to, and remove spaces from, the message.

updated Traceback
---------------------------------------------------------------------------
BlockSizeError                            Traceback (most recent call last)
~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/storage.py in _copy_btw_filesystems(input_opener, output_opener, BLOCK_SIZE)
     33                 try:
---> 34                     data = source.read(BLOCK_SIZE)
     35                 except BlockSizeError as e:

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    519             length = min(self.size - self.loc, length)
--> 520         return super().read(length)
    521 

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/spec.py in read(self, length)
   1468             return b""
-> 1469         out = self.cache._fetch(self.loc, self.loc + length)
   1470         self.loc += len(out)

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/caching.py in _fetch(self, start, end)
    375             # First read, or extending both before and after
--> 376             self.cache = self.fetcher(start, bend)
    377             self.start = start

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     84         self = obj or args[0]
---> 85         return sync(self.loop, func, *args, **kwargs)
     86 

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     65     if isinstance(result[0], BaseException):
---> 66         raise result[0]
     67     return result[0]

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     21     try:
---> 22         result[0] = await coro
     23     except Exception as ex:

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in async_fetch_range(self, start, end)
    582                         if cl > end - start:
--> 583                             raise BlockSizeError(
    584                                 "Got more bytes so far (>%i) than requested (%i)"

BlockSizeError: Got more bytes so far (>15317853) than requested (15242880)

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-3-64bbec8bba7c> in <module>
----> 1 rec.cache_input((0,))

~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py in cache_input(self, input_key)
    291             logger.info(f"Caching input '{input_key}'")
    292             fname = self.file_pattern[input_key]
--> 293             self.input_cache.cache_file(fname, **self.fsspec_open_kwargs)
    294 
    295         if self._cache_metadata:

~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/storage.py in cache_file(self, fname, **open_kwargs)
    151         target_opener = self.open(fname, mode="wb")
    152         logger.info(f"Coping remote file '{fname}' to cache")
--> 153         _copy_btw_filesystems(input_opener, target_opener)
    154 
    155 

~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/storage.py in _copy_btw_filesystems(input_opener, output_opener, BLOCK_SIZE)
     34                     data = source.read(BLOCK_SIZE)
     35                 except BlockSizeError as e:
---> 36                     raise ValueError(
     37                         "Server does not permit random access to this file via Range requests. "
     38                         'Try re-instantiating recipe with `fsspec_open_kwargs={"block_size": 0}`'

ValueError: Server does not permit random access to this file via Range requests. Try re-instantiating recipe with `fsspec_open_kwargs={"block_size": 0}`

@cisaacstern
Member Author

@rabernat, fsspec has a new release, which I've pinned in the CI configs in the last commit here. All checks now pass, so IMO this is ready to merge.

@rabernat rabernat merged commit 2c6de7a into pangeo-forge:master Jun 10, 2021