Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays... #60

Closed
abarciauskas-bgse opened this issue Mar 27, 2024 · 4 comments


abarciauskas-bgse (Collaborator) commented Mar 27, 2024

Testing with a "real world dataset" (s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1) mostly worked, with a few changes required, which are present in https://github.com/TomNicholas/VirtualiZarr/tree/ab/testing-mursst. Specifically:

  • We need a way to read from S3 - perhaps this should be a separate issue. I wrote in a workaround but probably need a more thought-out solution (EDIT: Generating references from files in S3 (using kerchunk + fsspec) #61; see the sketch after this list).
  • I got pydantic errors for the filters property on ZArray and Codecs, which this dataset returns as a list of dictionaries rather than a string. (A list of dicts appears to conform to the Zarr v2 storage spec, but I'm not sure whether something changed in v3 or whether filters are expected to be encoded as a string.) I changed the type to List[Dict].
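
For context, here is a minimal sketch of the kerchunk + fsspec approach that #61 proposes (my assumptions: valid Earthdata S3 credentials are already set in the environment, and kerchunk's SingleHdf5ToZarr is what generates the references):

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = 's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

# open the remote file and translate its HDF5 metadata into kerchunk references
with fsspec.open(url, mode='rb') as f:
    refs = SingleHdf5ToZarr(f, url).translate()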

With those changes in place, I was able to create the virtual zarr datasets, but when trying to write the combined references to JSON, I got this error: *** TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>, which I haven't been able to figure out yet.

Here is my code to replicate:

from virtualizarr import open_virtual_dataset
import xarray as xr
# first get + set credentials from https://archive.podaac.earthdata.nasa.gov/s3credentials
vds1 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    # we have to specify the filetype to avoid trying to open the dataset with NetCDF4
    filetype='netcdf4'
)
vds2 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    filetype='netcdf4'
)
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
combined_vds['analysed_sst'].data.manifest.dict() # this works

# combined_vds.virtualize.to_kerchunk('combined.json', format='json')
# results in
# *** TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>
abarciauskas-bgse changed the title from "Trying to write combined virtual dataset results in *** TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>" to "Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays..." on Mar 27, 2024
TomNicholas (Member) commented Mar 27, 2024

Thanks @abarciauskas-bgse !

We need a way to read from S3

This I hadn't thought about yet. Thoughts and PRs welcome (and it deserves a separate issue - #61).

I got pydantic errors for the filters property on ZArray and Codecs

Thanks for reporting that. Do I reproduce just by calling .filters?

not sure if something changed in v3

The current code is supposed to work with v2, but there will be differences to smooth over (xref #17). Ideally I would be able to import classes directly from zarr-python to handle all of that.

TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>

That's this error, which @norlandrhagen also reported. It will require another upstream adjustment to xarray to fix. In the meantime you should be able to avoid it by not creating indexes (i.e. pass indexes={} to open_virtual_dataset).
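
For example (a sketch of that workaround, reusing one of the MUR SST files from above):

from virtualizarr import open_virtual_dataset

vds1 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    filetype='netcdf4',
    indexes={},  # skip index creation so coordinate arrays stay as ManifestArrays
)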

abarciauskas-bgse (Collaborator, Author) commented

Apologies, this is one issue which can probably now be separated into three issues, two of which are open.

S3

@TomNicholas Thanks for opening #61. I may take a closer look at how we could incorporate reading from S3 tomorrow.

Filters typing

I am getting pydantic errors when using open_virtual_dataset for this dataset. I changed the type of the filters property to Optional[List[Dict]] (a sketch of the change follows the traceback below); otherwise the traceback looks like this:

Traceback (most recent call last):
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/testing-virtualizarr.py", line 4, in <module>
    vds1 = open_virtual_dataset(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/xarray.py", line 74, in open_virtual_dataset
    vds = dataset_from_kerchunk_refs(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/xarray.py", line 111, in dataset_from_kerchunk_refs
    vars[var_name] = variable_from_kerchunk_refs(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/xarray.py", line 135, in variable_from_kerchunk_refs
    chunk_dict, zarray, zattrs = kerchunk.parse_array_refs(arr_refs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/kerchunk.py", line 144, in parse_array_refs
    zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/zarr.py", line 74, in from_kerchunk_refs
    return ZArray(
           ^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/eoapi/infrastructure/aws/.venv/lib/python3.11/site-packages/pydantic/main.py", line 150, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 1 validation error for ZArray
filters
  Input should be a valid string [type=string_type, input_value=[{'elementsize': 2, 'id':...d': 'zlib', 'level': 7}], input_type=list]
    For further information visit https://errors.pydantic.dev/2.1.2/v/string_type
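
For reference, the change I made is roughly the following (a sketch only; the real ZArray model has more fields, which are elided here):

from typing import Dict, List, Optional
from pydantic import BaseModel

class ZArray(BaseModel):
    # kerchunk refs for this dataset supply filters as a list of codec
    # config dicts (see the ValidationError above), not as a string
    filters: Optional[List[Dict]] = None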

Would it help if I created a separate issue for this error with a minimally reproducible example (via an artificially generated dataset, perhaps)?

TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>

FWIW I get this error even when passing indexes={} to open_virtual_dataset, but I don't have more information about why yet.

TomNicholas (Member) commented

I may take a closer look at how we could incorporate reading from S3 tomorrow.

That would be awesome, especially as solving that issue seems quite separate from the guts of the rest of the package.


Would it help if I created a separate issue for this error with a minimally reproducible example (via an artificially generated dataset, perhaps?)

That would certainly be the most correct way to move forward! But if you think the fix is just a simple type-hint change, then I'm happy to just accept a PR for that.


FWIW I get this error even when passing indexes={} to open_virtual_dataset, but I don't have more information about why yet.

That's weird. Are you sure you're using both the most recent version of this package (i.e. main, because I haven't released it yet) and the forked branch of xarray (see #14 (comment))?

You will get this error when your virtual dataset contains any arrays that are not ManifestArrays. In your case it will be because the coordinate arrays are somehow being accidentally coerced to np.ndarray inside xr.concat (a diagnostic sketch follows below). We could actually imagine writing these out to disk anyway; see #62.
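
A quick way to check which variables were coerced (a sketch; it assumes ManifestArray is importable from virtualizarr.manifests):

from virtualizarr.manifests import ManifestArray

# variables whose underlying data is no longer a ManifestArray are the
# ones that cannot be serialized to kerchunk references
coerced = [
    name
    for name, var in combined_vds.variables.items()
    if not isinstance(var.data, ManifestArray)
]
print(coerced)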

abarciauskas-bgse (Collaborator, Author) commented

@TomNicholas you were right: I had not correctly installed the forked branch of xarray in my testing. For future reference:

pip install xarray@git+https://github.com/TomNicholas/xarray@concat-no-indexes

Once I had verified the forked version was installed, I ran the example again, and the following completes without error:

from virtualizarr import open_virtual_dataset
import xarray as xr

# first get + set credentials from https://archive.podaac.earthdata.nasa.gov/s3credentials
vds1 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    # we have to specify the filetype to avoid trying to open the dataset with NetCDF4
    filetype='netcdf4',
    indexes={}
)
vds2 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    filetype='netcdf4',
    indexes={}
)
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
combined_vds['analysed_sst'].data.manifest.dict() # this works

combined_vds.virtualize.to_kerchunk('combined.json', format='json')
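
As a sanity check, the written references can be opened back up through fsspec's reference filesystem (a sketch of standard kerchunk usage rather than a VirtualiZarr API; it again assumes S3 credentials are set):

import xarray as xr

ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': 'combined.json',    # the reference file written above
            'remote_protocol': 's3',  # the chunks themselves live in S3
        },
    },
)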

This issue is now covered by #61 and #65, so closing.
