temperature on Flask Glacier for UNAVCO #1

Open · 1 of 3 tasks
jkingslake opened this issue Feb 10, 2023 · 16 comments

Comments

@jkingslake
Member

jkingslake commented Feb 10, 2023

Use AP RACMO output to compute an annual average air temperature and a seasonal climatology.

We can use this as impetus to put the AP RACMO data that @ecglazer has into the google bucket.

  • write a notebook to load the AP RACMO netcdfs as one dataset with all the variables in it
  • write it to a zarr in the google bucket, e.g.:
import fsspec
filename = 'gs://ldeo-glaciology/RACMO/AP'
mapper = fsspec.get_mapper(filename, mode='w', token=token)
ds.to_zarr(mapper)
  • write a notebook to read the zarr and do the computations of mean and climatological temperatures etc. (see the sketch below)
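A minimal sketch of that third step, assuming the merged dataset has been written to a zarr store with a t2m variable and a time coordinate (the store path below is illustrative):

import xarray as xr

# lazily open the zarr store written in the previous step (path is illustrative)
racmo = xr.open_dataset('gs://ldeo-glaciology/RACMO/AP', engine='zarr', chunks={})

# annual average 2 m air temperature
annual_mean_t2m = racmo.t2m.groupby('time.year').mean('time')

# seasonal climatology: average for each calendar month over all years
monthly_climatology_t2m = racmo.t2m.groupby('time.month').mean('time')

annual_mean_t2m.load()
monthly_climatology_t2m.load()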
@jkingslake changed the title from "temperature on Flack for UNAVCO" to "temperature on Flask Glacier for UNAVCO" on Feb 15, 2023
@jkingslake
Member Author

jkingslake commented Feb 15, 2023

This notebook @ecglazer wrote does the first item above.

Something to be careful about before we write it to the google bucket is the chunk sizes:

  1. are the chunk sizes reasonable in terms of MBs? 50-200 MB each seems reasonable.
  2. are they as uniform as possible within each dataset? For a write to zarr they need to be the same size, except possibly the last one. I have had trouble in the past with the concat operation creating variably sized chunks and the write to zarr breaking.

@ecglazer, what do you get when you run

racmo.t2m.chunks

?
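For reference, a sketch of checking and enforcing uniform chunking before the write (the chunk size here is illustrative; aim for something in the 50-200 MB range):

# inspect the current dask chunking of the temperature variable
print(racmo.t2m.chunks)

# rechunk to uniform chunks along time before writing to zarr
racmo = racmo.chunk({'time': 365})  # 365 is illustrative; pick a size giving ~50-200 MB chunks
print(racmo.t2m.chunks)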

@jkingslake
Member Author

and I'll send you the token separately, so it doesn't go into this public repo

@ecglazer
Contributor

Just made a pull request with a new version of the notebook that splits the dataset into appropriately sized chunks - they're each about 79 MiB.

When I try to write to the google bucket, it crashes the kernel. The daily resolution dataset is ~94 GB and the 3-hourly resolution dataset is ~7 GB. This is when using the Pangeo cloud.

@jkingslake
Member Author

Thanks for working on this!

It's useful to link to the pull request here so we see the code.

Why is the higher resolution dataset smaller in volume?

Are you using a dask cluster?
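For reference, a typical way to start one on a Pangeo-style hub is via dask-gateway (a sketch; the worker count is illustrative and the exact setup depends on the deployment):

from dask_gateway import Gateway

gateway = Gateway()              # connect to the hub's dask-gateway
cluster = gateway.new_cluster()  # start a new cluster
cluster.scale(20)                # request 20 workers (illustrative)
client = cluster.get_client()    # attach a client so dask computations use the cluster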

@ecglazer
Contributor

No problem! Here is the pull request: #3

The higher res dataset is smaller because it only covers a few years (2016-2021) and only includes t2m, while the daily dataset covers 1979-2018 and includes several variables.

I'm not familiar with how to use a dask cluster, but I can look into it

@jkingslake
Member Author

@ecglazer just showed me that the dask cluster is crashing when trying to write even small versions of the RACMO data to zarr. This could be due to the dataset being made up of many, many dataarrays. But I think the most likely issue is that the netcdfs being read are stored in @ecglazer's pangeo notebook workspace.

A better option is probably to put the netcdfs in the google bucket first, then read them from there, as described here.

I will also need to add you as a user in the google cloud account.
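One way to do that upload from a notebook is with gcsfs (a sketch; the paths are illustrative and token is the write token mentioned above):

import gcsfs

fs = gcsfs.GCSFileSystem(token=token)

# copy a local netcdf into the bucket (paths are illustrative)
fs.put('RACMO_daily_t2m.nc',
       'ldeo-glaciology/RACMO/RACMO_daily_by_var/RACMO_daily_t2m.nc')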

@jkingslake
Member Author

You will also need to download and install the google cloud command-line interface: https://cloud.google.com/sdk/gcloud#download_and_install_the

@ecglazer
Contributor

ecglazer commented Apr 18, 2023

Thanks for the help, @jkingslake. I put all of the daily data in the Google bucket here: https://console.cloud.google.com/storage/browser/ldeo-glaciology/RACMO/RACMO_daily_by_var
The data is separated into netcdf files by variable. Let me know if there are any issues.

@jkingslake
Member Author

That's great. Have you managed to lazily load them?
I would try saving a small part of one of them to zarr and see if you still have the same issue with the notebook memory filling up quickly.
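For example, something like this (a sketch; ds is one of the lazily loaded datasets, token is the write token, and the subset size and store path are illustrative):

import fsspec

# take a small subset: the first 100 time steps of one variable
small = ds[['t2m']].isel(time=slice(0, 100))

# write the subset to a test zarr store in the bucket (path is illustrative)
mapper = fsspec.get_mapper('gs://ldeo-glaciology/RACMO/test_small', mode='w', token=token)
small.to_zarr(mapper)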

@ecglazer
Contributor

Yes, I'm able to lazily load each dataset in Jupyter, but the notebook memory still fills up quickly when I try to save to zarr (I tried with just 100 timesteps of one dataarray)

@jkingslake
Member Author

hmm, ok.

The only thing I have found so far that might help is that the following fails in the same way you were finding.

import xarray as xr
import gcsfs
gcs = gcsfs.GCSFileSystem()
url = 'gs://ldeo-glaciology/RACMO/RACMO_daily_by_var/RACMO_daily_t2m.nc'
of = gcs.open(url, mode='rb')
ds = xr.open_dataset(of, chunks={'time':-1}) 
ds.t2m.mean().compute()

This loads the data as one big chunk, which I think is why it fails.

While the following successfully provides the mean value of t2m (265.21317 K)

import xarray as xr
import gcsfs
gcs = gcsfs.GCSFileSystem()
url = 'gs://ldeo-glaciology/RACMO/RACMO_daily_by_var/RACMO_daily_t2m.nc'
of = gcs.open(url, mode='rb')
ds = xr.open_dataset(of, chunks={'time': 100})  # chunk along time so the work can be spread across workers (chunk size here is illustrative)
ds.t2m.mean().compute()

This chunks the data as it is loaded and makes it possible to spread the computation between multiple workers - I used 20 in this test and it took a couple of minutes. ds.nbytes/1e9 is ~13 in these examples.

@jkingslake
Member Author

jkingslake commented Apr 18, 2023

update:
I found that the following successfully writes t2m to zarr and reloads it.

import fsspec
import json

filename = 'gs://ldeo-glaciology/RACMO/AP_new/test_JK/t2m_all_v1'
mapper = fsspec.get_mapper(filename, mode='w', token=token)

ds.to_zarr(mapper)

# check that we can reload the whole thing
t2m_reloaded = xr.open_dataset(filename, engine='zarr', consolidated=True, chunks={})

Incidentally, taking the mean of all the t2m data with t2m_reloaded.t2m.mean().load() is much faster in this case because it is loading from a zarr rather than a netcdf, as in the case above. Computing the mean from the netcdf stored in the google bucket takes a couple of minutes (as above), while computing the mean t2m from the zarr only takes 16 s.

A notebook demonstrating all this can be found here: https://github.com/ldeo-glaciology/AntPen_NSF_NERC/blob/chunk_edits/RACMO_Loading_JK.ipynb

@jkingslake
Member Author

P.S. I am using the LEAP Pangeo.

@jkingslake
Member Author

@ecglazer, did you end up making any more progress on this?

@jkingslake
Member Author

jkingslake commented Apr 3, 2024

@ecglazer, I have added a notebook on writing the full AP RACMO data to zarr: https://github.com/ldeo-glaciology/AntPen_NSF_NERC/blob/racmo_zarr_JK/merging_RACMO_vars.ipynb

The full AP RACMO dataset can be loaded with

racmo_AP = xr.open_dataset('gs://ldeo-glaciology/RACMO/JK_tests/all_vars_full_1' , engine='zarr', consolidated=True, chunks={}) 

When you get a chance, could you go through the directories in https://console.cloud.google.com/storage/browser/ldeo-glaciology/RACMO/ and delete what you don't need anymore?
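For reference, the directories can also be listed and removed from a notebook with gcsfs (a sketch; the path being deleted is illustrative and token is a write token):

import gcsfs

fs = gcsfs.GCSFileSystem(token=token)

# list the RACMO directories in the bucket
print(fs.ls('ldeo-glaciology/RACMO'))

# remove a directory that is no longer needed (path is illustrative)
fs.rm('ldeo-glaciology/RACMO/some_old_directory', recursive=True)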

@jkingslake
Member Author

@ecglazer, did you get a chance to go through the directories in https://console.cloud.google.com/storage/browser/ldeo-glaciology/RACMO/ and delete what you don't need anymore?
