temperature on Flask Glacier for UNAVCO #1

Open · 1 of 3 tasks
jkingslake opened this issue Feb 10, 2023 · 16 comments

Comments

@jkingslake
Member

jkingslake commented Feb 10, 2023

Use AP RACMO output to compute an annual average air temperature and a seasonal climatology.

We can use this as impetus to put the AP RACMO data that @ecglazer has into the google bucket.

  • write a notebook to load the AP RACMO netcdfs as one dataset with all the variables in it
  • write it to a zarr in the google bucket, e.g.:
import fsspec
filename = 'gs://ldeo-glaciology/RACMO/AP'
mapper = fsspec.get_mapper(filename, mode='w', token=token)
ds.to_zarr(mapper)
  • write a notebook to read the zarr and do the computations of mean and climatological temperatures etc. (see the sketch below)
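A minimal sketch of that third step, assuming the merged dataset has been written to a zarr store with a t2m variable and a time coordinate (the store path below is illustrative):

import xarray as xr

# lazily open the zarr store written in the previous step (path is illustrative)
racmo = xr.open_dataset('gs://ldeo-glaciology/RACMO/AP', engine='zarr', chunks={})

# annual average 2 m air temperature
annual_mean_t2m = racmo.t2m.groupby('time.year').mean('time')

# seasonal climatology: average for each calendar month over all years
monthly_climatology_t2m = racmo.t2m.groupby('time.month').mean('time')

annual_mean_t2m.load()
monthly_climatology_t2m.load()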
@jkingslake changed the title from "temperature on Flack for UNAVCO" to "temperature on Flask Glacier for UNAVCO" on Feb 15, 2023
@jkingslake
Member Author

jkingslake commented Feb 15, 2023

This notebook @ecglazer wrote does the first item above.

Something to be careful about before we write it to the google bucket is the chunk sizes:

  1. are the chunk sizes reasonable in terms of MBs? 50-200 MB each seems reasonable.
  2. are they as uniform as possible within each dataset? For a write to zarr they need to be the same size, except possibly the last one. I have had trouble in the past with the concat operation creating variably sized chunks and the write to zarr breaking.

@ecglazer, what do you get when you run

racmo.t2m.chunks

?
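For reference, a sketch of checking and enforcing uniform chunking before the write (the chunk size here is illustrative; aim for something in the 50-200 MB range):

# inspect the current dask chunking of the temperature variable
print(racmo.t2m.chunks)

# rechunk to uniform chunks along time before writing to zarr
racmo = racmo.chunk({'time': 365})  # 365 is illustrative; pick a size giving ~50-200 MB chunks
print(racmo.t2m.chunks)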

@jkingslake
Member Author

and I'll send you the token separately, so it doesn't go into this public repo

@ecglazer
Contributor

Just made a pull request with a new version of the notebook that splits the dataset into appropriately sized chunks - they're each about 79 MiB.

When I try to write to the google bucket, it crashes the kernel. The daily resolution dataset is ~94 GB and the 3-hourly resolution dataset is ~7 GB. This is when using the Pangeo cloud.

@jkingslake
Member Author

Thanks for working on this!

It's useful to link to the pull request here so we see the code.

Why is the higher resolution dataset smaller in volume?

Are you using a dask cluster?
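For reference, a typical way to start one on a Pangeo-style hub is via dask-gateway (a sketch; the worker count is illustrative and the exact setup depends on the deployment):

from dask_gateway import Gateway

gateway = Gateway()              # connect to the hub's dask-gateway
cluster = gateway.new_cluster()  # start a new cluster
cluster.scale(20)                # request 20 workers (illustrative)
client = cluster.get_client()    # attach a client so dask computations use the cluster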

@ecglazer
Contributor

No problem! Here is the pull request: #3

The higher res dataset is smaller because it only covers a few years (2016-2021) and only includes t2m, while the daily dataset covers 1979-2018 and includes several variables.

I'm not familiar with how to use a dask cluster, but I can look into it

@jkingslake
Member Author

@ecglazer just showed me that the dask cluster is crashing when trying to write even small versions of the RACMO data to zarr. This could be due to the dataset being made up of many, many dataarrays. But I think the most likely issue is that the netcdfs being read are stored in @ecglazer's pangeo notebook workspace.

A better option is probably to put the netcdfs in the google bucket first, then read them from there, as described here.

I will also need to add you as a user in the google cloud account.
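One way to do that upload from a notebook is with gcsfs (a sketch; the paths are illustrative and token is the write token mentioned above):

import gcsfs

fs = gcsfs.GCSFileSystem(token=token)

# copy a local netcdf into the bucket (paths are illustrative)
fs.put('RACMO_daily_t2m.nc',
       'ldeo-glaciology/RACMO/RACMO_daily_by_var/RACMO_daily_t2m.nc')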

@jkingslake
Member Author

You will also need to download and install the google cloud command-line interface: https://cloud.google.com/sdk/gcloud#download_and_install_the

@ecglazer
Contributor

ecglazer commented Apr 18, 2023

Thanks for the help, @jkingslake. I put all of the daily data in the Google bucket here: https://console.cloud.google.com/storage/browser/ldeo-glaciology/RACMO/RACMO_daily_by_var
The data is separated into netcdf files by variable. Let me know if there are any issues.

@jkingslake
Member Author

That's great. Have you managed to lazily load them?
I would try saving a small part of one of them to zarr and see if you still have the same issue with the notebook memory filling up quickly.
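For example, something like this (a sketch; ds is one of the lazily loaded datasets, token is the write token, and the subset size and store path are illustrative):

import fsspec

# take a small subset: the first 100 time steps of one variable
small = ds[['t2m']].isel(time=slice(0, 100))

# write the subset to a test zarr store in the bucket (path is illustrative)
mapper = fsspec.get_mapper('gs://ldeo-glaciology/RACMO/test_small', mode='w', token=token)
small.to_zarr(mapper)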

@ecglazer
Contributor

Yes, I'm able to lazily load each dataset in Jupyter, but the notebook memory still fills up quickly when I try to save to zarr (I tried with just 100 timesteps of one dataarray)

@jkingslake
Member Author

hmm, ok.

The only thing I have found so far that might help is that the following fails in the same way you were finding.

import xarray as xr
import gcsfs
gcs = gcsfs.GCSFileSystem()
url = 'gs://ldeo-glaciology/RACMO/RACMO_daily_by_var/RACMO_daily_t2m.nc'
of = gcs.open(url, mode='rb')
ds = xr.open_dataset(of, chunks={'time':-1}) 
ds.t2m.mean().compute()

This loads the data as one big chunk, which I think is why it fails.

While the following successfully provides the mean value of t2m (265.21317 K)

import xarray as xr
import gcsfs
gcs = gcsfs.GCSFileSystem()
url = 'gs://ldeo-glaciology/RACMO/RACMO_daily_by_var/RACMO_daily_t2m.nc'
of = gcs.open(url, mode='rb')
ds = xr.open_dataset(of, chunks={'time': 100})  # chunk along time so the work can be spread across workers (chunk size here is illustrative)
ds.t2m.mean().compute()

This chunks the data as it is loaded and makes it possible to spread the computation between multiple workers - I used 20 in this test and it took a couple of minutes. ds.nbytes/1e9 is ~13 in these examples.

@jkingslake
Member Author

jkingslake commented Apr 18, 2023

update:
I found that the following successfully writes t2m to zarr and reloads it.

import fsspec
import json

filename = 'gs://ldeo-glaciology/RACMO/AP_new/test_JK/t2m_all_v1'
mapper = fsspec.get_mapper(filename, mode='w', token=token)

ds.to_zarr(mapper)

# check that we can reload the whole thing
t2m_reloaded = xr.open_dataset(filename, engine='zarr', consolidated=True, chunks={})

Incidentally, taking the mean of all the t2m data with t2m_reloaded.t2m.mean().load() is much faster in this case because it is loading from a zarr rather than a netcdf, as in the case above. Computing the mean from the netcdf stored in the google bucket takes a couple of minutes (as above), while computing the mean t2m from the zarr only takes 16 s.

A notebook demonstrating all this can be found here: https://github.com/ldeo-glaciology/AntPen_NSF_NERC/blob/chunk_edits/RACMO_Loading_JK.ipynb

@jkingslake
Member Author

P.S. I am using the LEAP Pangeo.

@jkingslake
Member Author

@ecglazer, did you end up making any more progress on this?

@jkingslake
Member Author

jkingslake commented Apr 3, 2024

@ecglazer, I have added a notebook on writing the full AP RACMO data to zarr: https://github.com/ldeo-glaciology/AntPen_NSF_NERC/blob/racmo_zarr_JK/merging_RACMO_vars.ipynb

The full AP RACMO dataset can be loaded with

racmo_AP = xr.open_dataset('gs://ldeo-glaciology/RACMO/JK_tests/all_vars_full_1' , engine='zarr', consolidated=True, chunks={}) 

When you get a chance, could you go through the directories in https://console.cloud.google.com/storage/browser/ldeo-glaciology/RACMO/ and delete what you don't need anymore?
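For reference, the directories can also be listed and removed from a notebook with gcsfs (a sketch; the path being deleted is illustrative and token is a write token):

import gcsfs

fs = gcsfs.GCSFileSystem(token=token)

# list the RACMO directories in the bucket
print(fs.ls('ldeo-glaciology/RACMO'))

# remove a directory that is no longer needed (path is illustrative)
fs.rm('ldeo-glaciology/RACMO/some_old_directory', recursive=True)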

@jkingslake
Member Author

@ecglazer, did you get a chance to go through the directories in https://console.cloud.google.com/storage/browser/ldeo-glaciology/RACMO/ and delete what you don't need anymore?
