diff --git a/05_zarr_tools/54_zarr_R_rarr.qmd b/05_zarr_tools/54_zarr_R_rarr.qmd new file mode 100644 index 0000000..71da0cd --- /dev/null +++ b/05_zarr_tools/54_zarr_R_rarr.qmd @@ -0,0 +1,392 @@ +--- +title: "Access EOPF Sentinel Zarr using `Rarr`" +format: html +execute: + cache: refresh +--- + +### Introduction + +This notebook demonstrates how to **retrieve** remotely stored Zarr data using the `Rarr` package in R. We will explore how to **read and visualize** Zarr data (zarrays) and their metadata. For subsequent analysis and visualization we turn the zarray into `stars` objects which are more suitable for spatial data. + +::: {.callout-tip} +This notebook has a sibling, which demonstrates how to access the same data using the GDAL Zarr driver. You can find it [here](./55_zarr_R_gdal.qmd). +::: + +### What we will learn + +- ✏️ How to edit URLs of Zarr stores to make them readable for GDAL +- 🔎 Which read-functions and arguments to use in `stars` and `terra` +- 🚧 Current limitations of these packages + +### Prerequisites + +To follow this notebook, you will need to load a STAC item in Zarr format from the [EOPF Zarr STAC Catalog](https://stac.browser.user.eopf.eodc.eu/?.language=en). If you are new to STAC inside the R environment, we suggest to follow the [Access the EOPF Zarr STAC API with R tutorial](https://eopf-toolkit.github.io/eopf-101/51_eopf_stac_r.html). The item we are using for this example mirrors the one used in the [Sentinel-2 L1C MSI Zarr Product Exploration notebook](https://eopf-sample-service.github.io/eopf-sample-notebooks/sentinel-2-l1c-msi-zarr-product-exploration/#introduction) which is hosted on EODC's object storage and is accessible under the URL below. + +```{r} +zarr_url = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/S2B_MSIL1C_20250113T103309_N0511_R108_T32TLQ_20250113T122458.zarr" +``` + +### Import packages + +```{r} +#| message: false +#| warning: false + +library(Rarr) # reading; BiocManager::install("Rarr") +library(stars) # spatial objects +library(dplyr) # table operations +library(jsonlite) # parsing metadata +library(mapview) # interactive maps +``` + +## Metadata exploration with Rarr + +::: {.callout-important} +In this notebook we show the current capabilities and limitations of `Rarr` in terms of reading Zarr array(s) and their metadata. Successful attempts are marked with a ✅, failed attempts with a ⛔. +::: + +Using the `Rarr::zarr_overview` function we can get meta data for all assets in the remote Zarr store as a table. Here list all arrays and their data type, compression, dimensions, etc. + +```{r} +zarr_content = zarr_overview(zarr_url, as_data_frame = T) |> + mutate(var = sub(zarr_url, "", path)) |> + select(var, everything(), -path) + +filter(zarr_content, grepl("reflectance", var)) +``` + + We can see that all Sentinel-2 bands are present, but split into directories (or groups) based on spatial resolution (10, 20, 60 meters), which have different dimensions (number and size of pixels). + +::: {.callout-tip collapse="true"} +## Expand for full Zarr metadata output + +```{r} +#| echo: false + +print(zarr_content) +``` +::: + +Now that we have explored the content, let's get some related metadata. `Rarr` provides the `read_zattrs` function for this purpose, which looks for the `.zattrs` file inside the Zarr store. + +⛔ Unfortunately we were not successful when applying it to our remote store. +```{r} +#| error: true + +read_zattrs(zarr_url) +read_zattrs(file.path(zarr_url, ".zattrs")) +``` + +✅ As a simple workaround we can read the JSON directly: +```{r} +zattrs = read_json(file.path(zarr_url, ".zattrs")) +names(zattrs) +``` + +::: {.callout-tip collapse="true"} +## Expand for full Zarr metadata output + +```{r} +#| echo: false + +print(zarr_content) +``` +::: + +The same information can be retrieved with `sf::gdal_utils('mdiminfo', ...)`. + +## Reading Zarr data with `Rarr` + +Reading zarrays from a remote Zarr store with `Rarr` and turning them into useful spatial R objects is not as straightforward as with `GDAL`-based `stars` and `terra` functions. We will demonstrate the current capabilities and limitations of `Rarr` by using a reference object retrieved through the `GDAL` Zarr driver which is loaded in the background. + +```{r} +#| echo: false +#| warning: false + +# parse GDAL-compatible VSI URL +vsi_url = paste0("ZARR:/vsicurl/", dplyr::as_label(zarr_url)) + +# read from remote and set CRS +band_variable = "/measurements/reflectance/r60m/b01" +zz_gdal = read_stars(paste(vsi_url, band_variable, sep = ":")) +st_crs(zz_gdal) = st_crs(32632) +``` + +```{r} +# GDAL-based reference object +zz_gdal +``` + +We start with an naive attempt to read the entire Zarr store using `Rarr::read_zarr_array`. + + +⛔ Rarr, like `stars` and `terra`, is not capable of reading multiple arrays from the remote Zarr store simultaneously. + +```{r} +#| error: true + +read_zarr_array(zarr_url) +``` + +✅ But if a single band array is selected we are able to access the desired information. + +```{r} +band_variable = "/measurements/reflectance/r60m/b01" + +zarray = read_zarr_array(paste0(zarr_url, band_variable)) +str(zarray) +``` + +### Raster geometry + +The function returns a matrix (2D array) lacking any spatial metadata (e.g. Coordinate Reference System) or map coordinates. Let's plot it together with the reference object for comparison. + +```{r} +#| fig-subcap: +#| - "Rarr (raw zarray)" +#| - "GDAL (directly retrieved stars object)" +#| layout-ncol: 2 +#| fig-height: 7 + +image(zarray) +image(zz_gdal$b01) +``` + + +::: {.callout-warning} +## Caution + +The raw zarray has flipped X- and Y-dimensions! But we can access X and Y coordinates for each pixel separately and handle the issue by transposing them later (using `t()`). +::: + +```{r} +# retrieve X and Y coordinates +zarray_60m_x = paste0(zarr_url, "/measurements/reflectance/r60m/x") |> + read_zarr_array() +zarray_60m_y = paste0(zarr_url, "/measurements/reflectance/r60m/y") |> + read_zarr_array() + +range(zarray_60m_x) +range(zarray_60m_y) +``` + +Regardless of their ordering, are the coordinate values consistent with those retrieved via GDAL? + +```{r} +range(st_get_dimension_values(zz_gdal, "x")) +range(st_get_dimension_values(zz_gdal, "y")) +``` + +✅ Yes! + +### Reflectance values + +Before building the spatial object we should also check if the actual data values are consistent between both methods. + +```{r} +head(t(zarray)[1,]) # flipped zarray +head(zz_gdal$b01[1,]) # reference arrray +``` + +::: {.callout-warning} +## Caution + +By comapring the first few values of each array we can observe 2 important differences: + + - `zarray` contains integer values while `zz_gdal` has floating point values scaled with factor 10,000. + - after rescaling, min and max for example still differ (0.1299 vs. 0.2299 / 1.6556 vs. 1.7556), indicating an offset of 0.1. + +💡 **Explanation:** *The data provider applies scale and offset to store more memory-efficient integer values. A scale factor of 10,000 preserves 4 decimals, while an offset of 0.1 avoids the multiplication of 0. When reading raw data like that we need to apply scale and offset manually to get the original reflectance values.* +::: + +Scale and offset parameters are important metadata, but currently not accessible via `Rarr`. At this point we can only find it by digging deep into the metadata retrieved by `GDAL`: + +```{r} +# GDAL-comaptible VSI URL +vsi_url = paste0("ZARR:/vsicurl/", dplyr::as_label(zarr_url)) + +# get entire metadata +info = sf::gdal_utils("mdiminfo", source = vsi_url, quiet = T) |> fromJSON() + +# access scale and offset for 60 meter reflectance bands +(scale = info$groups$measurements$groups$reflectance$groups$r60m$arrays$b01$scale) +(offset = info$groups$measurements$groups$reflectance$groups$r60m$arrays$b01$offset) +``` + +### Constructing a `stars` object from zarray components + +Let's combine all pieces of the puzzle to form a spatial object: + +- data array +- scale and offset +- X and Y coordinates +- CRS + +```{r} +# read CRS from Zarr attributes +zarr_crs = zattrs$stac_discovery$properties$`proj:epsg` + +# band name +var = basename(band_variable) + +# read and transpose data matrix +# apply scale and offset +# convert to stars +zz = st_as_stars(t(zarray) * scale + offset) |> + st_set_dimensions(1, names = "x", values = zarray_60m_x, is_raster = TRUE) |> + st_set_dimensions(2, names = "y", values = zarray_60m_y, is_raster = TRUE) |> + setNames(var) + +# assign CRS +st_crs(zz) = st_crs(zarr_crs) + +# print the object +zz +``` + +We can compare the newly constructed object with reference read via GDAL. They should have identical geometry and values. + +```{r} +identical(zz[[1]] |> str(), + zz_gdal[[1]] |> str()) +``` + +Looks good! Let's visualize both objects. + +```{r} +#| fig-subcap: +#| - "Rarr (constructed stars object)" +#| layout-ncol: 1 + +# plot as maps +plot(zz, axes = TRUE) +``` + +```{r} +#| fig-subcap: +#| - "GDAL (directly retrieved stars object)" +#| layout-ncol: 1 + +plot(zz_gdal, axes = TRUE) +``` + +The identical maps show Sentinel-2 band 1 (coastal aerosol band) reflectance at 60m spatial resolution for north-western Italy with the Alps in the West and the city of Torino in the North-East. + +### Wrapping it all up in a function + +We can enclose these steps in a single function with some flexibility for variable names and resolutions. This allows us to retrieve any band at any available resolution from the remote Zarr store. Since scale and offset are not accessible via `Rarr`, we need to provide them as function arguments. + +```{r} +st_read_zarray = function(path, var, res, scale = 1, offset = 0, ...){ + + # room for checks and input tests: + # stopifnot valid zarr url / variable / resolution ... + + # get metadata including CRS + zattrs = jsonlite::read_json(file.path(path, ".zattrs")) + zarr_crs = zattrs$stac_discovery$properties$`proj:epsg` + + # selecte the correct spatial resolution + # zattrs$stac_discovery$properties$`eopf:resolutions` + res_char = switch(as.character(res), + "10" = "r10m", + "20" = "r20m", + "60" = "r60m", + stop("Resolution must be one of 10, 20, or 60.") + ) + + # get 2D data array and 1D coordinate arrays + # ...: make use of index arguments of read_zarr_array if desired + zarray = file.path(path, "measurements/reflectance", res_char, var) |> + read_zarr_array(...) + zarray_x = file.path(path, "measurements/reflectance", res_char, "x") |> + read_zarr_array(...) + zarray_y = file.path(path, "measurements/reflectance", res_char, "y") |> + read_zarr_array(...) + + # transpose matrix before converting to stars, apply scale and offset + # assign x and y coordinates and CRS + z = st_as_stars(t(zarray) * scale + offset) |> + st_set_dimensions(1, names = "x", values = zarray_60m_x, is_raster = TRUE) |> + st_set_dimensions(2, names = "y", values = zarray_60m_y, is_raster = TRUE) |> + setNames(var) + st_crs(z) = st_crs(zarr_crs) + return(z) +} +``` + + +## Read and combine multiple zarrays + +✅ Now we can read multiple bands, combine them to a single multi-band object, and visualize them in a static map... + +```{r} +system.time({ + # scale: 0.0001 = 1 / 10,000 + b01 = st_read_zarray(zarr_url, "b01", 60, scale=0.0001, offset = 0.1) + b09 = st_read_zarray(zarr_url, "b09", 60, scale=0.0001, offset = 0.1) + multi_band = c(b01, b09) |> merge(name="band") |> setNames("reflectance") +}) + +plot(multi_band, axes = TRUE) +``` + +... or by generating an interactive map: + +```{r} +#| echo: false +mapviewOptions(basemaps = c("OpenTopoMap","Esri.WorldImagery", + "CartoDB.Positron")) +``` + +```{r} +#| message: false +#| cache: true + +mapview(multi_band) +``` + + +## Benchmark + +Using `Rarr` to read remote Zarr stores is just one of several options. R packages `stars` and `terra` provide functionality, that uses GDAl's Zarr driver in the background. To see if there is a benefit to each of the methods, we can compare their execution time and memory allocation with a little benchmark. + +```{r} +# GDAL-compatible VSI URL +vsi_prefix = "ZARR:/vsicurl/" +vsi_url = paste0(vsi_prefix, dplyr::as_label(zarr_url)) + +# functions for reading a single Sentinel-2 band to memory +f_stars = function(){rs = read_stars(paste(vsi_url, band_variable, sep = ":"))} +f_mdim = function(){sm = read_mdim(vsi_url, band_variable, proxy = FALSE) |> + setNames(basename(band_variable))} +f_rast = function(){tr = terra::rast(vsi_url, band_variable) |> terra::toMemory()} +f_rarr = function(){rr = st_read_zarray(zarr_url, "b01", 60)} +``` + +```{r} +#| warning: false +#| cache: true + +bench::mark(f_stars(), f_mdim(), f_rast(), f_rarr(), + check = F, iterations = 5) +``` + +The benchmark does not reveal a clear winner in terms of speed. However, `Rarr` seems not to be faster than `GDAL` but allocates more memory compared to `stars` and `terra`. + +## 💪 Now it is your turn + +- 🔭 **Task**: Supply more arguments to `st_read_zarray` and pass a subset to read only parts on an array. Visit `?Rarr::read_zarr_array` for more details. +- 💾 **Task**: Try writing zarrays to disk using `Rarr::write_zarr_array()`. +- 📖 Read the [Rarr documentation](https://huber-group-embl.github.io/Rarr/articles/Rarr.html). + +## Conclusion + +In this notebook we have demonstrated how to access remote Zarr stores in R using the `Rarr` package. We have seen that it is possible to read specific data arrays, but `read_zarr_array` lacks the ability of accessing anything other than the raw array. To combine it with a CRS or coordinates we need to leverage a spatial class like `stars` and manually set the relevant metadata. Further, the benchmark did not reveal a significant benefit regarding processing time or memory allocation over "traditional" GDAL-based data retrieval. + +## What's next? + +Further tests should evaluate wether data retrieved via `Rarr` versus GDAL are identical in terms of spatial extent and offset. + +Explore [this related notebook](./55_zarr_R_gdal.qmd) which demonstrates how to access the same Zarr data using the GDAL Zarr driver via the `stars` and `terra` packages. \ No newline at end of file diff --git a/05_zarr_tools/55_zarr_R_gdal.qmd b/05_zarr_tools/55_zarr_R_gdal.qmd new file mode 100644 index 0000000..f512827 --- /dev/null +++ b/05_zarr_tools/55_zarr_R_gdal.qmd @@ -0,0 +1,224 @@ +--- +title: "Access EOPF Sentinel Zarr via GDAL in `stars` and `terra`" +format: html +execute: + cache: refresh +--- + +### Introduction + +This notebook demonstrates how to **retrieve** remotely stored Zarr data using the Zarr GDAL driver in R. We will explore how to **read and visualize** Zarr data (zarrays) and their metadata using the `stars`, `sf` and `terra` packages as well as their current limitations. + +::: {.callout-tip} +This notebook has a sibling, which demonstrates how to access the same Zarr data using the `Rarr` package. You can find it [here](./54_zarr_R_rarr.qmd). +::: + +### What we will learn + +- ✏️ How to edit URLs of Zarr stores to make them readable for GDAL +- 🔎 Which read-functions and arguments to use in `stars` and `terra` +- 🚧 Current limitations of these packages + +### Prerequisites + +To follow this notebook, you will need to load a STAC item in Zarr format from the [EOPF Zarr STAC Catalog](https://stac.browser.user.eopf.eodc.eu/?.language=en). If you are new to STAC inside the R environment, we suggest to follow the [Access the EOPF Zarr STAC API with R tutorial](https://eopf-toolkit.github.io/eopf-101/51_eopf_stac_r.html). The item we are using for this example mirrors the one used in the [Sentinel-2 L1C MSI Zarr Product Exploration notebook](https://eopf-sample-service.github.io/eopf-sample-notebooks/sentinel-2-l1c-msi-zarr-product-exploration/#introduction) which is hosted on EODC's object storage and is accessible under the URL below. + +```{r} +zarr_url = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/S2B_MSIL1C_20250113T103309_N0511_R108_T32TLQ_20250113T122458.zarr" +``` + +### Import packages + +```{r} +#| message: false +#| warning: false + +library(stars) # reading; imports sf too +library(terra) # reading +library(jsonlite) # parsing metadata +library(mapview) # interactive maps +``` + + +## GDAL's Zarr driver + +Common R-packages for raster data like `stars` and `terra` can leverage GDAL's raster drivers to read data directly from remote locations. This is done using GDAL's virtual file system (VSI) capabilities, specifically the `/vsicurl/` prefix for accessing files over HTTP(S). + +For Zarr data GDAL provides a dedicated Zarr driver that requires an additional `ZARR:`-prefix to the VSI URL. Finally, the complete prefix needs to be quoted: `"ZARR:/vsicurl/"`. + +```{r} +vsi_prefix = "ZARR:/vsicurl/" +vsi_url = paste0(vsi_prefix, dplyr::as_label(zarr_url)) # needs special quoting +print(vsi_url) +``` + +## Metadata exploration with GDAL + +With the VSI URL pointing to the remote Zarr store we can now use GDAL utilities to explore the metadata of the Zarr store. The `sf::gdal_utils` function provides an interface to GDAL command-line utilities. We use the `mdiminfo` command to retrieve metadata about the multidimensional arrays (zarrays) contained within the Zarr store. We only print the top-level names of the arrays here. + +```{r} +metadata = sf::gdal_utils("mdiminfo", source = vsi_url, quiet = T) |> + fromJSON() +names(metadata) +``` + +::: {.callout-tip collapse="true"} +## Expand for full GDAL MDIM metadata output +```{r} +#| echo: false + +print(metadata) +``` +::: + + +## Reading Zarr data with `stars` and `terra` + +::: {.callout-important} +In this section we show the current capabilities and limitations of `stars` and `terra` when it comes to reading Zarr array(s). We test the functions `read_stars` and `read_mdim` from `stars` and `rast` from the `terra`package. +Successful attempts are marked with a ✅, failed attempts with a ⛔. +::: + +First we define the path to a specific data array (band 1, 60 meter resolution) inside the Zarr store which can be understood as a single raster layer. + +```{r, error=T} +band_variable = "/measurements/reflectance/r60m/b01" +``` + +### `stars::read_stars()` + +⛔ Traditional approach for reading all bands/layers and specifying a driver. +```{r} +#| error: true + +r = read_stars(vsi_url, driver = "ZARR") +``` + +⛔ Traditional approach for reading all bands/layers and specifying a sub-dataset (integer). +```{r} +#| error: true + +r = read_stars(vsi_url, sub = 1) +``` + +⛔ Traditional approach for reading all bands/layers and specifying a sub-dataset (path). +```{r} +#| error: true + +r = read_stars(vsi_url, sub = band_variable) +``` + +✅ Constructing the full zarray path from prefix, URL, and band variable. + +```{r} +(r = read_stars(paste(vsi_url, band_variable, sep = ":"))) +st_crs(r) # NA: empty +``` + +This method successfully reads the specified band from the remote Zarr store. + +:::{.callout-warning} +Note that the coordinate reference system (CRS) is not automatically recognized and needs to be set manually. We can look it up in the metadata where it is listed as one of the STAC properties. +::: + +```{r} +(crs = metadata$attributes$stac_discovery$properties$`proj:epsg`) +st_crs(r) = crs +``` + +Our data is in CRS _WGS 84 / UTM zone 32N (EPSG:32632)_. +We can now safely plot it as a static map... + +```{r} +system.time(plot(r, axes = TRUE)) +``` + +... or is an interactive visualization: + +```{r} +#| message: false +#| echo: false + +mapviewOptions(basemaps = c("OpenTopoMap","Esri.WorldImagery", + "CartoDB.Positron")) +``` + +```{r} +mapview(r) +``` + +--- + +### `stars::read_mdim()` + +⛔ Traditional approach for reading all bands/layers at once. "?" should return a list of array possible names. +```{r} +#| error: true + +m = read_mdim(vsi_url) +m = read_mdim(vsi_url, variable = "?") # query array names +``` + + +✅ Specifying the full zarray path from prefix, URL, and band variable. Read as a proxy object (just metadata; values are accessed lazily when required). +```{r} +system.time({ + (m = read_mdim(vsi_url, variable = band_variable, proxy = TRUE)) + }) # fast: only reads metadata + +st_crs(m) # NA: empty +system.time(plot(m, axes = T)) # slow: only here the full array is downloaded +``` + +As for `read_stars()`, the CRS is not automatically recognized and needs to be set manually. + +--- + +### `terra::rast()` + +⛔ Traditional approach for reading all bands/layers at once. + +```{r, error=T} +(tr = terra::rast(vsi_url)) +tr$b02 + +names(tr) # has non-unique names, e.g. three times "b02" +``` + +`terra` reads some of the available bands, but fails to name them uniquely. It likely searches for band names like "b02" and finds the first group of data arrays by order, in this case the 10-meter bands (2, 3, 4, and 8 aka R,G,B, and NIR). The Zarr store contains multiple arrays named after each band such as `/measurements/reflectance/r10m/b02`, a corresponding mask (e.g. `quality/mask/r10m/b02`) and the detector footprint (e.g. `/conditions/mask/detector_footprint/r10m/b02`). All of these zarrays share the same name, `b02`, leading to non-unique layer names in the resulting `SpatRaster` object. When constructing unique names we can at least visualize the data. + +```{r} +# rename layers to (not meaningful) unique names +names(tr) = paste0(letters[1:nlyr(tr)], names(tr)) +names(tr) + +plot(tr) +``` + +✅ Specifying the full zarray path from prefix, URL, and band variable as sub-dataset. Read as proxy object. +```{r} +system.time({ + tr = rast(vsi_url, subds = band_variable) + })# fast: only reads metadata + +crs(tr) # "": empty +``` + +As for the `stars`-functions, the CRS is not automatically recognized and needs to be set manually. + +```{r} +system.time(plot(tr)) # slow: only here the full array is downloaded +``` + +## 💪 Now it is your turn + +- 🔍 **Task**: Explore the metadata of other Zarr stores using GDAL's `mdiminfo` command via `sf::gdal_utils()`. Store the JSON output in an R-object and find entries that store relevant metadata, such as the CRS, bounding box, or sensor-related information. +- 📖 Read the [GDAL Zarr driver documentation](https://gdal.org/en/stable/drivers/raster/zarr.html) to learn more about opening options. + +## Conclusion + +In this notebook we have demonstrated how to access remote Zarr stores using GDAL's Zarr driver in R with the `stars` and `terra` packages. We have seen that while it is possible to read specific data arrays, there are some limitations, such as the lack of automatic CRS recognition and challenges with reading multiple arrays simultaneously. + +## What's next? + +Explore [this related notebook](./54_zarr_R_rarr.qmd) which demonstrates how to access the same Zarr data using the `Rarr` package and building a `stars` object from it. diff --git a/deployment/environment.yaml b/deployment/environment.yaml index fb258ce..37e1fd4 100644 --- a/deployment/environment.yaml +++ b/deployment/environment.yaml @@ -11,5 +11,8 @@ dependencies: - r-fs=1.6.6 - r-lobstr=1.1.3 - r-biocmanager + - r-mapview=2.11.4 + - r-jsonlite=2.0.0 + # - bioconductor-rarr==1.10.1 - Currently not supported in bioconda so installed manually in Dockerfile_r - sqlite=3.51.2 # - bioconductor-rarr==1.10.1 - Currently not supported in bioconda so installed manually in Dockerfile_r