Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
392 changes: 392 additions & 0 deletions 05_zarr_tools/54_zarr_R_rarr.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,392 @@
---
title: "Access EOPF Sentinel Zarr using `Rarr`"
format: html
execute:
cache: refresh
---

### Introduction

This notebook demonstrates how to **retrieve** remotely stored Zarr data using the `Rarr` package in R. We will explore how to **read and visualize** Zarr data (zarrays) and their metadata. For subsequent analysis and visualization we turn the zarray into `stars` objects which are more suitable for spatial data.

::: {.callout-tip}
This notebook has a sibling, which demonstrates how to access the same data using the GDAL Zarr driver. You can find it [here](./55_zarr_R_gdal.qmd).
:::

### What we will learn

- ✏️ How to edit URLs of Zarr stores to make them readable for GDAL
- 🔎 Which read-functions and arguments to use in `stars` and `terra`
- 🚧 Current limitations of these packages

### Prerequisites

To follow this notebook, you will need to load a STAC item in Zarr format from the [EOPF Zarr STAC Catalog](https://stac.browser.user.eopf.eodc.eu/?.language=en). If you are new to STAC inside the R environment, we suggest to follow the [Access the EOPF Zarr STAC API with R tutorial](https://eopf-toolkit.github.io/eopf-101/51_eopf_stac_r.html). The item we are using for this example mirrors the one used in the [Sentinel-2 L1C MSI Zarr Product Exploration notebook](https://eopf-sample-service.github.io/eopf-sample-notebooks/sentinel-2-l1c-msi-zarr-product-exploration/#introduction) which is hosted on EODC's object storage and is accessible under the URL below.

```{r}
zarr_url = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/S2B_MSIL1C_20250113T103309_N0511_R108_T32TLQ_20250113T122458.zarr"
```

### Import packages

```{r}
#| message: false
#| warning: false

library(Rarr) # reading; BiocManager::install("Rarr")
library(stars) # spatial objects
library(dplyr) # table operations
library(jsonlite) # parsing metadata
library(mapview) # interactive maps
```

## Metadata exploration with Rarr

::: {.callout-important}
In this notebook we show the current capabilities and limitations of `Rarr` in terms of reading Zarr array(s) and their metadata. Successful attempts are marked with a ✅, failed attempts with a ⛔.
:::

Using the `Rarr::zarr_overview` function we can get meta data for all assets in the remote Zarr store as a table. Here list all arrays and their data type, compression, dimensions, etc.

```{r}
zarr_content = zarr_overview(zarr_url, as_data_frame = T) |>
mutate(var = sub(zarr_url, "", path)) |>
select(var, everything(), -path)

filter(zarr_content, grepl("reflectance", var))
```

We can see that all Sentinel-2 bands are present, but split into directories (or groups) based on spatial resolution (10, 20, 60 meters), which have different dimensions (number and size of pixels).

::: {.callout-tip collapse="true"}
## Expand for full Zarr metadata output

```{r}
#| echo: false

print(zarr_content)
```
:::

Now that we have explored the content, let's get some related metadata. `Rarr` provides the `read_zattrs` function for this purpose, which looks for the `.zattrs` file inside the Zarr store.

⛔ Unfortunately we were not successful when applying it to our remote store.
```{r}
#| error: true

read_zattrs(zarr_url)
read_zattrs(file.path(zarr_url, ".zattrs"))
```

✅ As a simple workaround we can read the JSON directly:
```{r}
zattrs = read_json(file.path(zarr_url, ".zattrs"))
names(zattrs)
```

::: {.callout-tip collapse="true"}
## Expand for full Zarr metadata output

```{r}
#| echo: false

print(zarr_content)
```
:::

The same information can be retrieved with `sf::gdal_utils('mdiminfo', ...)`.

## Reading Zarr data with `Rarr`

Reading zarrays from a remote Zarr store with `Rarr` and turning them into useful spatial R objects is not as straightforward as with `GDAL`-based `stars` and `terra` functions. We will demonstrate the current capabilities and limitations of `Rarr` by using a reference object retrieved through the `GDAL` Zarr driver which is loaded in the background.

```{r}
#| echo: false
#| warning: false

# parse GDAL-compatible VSI URL
vsi_url = paste0("ZARR:/vsicurl/", dplyr::as_label(zarr_url))

# read from remote and set CRS
band_variable = "/measurements/reflectance/r60m/b01"
zz_gdal = read_stars(paste(vsi_url, band_variable, sep = ":"))
st_crs(zz_gdal) = st_crs(32632)
```

```{r}
# GDAL-based reference object
zz_gdal
```

We start with an naive attempt to read the entire Zarr store using `Rarr::read_zarr_array`.


⛔ Rarr, like `stars` and `terra`, is not capable of reading multiple arrays from the remote Zarr store simultaneously.

```{r}
#| error: true

read_zarr_array(zarr_url)
```

✅ But if a single band array is selected we are able to access the desired information.

```{r}
band_variable = "/measurements/reflectance/r60m/b01"

zarray = read_zarr_array(paste0(zarr_url, band_variable))
str(zarray)
```

### Raster geometry

The function returns a matrix (2D array) lacking any spatial metadata (e.g. Coordinate Reference System) or map coordinates. Let's plot it together with the reference object for comparison.

```{r}
#| fig-subcap:
#| - "Rarr (raw zarray)"
#| - "GDAL (directly retrieved stars object)"
#| layout-ncol: 2
#| fig-height: 7

image(zarray)
image(zz_gdal$b01)
```


::: {.callout-warning}
## Caution

The raw zarray has flipped X- and Y-dimensions! But we can access X and Y coordinates for each pixel separately and handle the issue by transposing them later (using `t()`).
:::

```{r}
# retrieve X and Y coordinates
zarray_60m_x = paste0(zarr_url, "/measurements/reflectance/r60m/x") |>
read_zarr_array()
zarray_60m_y = paste0(zarr_url, "/measurements/reflectance/r60m/y") |>
read_zarr_array()

range(zarray_60m_x)
range(zarray_60m_y)
```

Regardless of their ordering, are the coordinate values consistent with those retrieved via GDAL?

```{r}
range(st_get_dimension_values(zz_gdal, "x"))
range(st_get_dimension_values(zz_gdal, "y"))
```

✅ Yes!

### Reflectance values

Before building the spatial object we should also check if the actual data values are consistent between both methods.

```{r}
head(t(zarray)[1,]) # flipped zarray
head(zz_gdal$b01[1,]) # reference arrray
```

::: {.callout-warning}
## Caution

By comapring the first few values of each array we can observe 2 important differences:

- `zarray` contains integer values while `zz_gdal` has floating point values scaled with factor 10,000.
- after rescaling, min and max for example still differ (0.1299 vs. 0.2299 / 1.6556 vs. 1.7556), indicating an offset of 0.1.

💡 **Explanation:** *The data provider applies scale and offset to store more memory-efficient integer values. A scale factor of 10,000 preserves 4 decimals, while an offset of 0.1 avoids the multiplication of 0. When reading raw data like that we need to apply scale and offset manually to get the original reflectance values.*
:::

Scale and offset parameters are important metadata, but currently not accessible via `Rarr`. At this point we can only find it by digging deep into the metadata retrieved by `GDAL`:

```{r}
# GDAL-comaptible VSI URL
vsi_url = paste0("ZARR:/vsicurl/", dplyr::as_label(zarr_url))

# get entire metadata
info = sf::gdal_utils("mdiminfo", source = vsi_url, quiet = T) |> fromJSON()

# access scale and offset for 60 meter reflectance bands
(scale = info$groups$measurements$groups$reflectance$groups$r60m$arrays$b01$scale)
(offset = info$groups$measurements$groups$reflectance$groups$r60m$arrays$b01$offset)
```

### Constructing a `stars` object from zarray components

Let's combine all pieces of the puzzle to form a spatial object:

- data array
- scale and offset
- X and Y coordinates
- CRS

```{r}
# read CRS from Zarr attributes
zarr_crs = zattrs$stac_discovery$properties$`proj:epsg`

# band name
var = basename(band_variable)

# read and transpose data matrix
# apply scale and offset
# convert to stars
zz = st_as_stars(t(zarray) * scale + offset) |>
st_set_dimensions(1, names = "x", values = zarray_60m_x, is_raster = TRUE) |>
st_set_dimensions(2, names = "y", values = zarray_60m_y, is_raster = TRUE) |>
setNames(var)

# assign CRS
st_crs(zz) = st_crs(zarr_crs)

# print the object
zz
```

We can compare the newly constructed object with reference read via GDAL. They should have identical geometry and values.

```{r}
identical(zz[[1]] |> str(),
zz_gdal[[1]] |> str())
```

Looks good! Let's visualize both objects.

```{r}
#| fig-subcap:
#| - "Rarr (constructed stars object)"
#| layout-ncol: 1

# plot as maps
plot(zz, axes = TRUE)
```

```{r}
#| fig-subcap:
#| - "GDAL (directly retrieved stars object)"
#| layout-ncol: 1

plot(zz_gdal, axes = TRUE)
```

The identical maps show Sentinel-2 band 1 (coastal aerosol band) reflectance at 60m spatial resolution for north-western Italy with the Alps in the West and the city of Torino in the North-East.

### Wrapping it all up in a function

We can enclose these steps in a single function with some flexibility for variable names and resolutions. This allows us to retrieve any band at any available resolution from the remote Zarr store. Since scale and offset are not accessible via `Rarr`, we need to provide them as function arguments.

```{r}
st_read_zarray = function(path, var, res, scale = 1, offset = 0, ...){

# room for checks and input tests:
# stopifnot valid zarr url / variable / resolution ...

# get metadata including CRS
zattrs = jsonlite::read_json(file.path(path, ".zattrs"))
zarr_crs = zattrs$stac_discovery$properties$`proj:epsg`

# selecte the correct spatial resolution
# zattrs$stac_discovery$properties$`eopf:resolutions`
res_char = switch(as.character(res),
"10" = "r10m",
"20" = "r20m",
"60" = "r60m",
stop("Resolution must be one of 10, 20, or 60.")
)

# get 2D data array and 1D coordinate arrays
# ...: make use of index arguments of read_zarr_array if desired
zarray = file.path(path, "measurements/reflectance", res_char, var) |>
read_zarr_array(...)
zarray_x = file.path(path, "measurements/reflectance", res_char, "x") |>
read_zarr_array(...)
zarray_y = file.path(path, "measurements/reflectance", res_char, "y") |>
read_zarr_array(...)

# transpose matrix before converting to stars, apply scale and offset
# assign x and y coordinates and CRS
z = st_as_stars(t(zarray) * scale + offset) |>
st_set_dimensions(1, names = "x", values = zarray_60m_x, is_raster = TRUE) |>
st_set_dimensions(2, names = "y", values = zarray_60m_y, is_raster = TRUE) |>
setNames(var)
st_crs(z) = st_crs(zarr_crs)
return(z)
}
```


## Read and combine multiple zarrays

✅ Now we can read multiple bands, combine them to a single multi-band object, and visualize them in a static map...

```{r}
system.time({
# scale: 0.0001 = 1 / 10,000
b01 = st_read_zarray(zarr_url, "b01", 60, scale=0.0001, offset = 0.1)
b09 = st_read_zarray(zarr_url, "b09", 60, scale=0.0001, offset = 0.1)
multi_band = c(b01, b09) |> merge(name="band") |> setNames("reflectance")
})

plot(multi_band, axes = TRUE)
```

... or by generating an interactive map:

```{r}
#| echo: false
mapviewOptions(basemaps = c("OpenTopoMap","Esri.WorldImagery",
"CartoDB.Positron"))
```

```{r}
#| message: false
#| cache: true

mapview(multi_band)
```


## Benchmark

Using `Rarr` to read remote Zarr stores is just one of several options. R packages `stars` and `terra` provide functionality, that uses GDAl's Zarr driver in the background. To see if there is a benefit to each of the methods, we can compare their execution time and memory allocation with a little benchmark.

```{r}
# GDAL-compatible VSI URL
vsi_prefix = "ZARR:/vsicurl/"
vsi_url = paste0(vsi_prefix, dplyr::as_label(zarr_url))

# functions for reading a single Sentinel-2 band to memory
f_stars = function(){rs = read_stars(paste(vsi_url, band_variable, sep = ":"))}
f_mdim = function(){sm = read_mdim(vsi_url, band_variable, proxy = FALSE) |>
setNames(basename(band_variable))}
f_rast = function(){tr = terra::rast(vsi_url, band_variable) |> terra::toMemory()}
f_rarr = function(){rr = st_read_zarray(zarr_url, "b01", 60)}
```

```{r}
#| warning: false
#| cache: true

bench::mark(f_stars(), f_mdim(), f_rast(), f_rarr(),
check = F, iterations = 5)
```

The benchmark does not reveal a clear winner in terms of speed. However, `Rarr` seems not to be faster than `GDAL` but allocates more memory compared to `stars` and `terra`.

## 💪 Now it is your turn

- 🔭 **Task**: Supply more arguments to `st_read_zarray` and pass a subset to read only parts on an array. Visit `?Rarr::read_zarr_array` for more details.
- 💾 **Task**: Try writing zarrays to disk using `Rarr::write_zarr_array()`.
- 📖 Read the [Rarr documentation](https://huber-group-embl.github.io/Rarr/articles/Rarr.html).

## Conclusion

In this notebook we have demonstrated how to access remote Zarr stores in R using the `Rarr` package. We have seen that it is possible to read specific data arrays, but `read_zarr_array` lacks the ability of accessing anything other than the raw array. To combine it with a CRS or coordinates we need to leverage a spatial class like `stars` and manually set the relevant metadata. Further, the benchmark did not reveal a significant benefit regarding processing time or memory allocation over "traditional" GDAL-based data retrieval.

## What's next?

Further tests should evaluate wether data retrieved via `Rarr` versus GDAL are identical in terms of spatial extent and offset.

Explore [this related notebook](./55_zarr_R_gdal.qmd) which demonstrates how to access the same Zarr data using the GDAL Zarr driver via the `stars` and `terra` packages.
Loading
Loading