---
execute:
eval: false
---
# Making the most of Cloud Optimized formats
An excellent introduction to cloud-native geospatial formats can be found [here](https://guide.cloudnativegeo.org/).
These cloud-optimized formats allow parallel and partial reads from the data source: an area of interest can be subset from a raster, or specific attributes and geometries can be extracted from a vector, without downloading and processing the entire dataset.
This chapter provides an introduction to using some of these formats in R and Python. Conveniently, many of them are processed the same way as their non-cloud-optimized counterparts. Note that these cloud-optimized formats are still very young, and many tools and functions for working with them are still being developed. This, alongside a very active community, means that this chapter will remain very dynamic as new tools become available and the formats mature.
## Cloud Optimized GeoTIFFs (COGs)
As the name implies, this is the cloud-optimized version of a GeoTIFF. It uses the same (.tif) file extension but contains extra metadata and internal overviews. A COG behaves exactly like a normal GeoTIFF and is backward compatible with standard GeoTIFF readers. To get the most from this format when using GDAL-based tools, [GDAL Virtual File Systems](https://gdal.org/user/virtual_file_systems.html) should be used to read/write data from the cloud. To do so, prepend one of the following to the URL, depending on where the data is stored:
- '/vsicurl/' for http/https/ftp URLs and public AWS/GCS buckets
- '/vsis3/' for files in S3 buckets
- '/vsigs/' for files in Google Cloud Storage
- '/vsiaz/' for files in Azure cloud storage
These prefixes work for both raster and vector data and can be used in R, QGIS, the GDAL CLI, etc.
```{r}
library(terra)
# on private S3 bucket
private_cog <- rast('/vsis3/digital-atlas/MapSpam/intermediate/spam2017V2r3_SSA_V_TA-vop_tot.tif')
# on public S3 bucket
Sys.setenv(AWS_NO_SIGN_REQUEST = TRUE)
public_cog <- rast('/vsis3/copernicus-dem-30m/Copernicus_DSM_COG_10_S90_00_W172_00_DEM/Copernicus_DSM_COG_10_S90_00_W172_00_DEM.tif')
# from an https link.
https_cog <- rast('/vsicurl/https://esa-worldcover.s3.eu-central-1.amazonaws.com/v200/2021/map/ESA_WorldCover_10m_2021_v200_S15E027_Map.tif')
# same process with stars
library(stars)
private_s3_cog <- read_stars('/vsis3/digital-atlas/MapSpam/intermediate/spam2017V2r3_SSA_H_TA-ha_tot.tif', proxy = TRUE)
```
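Writing a COG from R goes through the same GDAL machinery. Below is a minimal sketch using terra's `writeRaster()` with the GDAL "COG" driver (available in GDAL >= 3.1); the input raster and output path are placeholders.
```{r}
library(terra)

# a small throwaway raster as an example input
r <- rast(nrows = 100, ncols = 100, vals = runif(100 * 100))

# write it as a Cloud Optimized GeoTIFF; creation options are passed
# straight to the GDAL COG driver via the gdal argument
writeRaster(r, "example_cog.tif", filetype = "COG",
            gdal = c("COMPRESS=DEFLATE"), overwrite = TRUE)
```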
## ZARR
This format fills the space occupied by netCDF, HDF5, and similar multi-dimensional data cubes. Currently, the ZARR ecosystem is much more developed in Python; however, there is an active push to bring it to other languages. GDAL has some functionality for working with multi-dimensional arrays and ZARR, but it is limited, which restricts its use in the R spatial world for the time being. In R, Zarr datasets are best accessed with the stars package, although with limitations and caveats.
```{r}
# note that read_mdim currently reads the full data set into memory,
# so machines with less than 32 GB of RAM will likely not manage this store
library(stars)
Sys.setenv(AWS_NO_SIGN_REQUEST = TRUE)
zarr_store = 'ZARR:"/vsis3/cmip6-pds/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/day/psl/gr1/v20180701/"'
terra::sds(zarr_store)
# info = gdal_utils("mdiminfo", zarr_store, quiet = TRUE)
# jsonlite::fromJSON(info)$dimensions
# zarr <- read_mdim(zarr_store, count = c(NA, NA, 97820))
```
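If only part of the cube is needed, the read can be limited with `read_mdim()`'s `offset`, `count`, and `step` arguments so that far less than the full array is pulled into memory. A minimal sketch, assuming the dimension order is lon, lat, time as in the commented call above:
```{r}
library(stars)
Sys.setenv(AWS_NO_SIGN_REQUEST = TRUE)

zarr_store = 'ZARR:"/vsis3/cmip6-pds/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/day/psl/gr1/v20180701/"'

# read only the first 365 daily time steps over the full lon/lat grid
# (NA in count means "take the whole dimension")
psl_subset <- read_mdim(zarr_store, count = c(NA, NA, 365))
```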
For the time being, the Python ecosystem is much more developed for ZARR. Xarray is the primary package for working with ZARR and other multidimensional data.
```{python, python.reticulate = FALSE}
import s3fs
import xarray as xr #pip install xarray[complete] for use with ZARR data
# Loading the data
s3 = s3fs.S3FileSystem(profile="ca_zarr") #a profile saved in the .aws/credentials file
def open_s3_zarr(zarr_s3path: str, s3: s3fs.core.S3FileSystem) -> xr.Dataset:
    store = s3fs.S3Map(root=zarr_s3path, s3=s3)
    return xr.open_zarr(store)

humidityZR = open_s3_zarr(
    's3://climate-action-datalake/zone=raw/source=agera5/variable=relativehumidity.zarr/',
    s3)
# extract the values at the grid cell nearest to a point of interest
lat = 7.65
lon = -37.33
point = humidityZR.sel(lat = lat, lon = lon, method = 'nearest')
vals = point.Relative_Humidity_2m_12h.values
point.Relative_Humidity_2m_12h.plot()
```
## GeoParquet
This is one of the main cloud-optimized formats for geospatial vector datasets. With GDAL 3.8 or newer, the current GeoParquet 1.0.0 spec is used; **before GDAL 3.5 there is no GeoParquet support at all.** The same virtual file system prefixes used for rasters can be used to access these files. For Python users, geopandas supports the latest 1.0.0 GeoParquet spec.
For R users:
```{r}
# This section is expected to change rapidly upon release of the 'geoparquet-r' package.
### Using terra (sf is similar; see the sketch after this chunk)
library(terra)
print(paste("GDAL version:", gdal()))
if (any(grepl("Parquet", gdal(drivers = TRUE)$name))) {
  print("Parquet driver available")
}
pq_uri <- '/vsis3/digital-atlas/risk_prototype/data/exposure/crop_ha_adm0_sum.parquet'
s3_parquet <- vect(pq_uri)
# or to filter it by extent while reading
aoi = ext(c(-4.5, 51.41, -34.83, -1.6))
ext_parquet <- vect(pq_uri, extent = aoi) # filter = <SpatVector> can be used for a spatial filter instead
# or filter and perform operations on columns through SQL query
# NOTE: these methods can also be used for any vector format (.shp, .gpkg, etc)
sql_query <- r"(
SELECT "sum.wheat" as wheat, admin_name
FROM crop_ha_adm0_sum
WHERE admin_name LIKE '%N%' OR "sum.wheat" > 5000
)"
queried_parquet <- vect(pq_uri, query = sql_query)
# to write a parquet
example <- vect(ext(c(-4.5, 51.41, -34.83, -1.6)), crs = 'EPSG:4326')
x <- writeVector(example, '/vsis3/s3/path/name.parquet', filetype = 'parquet')
```
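Since sf also reads vectors through GDAL, the same URIs and pushed-down SQL queries work there as well. A minimal sketch, assuming the local GDAL build (>= 3.5) includes the Parquet driver:
```{r}
### Using sf
library(sf)

pq_uri <- '/vsis3/digital-atlas/risk_prototype/data/exposure/crop_ha_adm0_sum.parquet'

# read the full layer into an sf data frame
s3_parquet_sf <- st_read(pq_uri)

# or push an OGR SQL query down to GDAL while reading
queried_sf <- st_read(
  pq_uri,
  query = "SELECT * FROM crop_ha_adm0_sum WHERE admin_name LIKE '%N%'"
)
```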
```{r}
### Using geoparquet-R
# Package currently in development
### Using sfarrow
# Package is deprecated and its use is not encouraged unless it is the only option
# Note: an error may occur if a geoparquet was written with the newest 1.0.0 spec
library(sfarrow)
# To return the full parquet in memory
sf_pq <- sfarrow::st_read_parquet("s3://digital-atlas/risk_prototype/data/exposure/crop_ha_adm0_sum.parquet")
# Or access and query the parquet out of memory first
# See the next section for more on querying with arrow datasets
library(arrow)
pq_ds <- arrow::open_dataset('s3://digital-atlas/risk_prototype/data/hazard_mean/annual/haz_means_adm2.parquet')
filtered_sf <- pq_ds |>
  dplyr::filter(iso3 == 'UGA') |>
  sfarrow::read_sf_dataset(find_geom = TRUE)
```
## PARQUET/ARROW
This is the cloud-optimized answer to tabular datasets such as CSV, TSV, and Excel files. The arrow package for R provides the main interface to this format and is very well documented. If you want to understand and take advantage of everything this format and package can offer, check out the [Arrow R package documentation](https://arrow.apache.org/docs/r/index.html).
```{r}
### Using the R arrow package
library(arrow)
uri <- 's3://digital-atlas/risk_prototype/data/hazard_risk_vop_ac/annual/haz_risk_vop_ac_reduced.parquet'
# to read the full parquet into memory
parquet_tbl <- read_parquet(uri) # returns a normal data frame of the parquet
# to access and query the parquet out of memory
ds <- open_dataset(uri, format = "parquet")
ds_scan <- ds$NewScan()
ds_scan$Filter(Expression$field_ref("admin0_name") == "Kenya")
filtered <- ds_scan$Finish()$ToTable() # returns an arrow Table
filtered_df <- as.data.frame(filtered)
# Arrow was also designed to play well with dplyr verbs on out-of-memory tables
library(dplyr)
ds <- read_parquet(uri, as_data_frame = FALSE) #open_dataset also works here
filtered_df <- ds |>
  filter(admin0_name == "Lesotho") |>
  collect()
#See the next section to learn how to query these using native AWS s3 methods
```
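Writing works much the same way: `write_parquet()` writes a single file, while `write_dataset()` can split a table into a partitioned directory of parquet files, which is what makes later filtered reads cheap. A quick sketch with a throwaway data frame; the output paths are placeholders.
```{r}
library(arrow)

df <- data.frame(
  admin0_name = c("Kenya", "Kenya", "Lesotho"),
  crop        = c("wheat", "maize", "wheat"),
  value       = c(100, 250, 40)
)

# a single parquet file
write_parquet(df, "example.parquet")

# a partitioned dataset: one folder per admin0_name, readable with open_dataset()
write_dataset(df, "example_ds", format = "parquet", partitioning = "admin0_name")
```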
## Running SQL queries on the cloud for CSV, JSON, and parquet datasets
AWS S3 allows users to query data from CSV, Parquet, or JSON files using SQL directly on the cloud (S3 Select). This allows CSV and JSON files, which are not necessarily considered "cloud-optimized", to be accessed and queried very quickly from the cloud. It can also be used for parquet files, although the returned data will always be in either CSV or JSON format. [This website](https://nebius.ai/docs/storage/concepts/s3-select-language) is a useful guide to the S3 Select SQL syntax.
```{r}
library(paws.storage)
bucket <- paws.storage::s3()
sql_query <- "
SELECT scenario, admin0_name
FROM S3Object
WHERE admin0_name= 'Lesotho'
"
# Note that 'S3Object', not the file path/key, is the table name in "FROM"
aws_result <- bucket$select_object_content(
  Bucket = 'digital-atlas',
  Key = 'MapSpam/raw/',
  Expression = sql_query,
  ExpressionType = 'SQL',
  InputSerialization = list(
    'CSV' = list(FileHeaderInfo = "USE")
  ),
  OutputSerialization = list(
    'CSV' = list(
      QuoteFields = "ASNEEDED"
    )
  )
)
data <- read.csv(text = aws_result$Payload$Records$Payload, header = FALSE)
data
### Or query a parquet from S3
# As the query may have a mix of quotes in it, the r"()" raw string syntax can be useful
# In this case, 'value' is both a reserved S3 Select word and a column name;
# without the surrounding "" it will not be interpreted as a column
# Some more complex queries with aggregation are also allowed
sql_query <- r"(
SELECT SUM("value") AS total_vop, AVG("value") AS avg_vop
FROM S3Object
WHERE exposure = 'vop' AND crop = 'wheat'
)"
output <- bucket$select_object_content(
  Bucket = 'digital-atlas',
  Key = 'risk_prototype/data/exposure/exposure_adm_sum.parquet',
  Expression = sql_query,
  ExpressionType = 'SQL',
  InputSerialization = list(
    'Parquet' = list()
  ),
  OutputSerialization = list(
    'CSV' = list(
      QuoteFields = "ASNEEDED"
    )
  )
)
data <- read.csv(text = output$Payload$Records$Payload, header = FALSE)
data
```
Here are some example queries which could be of use:
```{sql}
-- Selects all columns where admin0 is Tanzania
SELECT *
FROM S3Object
WHERE admin0_name = 'Tanzania'
-- Selects first 5 rows (similar to head())
SELECT *
FROM S3Object
LIMIT 5
-- Calculates the average wheat vop and the total population in a dataset
-- Again, note that "value" is quoted because it is both a reserved word and a column name
SELECT AVG("value") as avg_wheat_vop, SUM(total_pop) as all_population
FROM S3Object
WHERE exposure = 'vop' AND crop = 'wheat'
-- Selects the unique crop names in a dataset
SELECT DISTINCT crop
FROM S3Object
```
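To avoid repeating the serialization boilerplate for every query, the paws call can be wrapped in a small helper. This is just an illustrative sketch for CSV objects; the function name and the commented example bucket/key are placeholders.
```{r}
library(paws.storage)

# hypothetical convenience wrapper around S3 Select for CSV objects
s3_select_csv <- function(bucket, key, sql) {
  s3 <- paws.storage::s3()
  res <- s3$select_object_content(
    Bucket = bucket,
    Key = key,
    Expression = sql,
    ExpressionType = 'SQL',
    InputSerialization = list('CSV' = list(FileHeaderInfo = "USE")),
    OutputSerialization = list('CSV' = list(QuoteFields = "ASNEEDED"))
  )
  read.csv(text = res$Payload$Records$Payload, header = FALSE)
}

# e.g. run one of the example queries above against a CSV object
# s3_select_csv('digital-atlas', 'path/to/file.csv', "SELECT * FROM S3Object LIMIT 5")
```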