Data cubes for earth observation data
A data cube is a multidimensional array of values, typically used to represent data in a structured format for analysis and querying. While the number of dimensions can vary, three dimensions are most common. This is because much of the data we encounter is two-dimensional, such as spreadsheets or images. When you stack hundreds or thousands of these two-dimensional datasets—like satellite images taken over different months—you create a three-dimensional structure, with time serving as the third dimension. This stacked format resembles a cube, hence the name "data cube."
In a geographic context, the dimensions for width and length correspond to the x- and y-coordinates, i.e., longitude and latitude. Depending on the use case, the third dimension could be depth (e.g., when working with geologic or oceanographic data), altitude (e.g., when analyzing atmospheric data), or time, which is the most common third dimension in earth observation.
Architecture of a data cube; Source: Kopp et al.
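The following minimal sketch illustrates this structure in Python with xarray; the values are random and the variable name "reflectance" is only an example, but the dimension layout (time, y, x) is the one described above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A tiny cube: 12 monthly "images" of 100 x 100 pixels stacked along a time axis
values = np.random.rand(12, 100, 100)

cube = xr.DataArray(
    values,
    dims=("time", "y", "x"),
    coords={"time": pd.date_range("2023-01-01", periods=12, freq="MS")},
    name="reflectance",  # hypothetical variable name
)
print(cube.sizes)  # {'time': 12, 'y': 100, 'x': 100}
```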
When observing the Earth, the amount of data collected and stored is enormous. The organized, array-based structure of a data cube makes it quick and easy for users to find and retrieve the data they need.
Users can make different kinds of queries, as shown in the sketch after this list:
- Request data from a specific location (e.g., a time series of observations from that area).
- Request data from a specific time (e.g., an image from that date).
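A minimal sketch of both query types, assuming a hypothetical NetCDF cube "example_cube.nc" with an "ndvi" variable and dimensions time, y, x:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("example_cube.nc")  # hypothetical file with dims time, y, x

# Query 1: a time series for a specific location (nearest grid cell)
ts = ds["ndvi"].sel(x=7.62, y=51.96, method="nearest")

# Query 2: the full spatial image for a specific date
img = ds["ndvi"].sel(time=np.datetime64("2023-06-01"), method="nearest")

print(ts.sizes)   # remaining dimension: time
print(img.sizes)  # remaining dimensions: y, x
```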
The multidimensional structure enables complex analyses, such as time series analysis and multidimensional querying.
Data cubes are designed to handle large volumes of data, which is important in Earth observation where the datasets can reach petabytes in size. Their scalability ensures that they can accommodate growing datasets without sacrificing performance.
Data cubes can integrate data from various sources and sensors, harmonizing different formats and spatial resolutions (see the resampling sketch below). This helps researchers examine trends, patterns, and relationships in the data across different dimensions, such as geographic location and time.
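One common harmonization step is bringing data onto a shared grid. A minimal sketch, assuming two hypothetical single-band files at different spatial resolutions:

```python
import xarray as xr

fine = xr.open_dataarray("sensor_a_10m.nc")    # hypothetical 10 m data
coarse = xr.open_dataarray("sensor_b_60m.nc")  # hypothetical 60 m data

# Interpolate the fine-resolution data onto the coarse grid (shared x/y coordinates assumed)
fine_on_coarse_grid = fine.interp_like(coarse)

# Stack both layers into one cube along a new "band" dimension
cube = xr.concat([fine_on_coarse_grid, coarse], dim="band")
```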
Once data is stored and processed correctly in a data cube, it can be used directly for computation, e.g. for machine learning.
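As an illustration, a minimal sketch that derives one feature per pixel from the time dimension and clusters the pixels with scikit-learn; the cube file and variable name are the hypothetical ones from the earlier example:

```python
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans

ds = xr.open_dataset("example_cube.nc")    # hypothetical cube with dims time, y, x
features = ds["ndvi"].mean("time")         # one feature per pixel: temporal mean

X = np.nan_to_num(features.values.reshape(-1, 1))  # shape (n_pixels, n_features), NaNs replaced

labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
label_map = labels.reshape(features.shape)  # back onto the y/x grid
```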
The biggest challenge when working with data cubes is the sheer volume of data, which can reach petabytes. Storing such amounts of data locally requires infrastructure whose costs can be prohibitively high. Currently, it is often most effective to use public commercial clouds such as Amazon AWS, Google Cloud, or Microsoft Azure to optimize data storage. Additionally, image compression techniques can help manage storage needs.
Beyond storage solutions, decisions must be made regarding file formats (e.g., JPEG, GeoTIFF) and their respective advantages and disadvantages. It's also essential to plan how to process data within the data cube and consider its overall design for optimal performance.
Decisions regarding processing and design impact performance during computationally intensive tasks. Specific hardware and algorithms are often required to facilitate these computations effectively.
While a data cube with correctly stored data is practical, several data integration steps are required before the data becomes ‘analysis ready data’ (ARD); a cloud-masking sketch follows this list:
- The data must be geometrically correct and in the appropriate coordinate system.
- Atmospheric corrections must be applied to eliminate distortions.
- Cloud masking is necessary to remove clouds, ensuring a clear view of the surface.
- Data from different sources must be standardized to ensure compatibility.
- The time dimension must be accurately represented.
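Cloud masking is one of these steps. A minimal sketch, assuming a hypothetical Sentinel-2 L2A cube with an "SCL" scene-classification band (in the SCL scheme, values 8, 9, and 10 mark medium-probability cloud, high-probability cloud, and thin cirrus):

```python
import xarray as xr

cube = xr.open_dataset("sentinel2_cube.nc")  # hypothetical file

cloud_classes = [8, 9, 10]                   # cloud medium/high probability, thin cirrus
cloud_mask = ~cube["SCL"].isin(cloud_classes)

# Set cloudy pixels to NaN in the spectral bands so later statistics ignore them
masked = cube[["B02", "B03", "B04", "B08"]].where(cloud_mask)
```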
Many users who want to work with data cubes may not possess the technical skills or experience needed to execute complex queries and data analyses effectively. This can lead to inefficient use of the data and limit an organization's ability to derive insights and make decisions based on them.
Legal and ethical challenges in data cubes and Earth observation primarily revolve around privacy concerns and data protection regulations. Organizations must handle the collected data responsibly, respect individuals' privacy, and avoid misuse. To ensure that data cubes follow the FAIR principles, the ISO 19123-1 and 19123-3 standards were introduced and have been updated.
Data cubes have been effectively used for managing water resources by tracking changes in water extent, monitoring wetlands, and assessing flood risks.
Environmental monitoring is another application: time series analyses can be used to monitor forest cover, coastline changes, and urban expansion. This is of interest, for example, to governments that want to track illegal activities.
By providing timely and easily accessible data, data cubes support disaster response efforts. For example, they have been used to track flood risks and assess the aftermath of extreme weather events. This also informs governmental decision-making during and after natural disasters.
Developed by Geoscience Australia and hosted on the National Computational Infrastructure at the Australian National University, the Australian GeoScience Data Cube represents a significant advancement in Earth observation data management. This pixel-aligned collection comprises over 300,000 Landsat scenes from across Australia, all geometrically and spectrally calibrated to accurately reflect the Earth's surface. As the first continental Landsat data cube featuring overlapping temporal scenes, it paves the way for innovative analytical approaches.
Australia uses data from the Data Cube for various purposes, such as scientific research, improving community safety, exploring mineral resources, navigation, and supporting the country's prosperity.
The Australian Geoscience Data Cube laid the foundation for Digital Earth Australia (DEA). DEA is a newer initiative that provides a comprehensive platform, enabling users to access and analyze Earth observation data through user-friendly tools and services. This platform extends the utility of the data cube, offering insights for decision-making across sectors like environmental monitoring, agriculture, and disaster management.
DEA provides free satellite imagery amounting to nearly 1.5 petabytes. The data is 'Analysis Ready Data' (ARD), making analyses as easy as possible for potential users.
Digital Earth Africa (DE Africa) is the largest project under the Committee on Earth Observation Satellites (CEOS) initiative, utilizing Open Data Cube technology to create a comprehensive data cube for the African continent. It includes datasets from Landsat and Sentinel-1, offering a rich collection of satellite imagery and web services. Through DE Africa, users can explore spectral indices over time and perform interactive change detection using spectral and radar data. A notable example is the detection of changes in the Volta River area in central Ghana. The data is easily accessible via the Africa GeoPortal.
Kopp et al.: Achieving the Full Vision of Earth Observation Data Cubes
https://www.mdpi.com/2306-5729/4/3/94 (accessed: 06.10.2024)
OBSERVER: Data cubes: Enabling and facilitating Earth Observation applications
https://www.copernicus.eu/en/news/news/observer-data-cubes-enabling-and-facilitating-earth-observation-applications (accessed: 05.10.2024)
Earth observation data cubes for water resources management
https://www.space4water.org/s4w/web/news/earth-observation-data-cubes-water-resources-management (accessed: 06.10.2024)
Digital Earth Africa
https://www.digitalearthafrica.org (accessed: 06.10.2024)
ISO 19123-1
https://committee.iso.org/sites/tc211/home/projects/projects---complete-list/iso-19123-1.html (accessed: 08.10.2024)
Australian GeoScience Data Cube
https://www.ga.gov.au/scientific-topics/dea/about/open-data-cube (accessed: 04.10.2024)
STAC client for Python:

import json      # for displaying query results
import pystac
import requests  # for interacting with APIs

from pystac import Catalog, get_stac_version  # pystac extension for loading existing catalogs
from pystac_client import Client               # pystac extension for, among other things, searching STACs

root_catalog = Catalog.from_file('https://raw.githubusercontent.com/stac-utils/pystac/main/docs/example-catalog/catalog.json')
root_catalog.describe()  # structure of the catalog

print(f"ID: {root_catalog.id}")
print(f"Title: {root_catalog.title or 'N/A'}")
print(f"Description: {root_catalog.description or 'N/A'}")

collections = list(root_catalog.get_collections())   # get_collections() and other functions are explained in the handout
print(f"Number of collections: {len(collections)}")  # number of available collections
print("Collections IDs:")
for collection in collections:
    print(f"- {collection.id}")

items = list(root_catalog.get_all_items())
print(f"Number of items: {len(items)}")
for item in items:
    print(f"- {item.id}")

item = root_catalog.get_item("LC80140332018166LGN00", recursive=True)  # a single item, used in the following

print(item.geometry)
print(item.bbox)
print(item.datetime)
print(item.collection_id)
item.get_collection()  # query which collection the item belongs to

print(item.common_metadata.instruments)
print(item.common_metadata.platform)
print(item.common_metadata.gsd)

for asset_key in item.assets:  # .assets returns all assets of an item
    asset = item.assets[asset_key]
    print('{}: {} ({})'.format(asset_key, asset.href, asset.media_type))  # asset_key, href and media_type fill the {} placeholders

asset = item.assets['B3']
print(asset.to_dict())  # similar to the query with .format

for asset_key in item.assets:
    asset = item.assets[asset_key]
    asset_url = asset.href
    file_name = asset_key + '.' + asset.media_type.split('/')[-1]

    # request the data from the API
    response = requests.get(asset_url)  # using the requests library

    # save the file
    with open(file_name, 'wb') as f:
        f.write(response.content)

    print(f'{file_name} downloaded.')

catalog_url = 'https://planetarycomputer.microsoft.com/api/stac/v1'
client = Client.open(catalog_url)  # the client interacts with the API endpoint (URL)

search = client.search(
    collections=['sentinel-2-l2a'],
    bbox=[-47.02148, -17.35063, -42.53906, -12.98314],
    datetime='2023-01-01/2023-01-31',
    limit=10
)

items = list(search.items())
print(len(items))
print(items)
item = items[5]
print(f"Item ID: {item.id}")
print(f"Item datetime: {item.datetime}")

for asset_key, asset in item.assets.items():
    print(f"Asset Key: {asset_key}")
    print(f"Asset URL: {asset.href}")
    print(f"Asset Media Type: {asset.media_type}")
STAC client for R:

install.packages("rstac")
install.packages("sf")
install.packages("terra")
install.packages("tibble")

library(terra)
library(sf)
library(tibble)
library(rstac)

stac_url <- "https://planetarycomputer.microsoft.com/api/stac/v1"

s_obj <- stac(stac_url)
str(s_obj)

get_request(s_obj)

s_obj %>% get_request()

conformance_classes <- s_obj %>%
  conformance() %>%
  get_request()
conformance_classes

collections_query <- s_obj %>% collections()

collections_query %>% get_request()

stac_search(
  q = s_obj,
  collections = "usgs-lcmap-conus-v13",
  datetime = "2021-01-01/2021-12-31",
  limit = 10
) %>% get_request()

ashe <- read_sf(system.file("shape/nc.shp", package = "sf"))[1, ]
plot(st_geometry(ashe))

ashe_bbox <- ashe %>%
  st_transform(4326) %>%
  st_bbox()
ashe_bbox

stac_query <- stac_search(
  q = s_obj,
  collections = "usgs-lcmap-conus-v13",
  bbox = ashe_bbox,
  datetime = "2021-01-01/2021-12-31",
  limit = 10
) %>% get_request()
stac_query

signed_stac_query <- items_sign(
  stac_query,
  sign_planetary_computer()  # authentication with the Planetary Computer
)
signed_stac_query

output_directory <- "C:/Users/lraeu/OneDrive/Desktop/Geosoftware II/geosoft2-2024/data"  # adjust to your local directory
assets_download(signed_stac_query, "lcpri", output_dir = output_directory, overwrite = TRUE)

output_file <- file.path("C:/Users/lraeu/OneDrive/Desktop/Geosoftware II/geosoft2-2024/data/lcmap/CU/V13/025011/2021/LCMAP_CU_025011_2021_20220721_V13_CCDC/LCMAP_CU_025011_2021_20220629_V13_LCPRI.tif") %>%
  rast()
plot(output_file)

rast("C:/Users/lraeu/OneDrive/Desktop/Geosoftware II/geosoft2-2024/data/B1.tiff")

ashe %>%
  st_transform(st_crs(output_file)) %>%
  st_geometry() %>%
  plot(add = TRUE, lwd = 3)