Cirrus geospatial pipeline #12

matthewhanson · 2020-08-31T22:39:01Z

Hey all, @scottyhq mentioned this forge thing a few weeks ago, and @rsignell-usgs suggested I post here. We recently open-sourced Cirrus, an AWS pipeline for processing geospatial data. On the surface it sounds similar to Pangeo Forge, but after some reading it seems like they are pretty different things, although there might be some overlapping efforts we could discuss.

What is Cirrus?
Cirrus is a mostly serverless AWS architecture that makes use of DynamoDB, Lambda, SNS, SQS, Step Functions, and Batch.
Cirrus is meant for scaling up, and for doing both historical and ongoing real-time processing of geospatial data via STAC metadata and assets.

See the repo and docs on architecture and usage.

What can you use Cirrus for?
Cirrus has served as the backend processing architecture for a few projects. With it you can:

Fetch metadata from another source, transform to STAC and publish it
Fetch STAC metadata from an API or an S3 bucket (or S3 inventory) and perform processing on assets such as copying, converting to COG, or generating preview images and thumbnails
Count how many jobs are in a state of PROCESSING, COMPLETED, or FAILED by datetime or collection via an API
Monitor and track jobs through the system, with links to the original input and to the complete Step Function execution logs for all tasks within a workflow
Rerun jobs based on last state, datetime, or collection
Publish STAC metadata via SNS to be consumed by something like STAC-server
Maintain data provenance when transforming data by including "derived_from" and "copied_from" links back to the source STAC metadata
Chain together Cirrus workflows by providing default processes for collections. Cirrus can consume its own published data and assign a new workflow based on the Collection the Item belongs to, e.g., L0->L1->L2
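
The provenance and chaining items in the list above could look roughly like this in Python. This is a minimal sketch, not Cirrus code: the functions `make_derived_item` and `next_workflow`, the `DEFAULT_WORKFLOWS` table, the workflow names, the Item ID, and the example URL are all hypothetical. Only the "derived_from" link relation and the two collection names come from this post.

```python
import copy

# Hypothetical mapping from a collection ID to the workflow that should
# process new Items in it (a stand-in for per-collection default processes).
DEFAULT_WORKFLOWS = {
    "sentinel-s2-l2a": "cog-convert",
    "sentinel-s2-l2a-cogs": "make-thumbnails",
}


def make_derived_item(source_item, source_href, new_collection):
    """Return a copy of a STAC Item dict that links back to its source."""
    item = copy.deepcopy(source_item)
    item["collection"] = new_collection
    # Replace the links inherited from the source with a provenance link.
    item["links"] = [{"rel": "derived_from", "href": source_href}]
    return item


def next_workflow(item):
    """Pick the workflow for an Item from its collection, or None."""
    return DEFAULT_WORKFLOWS.get(item.get("collection"))


# Hypothetical source Item; a real STAC Item carries many more fields.
source = {
    "type": "Feature",
    "id": "S2A_EXAMPLE",
    "collection": "sentinel-s2-l2a",
    "links": [],
}
derived = make_derived_item(
    source,
    "https://example.com/items/S2A_EXAMPLE.json",
    "sentinel-s2-l2a-cogs",
)
print(derived["links"])
print(next_workflow(derived))
```

Because the derived Item lands in a different collection, looking up its collection again yields the next workflow in the chain, which is the idea behind L0->L1->L2.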

Sentinel-2 COG Public Dataset
I've used Cirrus to create the new Sentinel-2 COGs. This started off for just Africa, but we are currently more than halfway through processing the entire global archive from the original JP2K format, about 6 million scenes. This past weekend I generated nearly 2 million converted Sentinel scenes using Cirrus. Each scene is 17 COGs, so that's 34 million file conversions in a couple of days. All failures and successes are tracked, so that errors can be identified and dealt with, and later runs don't reprocess completed data.
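
To illustrate why tracking state lets a rerun skip finished work, here is a minimal sketch. It is not how Cirrus is implemented: Cirrus keeps state in DynamoDB, while this uses a plain in-memory dict, and `run_batch`, `flaky_convert`, `counting_convert`, and the scene IDs are hypothetical names. The state names match the ones listed earlier in this post.

```python
from enum import Enum


class State(Enum):
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


def run_batch(item_ids, process, state_db):
    """Process each item, recording its state; skip completed items."""
    for item_id in item_ids:
        if state_db.get(item_id) is State.COMPLETED:
            continue  # finished in an earlier run, do not reprocess
        state_db[item_id] = State.PROCESSING
        try:
            process(item_id)
        except Exception:
            state_db[item_id] = State.FAILED
        else:
            state_db[item_id] = State.COMPLETED
    return state_db


def flaky_convert(item_id):
    # Hypothetical task that fails for one scene, to show FAILED tracking.
    if item_id == "scene-2":
        raise RuntimeError("conversion failed")


calls = []


def counting_convert(item_id):
    # Hypothetical task that records which items it was asked to process.
    calls.append(item_id)


state_db = {}  # in-memory stand-in for a persistent state database
run_batch(["scene-1", "scene-2"], flaky_convert, state_db)
first_run = dict(state_db)
run_batch(["scene-1", "scene-2"], counting_convert, state_db)  # rerun
print(first_run)
print(calls)
```

On the rerun only the scene that failed the first time is handed to the task again; the completed one is skipped.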

First I indexed the existing AWS Public datasets: 20 million sentinel-s2-l1c and sentinel-s2-l2a scenes in JP2K, by converting the Sentinel metadata (tileInfo.json) into STAC Items and indexing those. Then, using that STAC API, I fetched the L2A scenes and converted them to COGs, linking back to the L2A record for provenance, and published/indexed those back in Earth-Search (the sentinel-s2-l2a-cogs collection). All of these steps (publishing, converting, copying) were done with Cirrus.

The COG archive is available through Earth-search: https://earth-search.aws.element84.com/v0
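
For anyone who wants to poke at the archive, a search request could be put together roughly as follows. This is a sketch under assumptions: the base URL and collection name are from this post, but the `/search` path and the body fields (`collections`, `bbox`, `datetime`, `limit`) follow the general STAC API item search convention and have not been checked against this particular deployment. The helper `build_search_request`, the bounding box, and the date range are made up for illustration. The request is only built here, not sent.

```python
import json
import urllib.request

# Base URL from the post; the "/search" path is assumed from the STAC API spec.
EARTH_SEARCH = "https://earth-search.aws.element84.com/v0"


def build_search_request(collection, bbox, datetime_range, limit=10):
    """Build (but do not send) a POST request for a STAC API item search."""
    body = {
        "collections": [collection],
        "bbox": bbox,  # [west, south, east, north]
        "datetime": datetime_range,  # "start/end" in RFC 3339
        "limit": limit,
    }
    return urllib.request.Request(
        EARTH_SEARCH + "/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "sentinel-s2-l2a-cogs",
    [36.7, -1.4, 37.0, -1.2],  # arbitrary example area
    "2020-08-01T00:00:00Z/2020-08-31T23:59:59Z",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

If the endpoint is reachable, passing `request` to `urllib.request.urlopen` and parsing the JSON response should give a GeoJSON FeatureCollection whose `features` are the matching STAC Items.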

How does this relate to Forge?
My understanding so far is that Forge is about developing a flexible system where users can define new processes, share those processes, and control processing through an API. In Cirrus, a user can create a task (either Lambda or Batch) and workflows using that task easily enough, but there is no API for controlling processing, and Cirrus doesn't provide plug-ins or anything similar. To deploy Cirrus with additions, you fork the repo and make the additions there. I don't have a detailed roadmap yet, but one of the ideas was to split out the various Lambdas, tasks, and core pieces so that a user could more easily assemble workflows from building blocks they could get from multiple places, which seems more like what Forge is about.

Thoughts? What are the possible touch points here?


rabernat · 2021-01-22T21:40:48Z

Hi @matthewhanson and thanks for taking the time to share! Sorry no one got back to you for so long.

We just published a new documentation site that clarifies how Pangeo Forge will work: https://pangeo-forge.readthedocs.io/en/latest/index.html

I'll look into Cirrus and consider the questions you raised.

rabernat added the "design question" label (A question of the design of Pangeo Forge) · Jan 22, 2021
