Cirrus geospatial pipeline #12

Open
matthewhanson opened this issue Aug 31, 2020 · 1 comment
Labels
design question A question of the design of Pangeo Forge

Comments

@matthewhanson

matthewhanson commented Aug 31, 2020

Hey all, @scottyhq mentioned this Forge thing a few weeks ago, and @rsignell-usgs suggested I post here. We recently open-sourced Cirrus, an AWS pipeline for processing geospatial data. On the surface it sounds similar to Pangeo Forge, but after some reading it seems like they are pretty different things, although there might be some overlapping efforts we could discuss.

What is Cirrus?
Cirrus is a mostly serverless AWS architecture that makes use of DynamoDB, Lambda, SNS, SQS, Step Functions, and Batch.
Cirrus is meant for scaling up and for doing both historical and ongoing real-time processing of geospatial data via STAC metadata and assets.

See the repo and docs on architecture and usage.

What can you use Cirrus for?
Cirrus has served as the backend processing architecture for a few projects. With it you can:

  • Fetch metadata from another source, transform to STAC and publish it
  • Fetch STAC metadata from an API or S3 bucket (or S3 inventory) and perform processing on assets, such as copying, converting to COG, or generating preview images and thumbnails
  • Count how many jobs are in a state of PROCESSING, COMPLETED, or FAILED by datetime or collection via an API
  • Monitor and track jobs through the system, with links provided to the original input and to the complete Step Function execution logs for all tasks within a workflow
  • Rerun jobs based on last state, datetime, or collection
  • Publish STAC metadata via SNS to be consumed by something like STAC-server
  • Maintain data provenance when transforming data by including "derived_from" and "copied_from" links back to the source STAC metadata
  • Chain together Cirrus workflows by providing default processes for collections. Cirrus can consume its own published data and assign a new workflow based on the Collection the Item belongs to, e.g., L0->L1->L2
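
The last two items in the list above (provenance links and chaining by Collection) can be sketched roughly as follows. This is only an illustration working on plain STAC Item dicts, not Cirrus's actual code: the function names, the `DEFAULT_WORKFLOWS` mapping, the collection ids, and the example URL are all hypothetical.

```python
from copy import deepcopy

# Hypothetical mapping: Collection id -> workflow that should process its Items.
# A real deployment would define its own names; these are placeholders.
DEFAULT_WORKFLOWS = {
    "example-l0": "l0-to-l1",
    "example-l1": "l1-to-l2",
}


def add_provenance_link(item, source_href, rel="derived_from"):
    """Return a copy of a STAC Item dict with a link back to its source Item."""
    out = deepcopy(item)
    links = out.setdefault("links", [])
    # Avoid adding the same provenance link twice if the Item is reprocessed.
    already_linked = any(
        link.get("rel") == rel and link.get("href") == source_href
        for link in links
    )
    if not already_linked:
        links.append({"rel": rel, "href": source_href})
    return out


def next_workflow(item, defaults=DEFAULT_WORKFLOWS):
    """Pick the follow-on workflow from the Item's Collection (None if no default)."""
    return defaults.get(item.get("collection"))


# Example: an L1 Item derived from an L0 Item, then routed to the next workflow.
l1_item = {"id": "scene-1-l1", "collection": "example-l1", "links": []}
l1_item = add_provenance_link(l1_item, "https://example.com/items/scene-1-l0")
workflow = next_workflow(l1_item)
```

Keeping the routing in a plain mapping means the L0->L1->L2 chain is just a matter of each workflow publishing into a Collection that has its own default.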

Sentinel-2 COG Public Dataset
I've used Cirrus to create the new Sentinel-2 COGs. This started off for just Africa, but we are currently more than halfway through processing the entire global archive from the original JP2K format, about 6 million scenes. This past weekend I generated nearly 2 million converted Sentinel scenes using Cirrus. Each scene is 17 COGs, so that's about 34 million file conversions in a couple of days. All failures and successes are tracked, so errors can be identified and dealt with, and subsequent runs don't reprocess completed data.
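
To make the state tracking concrete, here is a minimal sketch of counting jobs by state and selecting only failed jobs for a rerun. It assumes an in-memory dict as a stand-in for the real state store (Cirrus uses DynamoDB); the job ids and function names are made up for illustration.

```python
from collections import Counter

# The three job states mentioned in this post.
STATES = ("PROCESSING", "COMPLETED", "FAILED")


def count_by_state(jobs):
    """Count jobs per state. `jobs` maps a job id to its last known state."""
    return Counter(jobs.values())


def jobs_to_rerun(jobs, rerun_states=("FAILED",)):
    """Return ids of jobs whose last state qualifies for a rerun.

    COMPLETED jobs are never selected by default, so a later run
    does not reprocess data that already succeeded.
    """
    return sorted(job_id for job_id, state in jobs.items() if state in rerun_states)


# Hypothetical snapshot of the state store.
jobs = {
    "scene-a": "COMPLETED",
    "scene-b": "FAILED",
    "scene-c": "PROCESSING",
    "scene-d": "FAILED",
}
counts = count_by_state(jobs)
retry = jobs_to_rerun(jobs)
```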

First I indexed the existing AWS Public Datasets: 20 million sentinel-s2-l1c and sentinel-s2-l2a scenes in JP2K, by converting the Sentinel metadata (tileInfo.json) into STAC Items and indexing those. Then, using that STAC API, I fetched the L2A scenes and converted them to COGs, linking back to the L2A record for provenance, and published/indexed the results back in Earth-Search (the sentinel-s2-l2a-cogs collection). All of these steps (publishing, converting, copying) were done with Cirrus.

The COG archive is available through Earth-search: https://earth-search.aws.element84.com/v0
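
As a rough sketch of how one might query that archive, the snippet below only builds a STAC API search request for the sentinel-s2-l2a-cogs collection; it does not send it. It assumes the endpoint exposes a standard `/search` route accepting `collections`, `bbox`, `datetime`, and `limit` in a JSON body, which I have not verified against this specific API version. The bounding box and date range are arbitrary examples.

```python
import json

# API root from this post; the "/search" route below is an assumption
# based on the general STAC API convention.
API_ROOT = "https://earth-search.aws.element84.com/v0"


def build_search_request(collection, bbox, datetime_range, limit=10):
    """Build (url, json_body) for a STAC item search, without sending it."""
    url = API_ROOT.rstrip("/") + "/search"
    body = {
        "collections": [collection],
        "bbox": list(bbox),          # [west, south, east, north]
        "datetime": datetime_range,  # "start/end" interval string
        "limit": limit,
    }
    return url, json.dumps(body)


# Arbitrary example area and time window.
url, body = build_search_request(
    "sentinel-s2-l2a-cogs",
    bbox=(30.0, -2.0, 31.0, -1.0),
    datetime_range="2020-08-01T00:00:00Z/2020-08-31T23:59:59Z",
)
```

The resulting `url` and `body` could then be sent as a POST with `Content-Type: application/json` using any HTTP client.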

How's this relate to Forge?
My understanding so far is that Forge is about developing a flexible system where users can define new processes, share those processes, and control processing through an API. In Cirrus, a user can create a task (either Lambda or Batch) and workflows using that task easily enough, but there is no API for controlling processing, and Cirrus doesn't provide plug-ins or anything similar. To deploy Cirrus with additions, you fork the repo and make the additions there. I don't have a detailed roadmap yet, but one of the ideas was to split out the various Lambdas, tasks, and core pieces so that a user could more easily assemble workflows from building blocks they could get from multiple places, which seems more like what Forge is about.

Thoughts? What are the possible touch points here?

@rabernat added the "design question" label Jan 22, 2021
@rabernat
Contributor

Hi @matthewhanson and thanks for taking the time to share! Sorry no one got back to you for so long.

We just published a new documentation site that clarifies how Pangeo Forge will work: https://pangeo-forge.readthedocs.io/en/latest/index.html

I'll look into Cirrus and consider the questions you raised.
