A scalable, cloud-based geospatial data processing pipeline for ingesting and processing CHIRPS climate data (in its initial instance) and MODIS satellite imagery using Azure Batch and Azure GeoCatalog. The project automates conversion of geospatial raster data to Cloud Optimized GeoTIFFs (COGs) and generates STAC (SpatioTemporal Asset Catalog) metadata for efficient discovery and visualization.
This project supports two distinct data ingestion workflows:
A straightforward search-and-load process:
- Search - Query Microsoft Planetary Computer's STAC catalog for available data
- Validate - Check and fix STAC metadata if needed
- Ingest - Directly ingest STAC items into your Azure GeoCatalog
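The search step of this flow can be sketched with pystac-client and the planetary-computer package. Note that pystac-client is not in the dependency list shown later, and the collection ID, bounding box, and date range below are placeholders; the actual search, validation, and ingestion live in src/ingestion/ingestion_from_datacatalog.py.

# Illustrative sketch of the "Search" step against Microsoft Planetary Computer.
# Assumes pystac-client is installed; collection, bbox, and dates are placeholders.
from pystac_client import Client
import planetary_computer

catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,  # signs asset HREFs on access
)

search = catalog.search(
    collections=["modis-13Q1-061"],
    bbox=[2.316388, 3.837669, 15.126447, 14.153350],
    datetime="2020-01-01/2020-12-31",
)

items = list(search.items())
print(f"Found {len(items)} items to validate and ingest")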
A full processing pipeline for raw geospatial data:
- Extract - Scrape CHIRPS precipitation data URLs from UCSB
- Download - Retrieve raw GeoTIFF files
- Convert to COGs - Transform to Cloud Optimized GeoTIFFs using Azure Batch
- Generate STAC - Create STAC metadata items with proper geospatial indexing
- Ingest - Upload to Azure GeoCatalog for visualization and discovery
This workflow leverages Azure Batch for distributed processing of thousands of files.
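As a rough illustration of how that fan-out might look with the azure-batch SDK (account details, pool/job IDs, and command lines below are placeholders; the project's actual orchestration is in src/cog_creation/batch_job_creator.py):

# Illustrative sketch of submitting per-file tasks to Azure Batch.
# BATCH_ACCOUNT_NAME/KEY are placeholder variable names; the project itself
# uses service-principal configuration (see configs/config.env).
import os
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials(
    os.environ["BATCH_ACCOUNT_NAME"],
    os.environ["BATCH_ACCOUNT_KEY"],
)
client = BatchServiceClient(credentials, batch_url=os.environ["BATCH_ACCOUNT_URL"])

# One job per pipeline run, backed by an existing pool
client.job.add(batchmodels.JobAddParameter(
    id="cog-conversion-job",
    pool_info=batchmodels.PoolInformation(pool_id="cog-pool"),
))

# One task per work-item file stored in the task-data container
tasks = [
    batchmodels.TaskAddParameter(
        id=f"cog-task-{i}",
        command_line=f"python batch_task_runner.py --work-item work_item_{i}.json",
    )
    for i in range(10)
]
client.task.add_collection(job_id="cog-conversion-job", value=tasks)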
- Two Ingestion Modes: Simple direct ingestion from Planetary Computer OR complex batch processing for raw data
- Scalable Processing: Azure Batch integration for distributed processing of large datasets (CHIRPS pipeline)
- Cloud-Optimized: Generates COGs for efficient cloud-native geospatial workflows (CHIRPS pipeline)
- STAC Compliant: Full STAC metadata generation and validation for both workflows
- Automated Ingestion: Direct integration with Microsoft Planetary Computer and Azure GeoCatalog
- Progress Monitoring: Real-time tracking of batch jobs (CHIRPS pipeline) and ingestion status
- Error Handling: Automatic validation and correction of STAC metadata errors
- Flexible Configuration: Support for multiple regions, date ranges, and data sources
The system is organized into modular components:
- COG Creation Pipeline (src/cog_creation/) - Extract and convert raw data to COGs
- STAC Creation Pipeline (src/stac_creation/) - Generate STAC metadata and catalog
- Ingestion (src/ingestion/) - Import data from external sources (Planetary Computer)
- Shared Utilities (src/utils/) - Common Azure Batch and Storage operations
WORKFLOW 1: Direct Ingestion from Planetary Computer
┌─────────────────────────┐
│ Microsoft Planetary │
│ Computer │
│ (STAC Catalog) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Search STAC API │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Validate Items │
│ (Fix metadata) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Azure GeoCatalog │
│ (Direct Ingestion) │
└─────────────────────────┘
WORKFLOW 2: CHIRPS Data Processing Pipeline
┌─────────────────────────┐
│ CHIRPS Data │
│ (UCSB) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Azure Batch Operation 1 │
│ (Extract URLs → Download TIFFs → Convert to COGs) │
└───────────┬─────────────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ Azure Blob Storage │
│ (processed-cogs) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Azure Batch Operation 2 │
│ (Generate STAC Items) │
└───────────┬─────────────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ Azure Blob Storage │
│ (stac-items) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Bulk Ingestion Script │
└───────────┬─────────────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ Azure GeoCatalog │
│ (Bulk Ingestion) │
└─────────────────────────┘
For Both Workflows:
- Azure subscription with Azure GeoCatalog instance
- Python 3.11+
- Conda environment manager
- Azure CLI
For CHIRPS Pipeline Only (Workflow 2):
- Azure Batch account
- Azure Storage account with blob containers
- Azure Key Vault
- Service principal with appropriate permissions
- Clone the repository:

  git clone <repository-url>
  cd MPCP_lassa_sentinel

- Create and activate the conda environment:

  conda create -n MPCP_lassasentinel python=3.11 -y
  conda activate MPCP_lassasentinel
  pip install -r requirements.txt

- Configure environment variables:

  cp configs/config.env.example configs/config.env
  # Edit config.env with your Azure credentials
  source configs/config.env
python src/ingestion/ingestion_from_datacatalog.py \
--geocatalog-url "https://your-catalog.geocatalog.spatio.azure.com" \
--pc-collection "modis-13Q1-061" \
--bbox-aoi 2.316388 3.837669 15.126447 14.153350 \
--date-range "2020-01-01/2020-12-31" \
--region "nigeria" \
  --batch-size 100

python src/cog_creation/batch_job_creator.py
python src/cog_creation/progress_monitor.py

python src/stac_creation/orchestrate_catalog_ingestion.py \
--geocatalog-uri "https://your-catalog.geocatalog.spatio.azure.com" \
--collection-id "Nigeria-CHIRPS" \
--collection-title "Nigeria CHIRPS Collection" \
--collection-desc "CHIRPS v2.0 precipitation data for Nigeria" \
  --bbox 2.316388 3.837669 15.126447 14.153350

MPCP_lassa_sentinel/
├── src/
│ ├── cog_creation/ # COG processing pipeline
│ │ ├── batch_job_creator.py
│ │ ├── batch_task_runner.py
│ │ ├── progress_monitor.py
│ │ ├── data_extraction.py
│ │ └── processing.py
│ ├── stac_creation/ # STAC metadata generation
│ │ ├── batch_job_creator.py
│ │ ├── batch_task_runner.py
│ │ ├── progress_monitor.py
│ │ ├── stac_conversion.py
│ │ ├── define_catalog.py
│ │ └── orchestrate_catalog_ingestion.py
│ ├── ingestion/ # External data ingestion
│ │ └── ingestion_from_datacatalog.py
│ └── utils/ # Shared utilities
│ ├── azure_batch_utils.py
│ ├── azure_storage_utils.py
│ ├── base_progress_monitor.py
│ └── batch_task_utils.py
├── configs/ # Configuration files
├── docs/ # Comprehensive documentation
│ ├── SETUP.md
│ ├── BATCH_AND_PROCESSING.md
│ ├── TESTING.md
│ └── TROUBLESHOOTING.md
├── test_suite/ # Unit and integration tests
└── notebooks/ # Jupyter notebooks for exploration
Comprehensive guides are available in the docs/ directory:
- SETUP.md - Complete setup instructions for Azure, GeoCatalog, Storage, and Authentication
- BATCH_AND_PROCESSING.md - Azure Batch configuration, ingestion sources, and processing pipelines
- TESTING.md - Running unit and integration tests with Azurite
- TROUBLESHOOTING.md - Common issues and solutions
Description: Search Microsoft Planetary Computer's STAC catalog and directly ingest items into your Azure GeoCatalog. This workflow is lightweight and doesn't require Azure Batch or COG conversion since Planetary Computer data is already cloud-optimized.
Use Cases:
- Ingesting MODIS data (already in COG format)
- Loading Sentinel-2, Landsat, or other Planetary Computer collections
- Quick data availability checks for specific regions/dates
Key Files:
- src/ingestion/ingestion_from_datacatalog.py - MPC to GeoCatalog ingestion with STAC validation
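The "validate and fix" behaviour can be approximated with pystac. The specific correction shown (filling a missing datetime from start_datetime) is only an example of the kind of fix the script may apply, not its exact logic:

# Illustrative STAC validation/fix loop; the actual corrections applied by
# ingestion_from_datacatalog.py may differ. item.validate() requires pystac's
# validation extra (jsonschema).
import pystac
from pystac import STACValidationError

def validate_or_fix(item: pystac.Item) -> pystac.Item:
    try:
        item.validate()
    except STACValidationError:
        # Example fix: derive a missing datetime from the item's range fields
        if item.datetime is None and "start_datetime" in item.properties:
            item.datetime = pystac.utils.str_to_datetime(item.properties["start_datetime"])
        item.validate()  # re-raise if the item is still invalid
    return item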
Azure Resources Needed:
- Azure GeoCatalog instance only (no Batch or Storage required)
Description: Full processing pipeline for raw CHIRPS precipitation data. Extracts data from UCSB, converts to COGs using Azure Batch for parallel processing, generates STAC metadata, and ingests into GeoCatalog.
Use Cases:
- Processing CHIRPS precipitation data from UCSB
- Converting any raw GeoTIFF data to cloud-optimized format
- Large-scale batch processing of thousands of files
Pipeline Stages:
Extracts CHIRPS precipitation data URLs from UCSB and converts them to Cloud Optimized GeoTIFFs using Azure Batch for parallel processing.
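The URL-extraction step can be pictured roughly as follows, using requests and BeautifulSoup (both in the dependencies). The directory URL is a placeholder and the real logic lives in data_extraction.py:

# Rough sketch of scraping GeoTIFF links from a CHIRPS directory listing.
# The listing URL is a placeholder; see src/cog_creation/data_extraction.py
# for the actual extraction logic.
import requests
from bs4 import BeautifulSoup

listing_url = "https://data.chc.ucsb.edu/products/CHIRPS-2.0/africa_daily/tifs/p05/2020/"
response = requests.get(listing_url, timeout=60)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
tif_urls = [
    listing_url + a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].endswith((".tif", ".tif.gz"))
]
print(f"Found {len(tif_urls)} CHIRPS files")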
Key Files:
- src/cog_creation/data_extraction.py - Web scraping of CHIRPS URLs
- src/cog_creation/batch_job_creator.py - Batch job orchestration
- src/cog_creation/batch_task_runner.py - Executed on Batch nodes
- src/cog_creation/progress_monitor.py - Real-time progress tracking
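For the conversion itself, a minimal sketch using rasterio and the GDAL COG driver is shown below (assumes GDAL 3.1+; file paths are placeholders and processing.py may take a different approach):

# Minimal COG conversion sketch using rasterio's copy with the GDAL "COG" driver.
# Paths are placeholders; the project's processing.py may differ.
from rasterio.shutil import copy as raster_copy

raster_copy(
    "raw-data/chirps-v2.0.2020.01.01.tif",        # downloaded source GeoTIFF
    "processed-cogs/chirps-v2.0.2020.01.01.tif",  # cloud-optimized output
    driver="COG",
    compress="DEFLATE",
)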
Converts COG files to STAC items with proper metadata including spatial extent, temporal coverage, and asset references.
Key Files:
- src/stac_creation/stac_conversion.py - COG to STAC conversion
- src/stac_creation/batch_job_creator.py - Batch job for STAC generation
- src/stac_creation/batch_task_runner.py - Executed on Batch nodes
- src/stac_creation/define_catalog.py - STAC catalog generation
- src/stac_creation/orchestrate_catalog_ingestion.py - Catalog creation and bulk ingestion
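A minimal illustration of building one STAC item from a COG with rasterio and pystac follows; the item ID, file paths, and asset URL are placeholders, and stac_conversion.py carries the real logic:

# Minimal sketch of turning one COG into a STAC item with rasterio + pystac.
# IDs, paths, and the asset href are placeholders.
from datetime import datetime, timezone

import pystac
import rasterio
from rasterio.warp import transform_bounds

cog_path = "processed-cogs/chirps-v2.0.2020.01.01.tif"

# Reproject the raster bounds to WGS84 for the STAC bbox/geometry
with rasterio.open(cog_path) as src:
    bbox = list(transform_bounds(src.crs, "EPSG:4326", *src.bounds))

west, south, east, north = bbox
item = pystac.Item(
    id="chirps-v2.0.2020.01.01",
    bbox=bbox,
    geometry={
        "type": "Polygon",
        "coordinates": [[
            [west, south], [east, south], [east, north], [west, north], [west, south]
        ]],
    },
    datetime=datetime(2020, 1, 1, tzinfo=timezone.utc),
    properties={},
)
item.add_asset(
    "data",
    pystac.Asset(
        href="https://your-storage.blob.core.windows.net/processed-cogs/chirps-v2.0.2020.01.01.tif",
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)
item.save_object(dest_href="stac-items/chirps-v2.0.2020.01.01.json", include_self_link=False)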
Azure Resources Needed:
- Azure GeoCatalog instance
- Azure Batch account (for distributed processing)
- Azure Storage account (5 containers)
- Azure Key Vault
- Service principal
Note: Storage containers are only required for Workflow 2 (CHIRPS Pipeline). Workflow 1 (Planetary Computer) doesn't need Azure Storage.
The CHIRPS processing pipeline uses five Azure Blob Storage containers:
- raw-data - Original downloaded TIFF files
- processed-cogs - Cloud Optimized GeoTIFFs
- stac-items - STAC JSON metadata files
- task-data - Work item JSON files for batch tasks
- batch-logs - Progress tracking files
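Interacting with these containers from Python follows the standard azure-storage-blob pattern. The blob name below is a placeholder, and authentication could equally use the service principal credentials from configs/config.env:

# Sketch of uploading a processed COG to the processed-cogs container.
# Blob name is a placeholder; DefaultAzureCredential is one of several options.
import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=os.environ["STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("processed-cogs")

with open("chirps-v2.0.2020.01.01.tif", "rb") as fh:
    container.upload_blob(name="chirps/2020/chirps-v2.0.2020.01.01.tif", data=fh, overwrite=True)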
Run the test suite from the test_suite/ directory:
cd test_suite
# All tests with coverage
make test-all
# Unit tests only (fast)
make test-unit
# Integration tests (requires Azurite)
make test-integration

See TESTING.md for detailed testing instructions.
For Workflow 1 (Planetary Computer Ingestion):
- Azure GeoCatalog instance (canadacentral, northcentralus, or westeurope)
For Workflow 2 (CHIRPS Pipeline) - Additional Requirements:
- Azure Batch account (User Subscription mode)
- Azure Storage account with 5 blob containers
- Azure Key Vault
- Service principal with appropriate permissions
Key dependencies (see requirements.txt for full list):
azure-storage-blob>=12.19.0
azure-batch>=14.0.0
azure-identity>=1.15.0
rasterio>=1.3.0
pystac>=1.9.0
planetary-computer>=1.0.0
requests>=2.31.0
beautifulsoup4>=4.12.0
Environment variables are stored in configs/config.env:
# Azure Batch
export AZURE_TENANT_ID="your-tenant-id"
export AZURE_CLIENT_ID="your-client-id"
export AZURE_CLIENT_SECRET="your-client-secret"
export BATCH_ACCOUNT_URL="https://your-batch.batch.azure.com"
# Azure Storage
export STORAGE_ACCOUNT_URL="https://your-storage.blob.core.windows.net"
export BATCH_STORAGE_ACCOUNT_KEY="your-storage-key"

See SETUP.md for detailed configuration instructions.
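Once config.env has been sourced, these variables can be turned into an azure-identity credential; a minimal example is shown below (the project's utilities may construct credentials differently):

# Building a credential from the environment variables defined in config.env.
import os
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)
# The credential can then be passed to clients such as BlobServiceClient,
# together with STORAGE_ACCOUNT_URL from the same file.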
Common issues and solutions are documented in TROUBLESHOOTING.md, including:
- GeoCatalog location and naming requirements
- Azure Batch authentication issues
- Storage access and SAS token errors
- STAC validation failures
- Performance optimization tips
- Visualization and rendering issues
For issues or questions:
- Check the Troubleshooting Guide
- Review Setup Documentation
- Consult Azure documentation
- Check Azure service health: https://status.azure.com/