This repository contains an ETL (Extract, Transform, Load) data pipeline built using Apache Airflow, Python, and Docker. The pipeline extracts COVID-19 data from the European Centre for Disease Prevention and Control (ECDC), transforms it, and loads it into a PostgreSQL database.
This project demonstrates a complete ETL workflow orchestrated by Apache Airflow and containerized with Docker. The pipeline processes COVID-19 case and death data for European countries, providing insights into the pandemic's progression.
The project is built with the following components:
- Apache Airflow: Workflow orchestration platform
- Python: Programming language for data processing
- Docker & Docker Compose: Containerization for consistent environments
- PostgreSQL: Database for storing processed data
- Psycopg2: PostgreSQL adapter for Python
The data is extracted from the European Centre for Disease Prevention and Control (ECDC), which provides COVID-19 statistics, including:
- Daily case counts
- Death counts
- Country-specific data
- Date information
The pipeline specifically targets the publicly available COVID-19 dataset from ECDC, which is accessed via their API.
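For illustration only, a minimal extract step could look like the sketch below. The endpoint URL and the `records` key are assumptions based on ECDC's public JSON feed, not values taken from this repository:

```python
import requests

# Hypothetical ECDC endpoint; the URL actually used by the DAG may differ.
ECDC_URL = "https://opendata.ecdc.europa.eu/covid19/casedistribution/json/"

def extract_covid_data() -> list[dict]:
    """Fetch raw COVID-19 records from the ECDC JSON feed."""
    response = requests.get(ECDC_URL, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    # Assumption: the feed wraps its rows in a top-level "records" key.
    return response.json().get("records", [])
```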
```
ETL_pipeline_airflow-python-docker/
├── dags/
│   ├── covid_pipeline_dag.py      # Main Airflow DAG definition
│   └── utils/
│       ├── countries.py           # Country code utilities
│       ├── dates.py               # Date handling functions
│       └── transformers.py        # Data transformation logic
├── docker-compose.yml             # Docker configuration
├── Dockerfile                     # Docker image definition
├── requirements.txt               # Python dependencies
└── scripts/
    ├── init.sh                    # Initialization script
    └── entrypoint.sh              # Docker entrypoint
```
The ETL pipeline consists of the following steps:
- Extract: Fetch COVID-19 data from the ECDC API
- Transform: Clean and process the data (see the sketch after this list)
  - Filter European countries
  - Format dates
  - Calculate additional metrics
- Load: Store the processed data in the PostgreSQL database
- Visualize: Generate basic visualizations (if enabled)
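For illustration, the transform step might look like the following pandas sketch. The column names mirror the database schema shown later, and the country list is a placeholder; the real logic lives in `dags/utils/transformers.py` and `dags/utils/countries.py`, so treat everything here as an assumption:

```python
import pandas as pd

# Illustrative subset; the full list would come from dags/utils/countries.py.
EUROPEAN_COUNTRIES = {"DE", "FR", "IT", "ES", "PL"}

def transform(records: list[dict]) -> pd.DataFrame:
    """Filter to European countries, normalize dates, add cumulative metrics."""
    df = pd.DataFrame(records)
    df = df[df["country_code"].isin(EUROPEAN_COUNTRIES)].copy()
    df["date"] = pd.to_datetime(df["date"]).dt.date
    df = df.sort_values(["country_code", "date"])
    # Running totals per country yield the cumulative_cases/deaths columns.
    df["cumulative_cases"] = df.groupby("country_code")["cases"].cumsum()
    df["cumulative_deaths"] = df.groupby("country_code")["deaths"].cumsum()
    return df
```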
To run the pipeline locally, you will need:
- Docker and Docker Compose
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/No0Bitah/ETL_pipeline_airflow-python-docker.git
  cd ETL_pipeline_airflow-python-docker
  ```

- Build and start the containers:

  ```bash
  docker-compose up -d
  ```
- Access the Airflow web interface at http://localhost:8080. Default credentials:
  - Username: airflow
  - Password: airflow
- Trigger the DAG manually or wait for the scheduled run
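Alternatively, a run can be triggered programmatically through Airflow's stable REST API, assuming the basic-auth API backend is enabled in this setup. The DAG id below is a guess based on the DAG file name, not a value confirmed by the repository:

```python
import requests

# Assumed DAG id, inferred from covid_pipeline_dag.py; check the Airflow UI.
DAG_ID = "covid_pipeline"

response = requests.post(
    f"http://localhost:8080/api/v1/dags/{DAG_ID}/dagRuns",
    json={},                      # empty body starts a run with default config
    auth=("airflow", "airflow"),  # the default credentials listed above
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```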
The project can be configured by modifying:
- Environment variables in the `docker-compose.yml` file
- DAG parameters in `dags/covid_pipeline_dag.py` (illustrated below)
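As a rough guide, the tunable parameters usually sit at the top of the DAG file. The sketch below shows the general shape only; the `dag_id`, schedule, and retry values are assumptions, not values copied from `covid_pipeline_dag.py`:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_pipeline():
    """Placeholder callable; the real tasks live in dags/utils/."""
    ...


default_args = {
    "owner": "airflow",
    "retries": 2,                        # assumed retry policy
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="covid_pipeline",             # assumed; check the actual DAG file
    start_date=datetime(2023, 1, 1),     # assumed start date
    schedule_interval="@daily",          # assumed daily cadence
    catchup=False,
    default_args=default_args,
) as dag:
    etl = PythonOperator(task_id="run_etl", python_callable=run_pipeline)
```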
The processed COVID-19 data is stored in a table with the following structure:
| Column | Type | Description |
|---|---|---|
| country_code | VARCHAR | ISO country code |
| country | VARCHAR | Country name |
| date | DATE | Report date |
| cases | INTEGER | Daily reported cases |
| deaths | INTEGER | Daily reported deaths |
| cumulative_cases | INTEGER | Total cases up to date |
| cumulative_deaths | INTEGER | Total deaths up to date |
| updated_at | TIMESTAMP | Last update timestamp |
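A load step matching this schema might use Psycopg2 along these lines; the table name, connection settings, and transaction layout are assumptions for illustration (in the containers, the connection details come from environment variables in `docker-compose.yml`):

```python
import psycopg2

# Assumed connection settings; the real values come from docker-compose.yml.
conn = psycopg2.connect(
    host="postgres", dbname="airflow", user="airflow", password="airflow"
)

# "covid_data" is an assumed table name matching the schema above.
INSERT_SQL = """
    INSERT INTO covid_data (
        country_code, country, date, cases, deaths,
        cumulative_cases, cumulative_deaths, updated_at
    )
    VALUES (%s, %s, %s, %s, %s, %s, %s, NOW())
"""

def load(rows: list[tuple]) -> None:
    """Insert the transformed rows into PostgreSQL in a single transaction."""
    with conn, conn.cursor() as cur:  # "with conn" commits on success
        cur.executemany(INSERT_SQL, rows)
```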
This project is available under the MIT License.
- European Centre for Disease Prevention and Control for providing the COVID-19 data
- Apache Airflow community
- Docker community