Aviation provides the only rapid worldwide transportation network, which makes it essential for global business. It generates economic growth, creates jobs, and facilitates international trade and tourism. The air transport industry supports a total of 65.5 million jobs globally, of which 10.2 million are direct jobs.
This project analyses the impact of Covid-19 on the aviation industry. It was also an opportunity to build skills and experience with a range of tools, including Apache Airflow, Apache Spark, Tableau, and several AWS cloud services.
Airflow orchestrates the following tasks:
- Upload the data and scripts from the local machine to an S3 bucket
- Provision an EMR cluster
- Submit a Spark job to the EMR cluster that executes the ETL workflow
- Wait for the Spark job to complete
- Terminate the EMR cluster
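The Spark submission task in the list above can be illustrated with an EMR step definition. The following is a minimal sketch, not the project's actual DAG code: it builds a step as a plain dictionary in the shape the EMR `AddJobFlowSteps` API accepts. The helper name `build_spark_step`, the step name, the `--deploy-mode cluster` setting, and the `scripts/` prefix in the usage line are all assumptions made for illustration.

```python
def build_spark_step(bucket: str, script_key: str) -> dict:
    """Return an EMR step that runs a PySpark script stored in S3.

    The step name and deploy mode are illustrative assumptions.
    """
    return {
        "Name": "Run covid_flights ETL",
        # If the step fails, cancel any pending steps and leave the
        # cluster in the WAITING state instead of terminating it.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # command-runner.jar lets EMR run commands such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                f"s3://{bucket}/{script_key}",
            ],
        },
    }


# Hypothetical usage: the bucket name and script path are placeholders.
step = build_spark_step("your-bucket-name", "scripts/covid_flights_etl.py")
```

In Airflow, a list of such dictionaries is typically what gets passed to the task that adds steps to the cluster, with a separate sensor task waiting for the step to finish.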
-
The data in this dataset is derived and cleaned from the full OpenSky dataset to illustrate the development of air traffic during the COVID-19 pandemic. It spans all flights seen by the network's more than 2500 members since 1 January 2019.
To avoid out-of-memory issues, the data is loaded into the S3 bucket incrementally.
Martin Strohmeier, Xavier Olive, Jannis Lübbe, Matthias Schäfer, and Vincent Lenders
"Crowdsourced air traffic data from the OpenSky Network 2019–2020"
Earth System Science Data 13(2), 2021
-
This dataset includes time-series data tracking the number of people affected by COVID-19 worldwide.
-
The data is in CSV format and lists all airport codes.
-
The dataset contains country names (official short names in English) in alphabetical order as given in ISO 3166-1 and the corresponding ISO 3166-1-alpha-2 code elements. [ISO 3166-1]
-
The dataset was used to map countries with continents.
-
Contains ISO-3 codes and names of Indian States.
Find the entire analysis here
- Docker with at least 4GB of RAM and Docker Compose v1.27.0 or later
- AWS account
- AWS CLI installed and configured
- Tableau Desktop
Clone and cd into the project directory.
git clone https://github.com/siddharth271101/Covid-19-and-Aviation-Industry.git
cd Covid-19-and-Aviation-Industry
Note: Replace {your-bucket-name} in setup.sh, covid_flights_etl.py and covid_flights_dag.py before proceeding with the steps mentioned below.
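The placeholder replacement can be done by hand in an editor, or scripted. The sketch below is one possible way to do it and is not part of the project. It assumes all three files sit directly under a single root directory, which may not match the repository layout, and the helper name `set_bucket_name` is hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "{your-bucket-name}"
# File names come from the note above; their location is an assumption.
FILES = ["setup.sh", "covid_flights_etl.py", "covid_flights_dag.py"]


def set_bucket_name(bucket: str, root: Path, files=FILES) -> list:
    """Replace the placeholder in each file and return the files changed."""
    changed = []
    for name in files:
        path = root / name
        text = path.read_text(encoding="utf-8")
        if PLACEHOLDER in text:
            # Rewrite the file only when the placeholder is present.
            path.write_text(text.replace(PLACEHOLDER, bucket), encoding="utf-8")
            changed.append(name)
    return changed
```

Calling `set_bucket_name("my-bucket", Path("."))` from the project directory would rewrite the three files in place. Running it a second time changes nothing, because the placeholder is already gone.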
- Create a virtual environment
- Once the virtual environment is activated, run the following command
$ pip install -r requirements.txt
Download the data and create an S3 bucket by running setup.sh as shown below
sh setup.sh
setup.sh also starts loading the OpenSky data into the S3 bucket incrementally.
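How setup.sh performs the incremental load is not shown here. As a general illustration of the idea, the sketch below streams files to a destination one at a time in fixed-size chunks, so memory use stays bounded regardless of file size. The function name `load_incrementally`, the `open_sink` callable (a stand-in for whatever opens a writable destination, such as an S3 upload stream), and the chunk size are all assumptions.

```python
import shutil
from pathlib import Path
from typing import BinaryIO, Callable, Iterable

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per read; an arbitrary illustrative value


def load_incrementally(
    files: Iterable[Path],
    open_sink: Callable[[str], BinaryIO],
) -> list:
    """Stream each file to its sink in chunks, one file at a time."""
    loaded = []
    for path in files:
        with path.open("rb") as src, open_sink(path.name) as dst:
            # copyfileobj reads and writes at most CHUNK_SIZE bytes at a
            # time, so a whole file is never held in memory.
            shutil.copyfileobj(src, dst, CHUNK_SIZE)
        loaded.append(path.name)
    return loaded
```

For a local dry run, `open_sink` can simply open a file in another directory, for example `lambda name: (Path("out") / name).open("wb")`.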
After setup.sh runs successfully, start the Docker containers using the following command
docker compose -f docker-compose-LocalExecutor.yml up -d
We use the following Docker containers:
- Airflow
- Postgres DB (as Airflow metadata DB)
Open the Airflow UI at http://localhost:8080 in a browser and start the covid_flights_dag DAG.
Once the DAG run is successful, check the output folder of the S3 bucket.
This blog explains the steps in detail to build a Tableau dashboard using Athena as a data source.