This project analyses the number of COVID-19 cases and deaths in 2020 and transforms the population data for further use. The Azure components used here are:
- Azure Data Lake Storage Gen2
- Azure Blob Storage
- Azure Data Factory
- Azure Databricks
- Azure SQL Database
- Azure Service Principal
- Implement a for-each loop that fetches each file listed in ecdc_file_list.json from https://github.com/SharadChoudhury/Azure_Covid19_Analysis/raw/ecdc/main.
- Store the ingested files in the raw/ecdc folder in ADLS (see the sketch below).
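In ADF this loop is a ForEach with a Copy activity inside it. The Python sketch below mirrors the same logic; it assumes ecdc_file_list.json is a JSON array of file names and that `raw` is the ADLS container, and the account URL and service-principal credentials are placeholders.

```python
import json

import requests
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

BASE_URL = "https://github.com/SharadChoudhury/Azure_Covid19_Analysis/raw/ecdc/main"

# Authenticate with the service principal (all IDs are placeholders).
credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
fs = DataLakeServiceClient(
    "https://<storage-account>.dfs.core.windows.net",
    credential=credential).get_file_system_client("raw")

# ecdc_file_list.json is assumed to be a JSON array of file names.
with open("ecdc_file_list.json") as f:
    file_list = json.load(f)

for name in file_list:
    resp = requests.get(f"{BASE_URL}/{name}")
    resp.raise_for_status()
    # Land each file under raw/ecdc in ADLS.
    fs.get_file_client(f"ecdc/{name}").upload_data(resp.content, overwrite=True)
```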
- Store the raw population file under population_raw/BLOB in the Azure Blob container.
- Implement a pipeline that checks whether the raw file exists in the Blob container, fetches its metadata, and, if the column count matches the required count, copies the file to ADLS, as sketched below.
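In ADF this is a Get Metadata activity (checking `exists` and `columnCount`) feeding an If Condition and a Copy activity. A rough Python equivalent follows, with the container, path, and expected column count all assumed:

```python
from azure.storage.blob import BlobClient
from azure.storage.filedatalake import DataLakeServiceClient

EXPECTED_COLUMNS = 13  # hypothetical required column count

blob = BlobClient.from_connection_string(
    "<blob-connection-string>",
    container_name="populationdata",
    blob_name="population_raw/BLOB/population.tsv")

if blob.exists():                                    # Get Metadata: exists
    data = blob.download_blob().readall()
    header = data.split(b"\n", 1)[0].decode()
    if len(header.split("\t")) == EXPECTED_COLUMNS:  # If Condition: columnCount
        # Copy activity: write the validated file to ADLS.
        adls = DataLakeServiceClient(
            "https://<storage-account>.dfs.core.windows.net",
            credential="<account-key>")
        adls.get_file_system_client("raw") \
            .get_file_client("population/population.tsv") \
            .upload_data(data, overwrite=True)
```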
- Create a dataflow that processes the Cases and Deaths file as per the requirements below and stores the processed sink file in ADLS.
- Create a dataflow that processes the Hospital Admissions file as per the requirements below and stores the processed sink file in ADLS (both dataflows follow the pattern sketched below).
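Both dataflows follow the same shape: read the raw file, reshape it, and write a single sink file. The exact rules come from the requirements referenced above, so the PySpark sketch below is only illustrative; the column names, the Europe filter, and the pivot values are assumptions, not the project's actual spec.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv(
    "abfss://raw@<account>.dfs.core.windows.net/ecdc/cases_deaths.csv",
    header=True, inferSchema=True)

processed = (raw
    .filter(F.col("continent") == "Europe")             # hypothetical region filter
    .groupBy("country", "country_code", "date")         # one row per country/date
    .pivot("indicator", ["confirmed cases", "deaths"])  # long-to-wide reshape
    .sum("daily_count"))

processed.write.mode("overwrite").csv(
    "abfss://processed@<account>.dfs.core.windows.net/ecdc/cases_deaths",
    header=True)
```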
- Create an ADF pipeline that runs the Databricks notebook for the population file transformation and stores the processed file in ADLS (sketched below).
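A minimal sketch of the notebook logic, assuming a tab-separated file whose first column packs indicator and country code together (e.g. "PC_Y0_14,DE", as in Eurostat population extracts); the paths are placeholders, and `spark` is predefined in a Databricks notebook.

```python
from pyspark.sql import functions as F

pop = (spark.read
       .option("sep", "\t").option("header", True)
       .csv("abfss://populationdata@<account>.dfs.core.windows.net/"
            "population_raw/BLOB/population.tsv"))

key = pop.columns[0]  # the combined "indicator,country" column
pop = (pop.withColumnRenamed(key, "key")
          .withColumn("age_group", F.split("key", ",")[0])
          .withColumn("country_code", F.split("key", ",")[1])
          .drop("key"))

pop.write.mode("overwrite").parquet(
    "abfss://processed@<account>.dfs.core.windows.net/population")
```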
- Create a master pipeline that runs both child pipelines: 1. ingesting the population data, 2. transforming the population data using Databricks.
- This pipeline should be triggered when the blob for the raw population file is created (see the sketch below).
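The master pipeline and its storage event trigger are ADF configuration (two chained Execute Pipeline activities, with the trigger bound to the blob path of the raw population file). For completeness, here is a sketch of starting the master pipeline through the management SDK; every resource name, including the pipeline name, is a placeholder.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
adf = DataFactoryManagementClient(credential, "<subscription-id>")

# In production the storage event trigger starts this run automatically;
# here we start the master pipeline (hypothetical name) by hand.
run = adf.pipelines.create_run("<resource-group>", "<factory-name>",
                               "pl_master_population")
print(run.run_id)
```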
- Run SQL scripts to create the table schemas in your Azure SQL Database.
- Create pipelines with Copy activities that load the processed Cases and Deaths and Hospital Admissions files into their respective tables in the SQL database, as sketched below.
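In the project these last two steps are a SQL script plus ADF Copy activities; the pyodbc sketch below compresses both into one place, with the table layout and file name as assumptions rather than the project's actual schema.

```python
import csv

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;"
    "UID=<user>;PWD=<password>")
cur = conn.cursor()

# Schema creation, normally run once from the SQL script.
cur.execute("""
    IF OBJECT_ID('dbo.cases_and_deaths') IS NULL
    CREATE TABLE dbo.cases_and_deaths (
        country       VARCHAR(100),
        country_code  VARCHAR(3),
        reported_date DATE,
        cases_count   BIGINT,
        deaths_count  BIGINT
    )""")

# Copy-activity equivalent: load the processed file into the table.
with open("cases_deaths_processed.csv") as f:
    rows = list(csv.reader(f))[1:]  # skip the header row
cur.executemany(
    "INSERT INTO dbo.cases_and_deaths VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
```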