
# Automate_EMR_ETL_pipeline_using_Airflow

This project provides a detailed overview of creating an automated data engineering pipeline. It integrates Apache Airflow for workflow orchestration, uses Apache Spark on AWS EMR for large-scale data processing, and employs Snowflake for data warehousing. Tableau is then used to build visualizations for analyzing the US real estate market.
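The orchestration idea behind the pipeline can be sketched as a dependency graph: each stage runs only after the stage it depends on has finished, which is exactly what an Airflow DAG encodes. The task names below are hypothetical stand-ins for the project's real Airflow tasks; the sketch uses the standard library's `graphlib` rather than Airflow itself, purely to illustrate the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names mirroring the pipeline described above.
# Each key maps to the set of stages that must complete before it runs.
pipeline = {
    "collect_data": set(),
    "upload_to_s3": {"collect_data"},
    "spark_transform_on_emr": {"upload_to_s3"},
    "load_to_snowflake": {"spark_transform_on_emr"},
    "refresh_tableau": {"load_to_snowflake"},
}

# Resolve a valid execution order, just as Airflow's scheduler would.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

In the actual project these dependencies are expressed with Airflow operators chained via `>>`, but the scheduling logic reduces to the same topological ordering shown here.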

The detailed blog can be found here.


Building the pipeline involved the following steps:

- Configuring the necessary AWS services
- Setting up Airflow
- Data collection
- Data transformation using AWS EMR
- Connecting all the tasks into a DAG pipeline
- Data warehousing using Snowflake
- Visualization using Tableau
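For the EMR transformation step, Airflow's `EmrAddStepsOperator` (from the Amazon provider package) and boto3's `add_job_flow_steps` both accept step definitions as plain dictionaries. A minimal sketch of such a step definition is shown below; the bucket and script path are hypothetical, not taken from the project.

```python
# Sketch of an EMR step that submits a PySpark transformation script.
# The S3 locations are placeholders; substitute your own bucket/paths.
SPARK_STEPS = [
    {
        "Name": "transform_real_estate_data",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's built-in command runner
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://example-bucket/scripts/transform.py",  # hypothetical script
            ],
        },
    }
]
```

A list like this would typically be passed to `EmrAddStepsOperator(steps=SPARK_STEPS, ...)` inside the DAG, with a sensor task waiting for the step to complete before the Snowflake load runs.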

An overview of the complete pipeline: *(architecture diagram screenshot)*

The final output of the dashboard created using Tableau:

*(dashboard screenshot)*

For questions or feedback about the project, don't hesitate to reach out to me on LinkedIn.
