This project builds a cloud-based pipeline to extract NYC taxi data from an API and store it in Azure Data Lake Storage (ADLS). Databricks and PySpark are used to transform the data through the medallion architecture (Bronze → Silver → Gold). Delta Lake ensures reliable storage, and Power BI provides visual insights for data-driven decision-making.

Maaztajmohammed/NYC-Taxi-Data-Engineering-Project

NYC-Taxi-Data-Engineering-Project

Overview

This project extracts NYC taxi data from an API and stores it in a data lake. The data is transformed in Databricks with PySpark following the medallion architecture, progressing through the bronze, silver, and gold layers. Delta tables are created in Azure Data Lake Storage (ADLS) and connected to Power BI for dynamic visualizations.
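To make the bronze, silver, and gold progression concrete, here is a minimal PySpark sketch of how the layers might be chained. It is not the project's actual notebook code: the helper names, the Parquet input format, the cleaning rule, and the column names (`trip_distance`, `fare_amount`, `PULocationID`) are all assumptions chosen for illustration. The functions expect an existing `spark` session, such as the one Databricks provides in a notebook.

```python
def abfss_path(container: str, account: str, relative: str) -> str:
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<relative>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative.lstrip('/')}"


def bronze_to_silver(spark, bronze_path: str, silver_path: str) -> None:
    """Read raw files from the bronze layer, clean them, and write a Delta table to silver."""
    raw = spark.read.format("parquet").load(bronze_path)  # assumption: raw data is Parquet
    cleaned = (
        raw.dropDuplicates()
        .filter("trip_distance > 0")  # hypothetical data-quality rule
    )
    cleaned.write.format("delta").mode("overwrite").save(silver_path)


def silver_to_gold(spark, silver_path: str, gold_path: str) -> None:
    """Aggregate the cleaned silver data into a reporting-friendly gold Delta table."""
    silver = spark.read.format("delta").load(silver_path)
    silver.createOrReplaceTempView("silver_trips")
    # Explicit aliases keep the output column names free of parentheses,
    # which Delta rejects in column names by default.
    summary = spark.sql(
        """
        SELECT PULocationID,
               AVG(fare_amount) AS avg_fare_amount,
               COUNT(*)         AS trip_count
        FROM silver_trips
        GROUP BY PULocationID
        """
    )
    summary.write.format("delta").mode("overwrite").save(gold_path)
```

In a notebook, the two steps would be called in order, for example `bronze_to_silver(spark, abfss_path("bronze", "<account>", "trips"), abfss_path("silver", "<account>", "trips"))`, where the container names are placeholders.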

Data Architecture

[Figure: Architecture]

Lessons Learned

  1. Developed a parameterized data pipeline to extract data dynamically from the website.
  2. Enabled Databricks to access ADLS using a service principal for secure, controlled access.
  3. Adopted the medallion architecture (bronze, silver, and gold layers) for systematic data transformation and enrichment.
  4. Used Delta Lake for efficient data storage and created Delta tables over the stored data.
  5. Leveraged data versioning and time travel, which allow historical data retrieval and rollback.
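The parameterized extraction in item 1 can be pictured as generating one source URL per parameter combination and looping over the result. The sketch below assumes monthly files named `<taxi_type>_tripdata_<year>-<month>.parquet`; the base URL is a placeholder and the function name is hypothetical, so adjust both to the real source.

```python
BASE_URL = "https://example.com/trip-data"  # placeholder, not the project's real endpoint


def monthly_file_urls(taxi_type: str, year: int, months, base_url: str = BASE_URL) -> list:
    """Return one download URL per requested month (assumed file-naming pattern)."""
    urls = []
    for month in months:
        if not 1 <= month <= 12:
            raise ValueError(f"month must be between 1 and 12, got {month}")
        # Zero-pad the month so January becomes "01".
        urls.append(f"{base_url}/{taxi_type}_tripdata_{year}-{month:02d}.parquet")
    return urls
```

A pipeline activity or notebook loop can then iterate over `monthly_file_urls("green", 2023, range(1, 13))` and copy each file into the bronze layer.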
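For item 2, service principal access to ADLS Gen2 is typically configured through a set of Spark properties for OAuth client credentials. The sketch below builds those properties as a dictionary and applies them to a session. The function names and all argument values are hypothetical, and the project's actual setup may differ (for example, it could use a mount or Unity Catalog instead).

```python
def service_principal_conf(storage_account: str, client_id: str,
                           client_secret: str, tenant_id: str) -> dict:
    """Build the Spark settings for OAuth access to one ADLS Gen2 storage account."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(spark, conf: dict) -> None:
    """Apply each setting to the active Spark session."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In Databricks, the client secret would normally be read from a secret scope with `dbutils.secrets.get(scope=..., key=...)` rather than written into the notebook.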
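The versioning and time travel features in item 5 can be exercised roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the table is addressed by its storage path, and the `RESTORE` command requires a Delta Lake version that supports it (Databricks does).

```python
def read_version(spark, table_path: str, version: int):
    """Load the table as it existed at a given version (time travel read)."""
    return spark.read.format("delta").option("versionAsOf", version).load(table_path)


def history_sql(table_path: str) -> str:
    """SQL that lists the table's commit history, one row per version."""
    return f"DESCRIBE HISTORY delta.`{table_path}`"


def restore_sql(table_path: str, version: int) -> str:
    """SQL that rolls the table back to an earlier version."""
    return f"RESTORE TABLE delta.`{table_path}` TO VERSION AS OF {int(version)}"
```

For example, `spark.sql(history_sql(path)).show()` helps pick a version, `read_version(spark, path, 0)` inspects it, and `spark.sql(restore_sql(path, 0))` performs the rollback.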
