Data Engineering Portfolio

Overview

This repository contains a collection of data engineering projects demonstrating ETL pipelines, SQL optimization, cloud data processing, orchestration with Apache Airflow, and Power BI dashboards. The goal is to showcase practical skills in Python, SQL, PySpark, Azure, and Power BI for real-world data engineering workflows.

Projects

1. ETL Pipeline with Python & SQL

  • Goal: Extract, transform, and load data from an API or CSV source into a relational database.
  • Tech Stack: Python, Pandas, PostgreSQL, Airflow.
  • Key Features:
    • Fetches and cleans data.
    • Loads data into a SQL database.
    • Automates pipeline execution with Apache Airflow.
  • Project Code
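The extract → transform → load flow described above can be sketched in a few lines. This is a minimal stand-in, not the repository's code: it uses the standard library's csv and sqlite3 modules in place of Pandas and PostgreSQL, and all file, table, and column names are illustrative.

```python
import csv
import io
import sqlite3

# Extract: hypothetical raw CSV input (in the real pipeline, an API or file).
RAW_CSV = """id,name,revenue
1, Alice ,1200
2,Bob,
3,Carol,950
"""

def extract(text):
    # Parse CSV rows into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Clean the data: trim whitespace, drop rows with missing revenue.
    cleaned = []
    for row in rows:
        if row["revenue"].strip():
            cleaned.append((int(row["id"]), row["name"].strip(), float(row["revenue"])))
    return cleaned

def load(rows, conn):
    # Load cleaned rows into a SQL table (sqlite3 standing in for PostgreSQL).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, name TEXT, revenue REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 2 rows survive cleaning
```

In the Airflow-automated version, each of the three functions would become a separate task so failures can be retried independently.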

2. Power BI Dashboard with Cloud Data

  • Goal: Connect Power BI to a cloud database and create an interactive dashboard.
  • Tech Stack: Power BI, Azure Synapse, SQL.
  • Key Features:
    • Live data visualization.
    • DAX calculations for KPIs.
    • Automated data refresh.
  • Project Code

3. Cloud Data Engineering with Azure

  • Goal: Process and analyze big data using Azure Data Lake & PySpark.
  • Tech Stack: Azure Data Lake, Databricks, PySpark.
  • Key Features:
    • Stores large datasets in Azure Data Lake.
    • Uses PySpark for transformation.
    • Loads processed data into Azure Synapse.
  • Project Code

4. SQL Optimization & Performance Tuning

  • Goal: Improve query performance using indexing and partitioning.
  • Tech Stack: PostgreSQL, MySQL.
  • Key Features:
    • Benchmarks query execution time.
    • Implements indexing for optimization.
    • Compares before/after performance.
  • Project Code
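The benchmark-then-index-then-compare approach can be demonstrated end to end with the standard library's sqlite3 module (standing in for PostgreSQL/MySQL; table and index names here are illustrative, not the project's):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 5000, float(i)) for i in range(200_000)],
)

def bench(query, args, n=50):
    # Time n repetitions of the query.
    start = time.perf_counter()
    for _ in range(n):
        conn.execute(query, args).fetchall()
    return time.perf_counter() - start

q = "SELECT total FROM orders WHERE customer_id = ?"
before = bench(q, (1234,))  # full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = bench(q, (1234,))   # index lookup

plan = conn.execute("EXPLAIN QUERY PLAN " + q, (1234,)).fetchone()
print(f"before: {before:.3f}s  after: {after:.3f}s")
print(plan)  # the plan should now mention idx_orders_customer
```

The same pattern carries over to PostgreSQL/MySQL with EXPLAIN ANALYZE, where partitioning can also be compared alongside indexing.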

5. Data Orchestration with Apache Airflow

  • Goal: Automate ETL workflows using Apache Airflow.
  • Tech Stack: Python, Airflow, Docker.
  • Key Features:
    • Uses Airflow DAGs to schedule ETL jobs.
    • Monitors pipeline execution.
    • Containerized using Docker.
  • Project Code

Technologies Used

  • Programming: Python, SQL, PySpark
  • Databases: PostgreSQL, MySQL, NoSQL
  • Cloud: Azure (Data Lake, Synapse, Databricks)
  • Orchestration: Apache Airflow, Azure Data Factory
  • Visualization: Power BI, DAX
  • Containerization: Docker, Kubernetes
  • Big Data Tools: Hadoop, Hive, Kafka

Setup Instructions

1. Clone the Repository

git clone https://github.com/erictreacy/data-engineering-portfolio.git
cd data-engineering-portfolio
