Skip to content

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra.

Notifications You must be signed in to change notification settings

tranhuy25/e2e-data-engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Realtime Data Streaming | End-to-End Data Engineering Project

Table of Contents

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.

The project is designed with the following components:

  • Data Source: We use randomuser.me API to generate random user data for our pipeline.
  • Apache Airflow: Responsible for orchestrating the pipeline and storing fetched data in a PostgreSQL database.
  • Apache Kafka and Zookeeper: Used for streaming data from PostgreSQL to the processing engine.
  • Control Center and Schema Registry: Helps in monitoring and schema management of our Kafka streams.
  • Apache Spark: For data processing with its master and worker nodes.
  • Cassandra: Where the processed data will be stored.

What You'll Learn

  • Setting up a data pipeline with Apache Airflow
  • Real-time data streaming with Apache Kafka
  • Distributed synchronization with Apache Zookeeper
  • Data processing techniques with Apache Spark
  • Data storage solutions with Cassandra and PostgreSQL
  • Containerizing your entire data engineering setup with Docker

Technologies

  • Apache Airflow
  • Python
  • Apache Kafka
  • Apache Zookeeper
  • Apache Spark
  • Cassandra
  • PostgreSQL
  • Docker

Getting Started

  1. Clone the repository:

    git clone https://github.com/tranhuy25/e2e-data-engineering.git
  2. Navigate to the project directory:

    cd End-to-End Data Engineering
  3. Run Docker Compose to spin up the services:

    docker-compose up

About

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published