Airflow ETL Pipeline with PySpark and Google Cloud Dataproc

This project implements an ETL pipeline that uses Apache Airflow to automate data processing on Google Cloud Platform (GCP). The pipeline performs daily analyses on vehicle data, dynamically handling weekday and weekend processing with Google Cloud Dataproc clusters and storing results in Google Cloud Storage (GCS). The DAG uploads PySpark scripts to GCS, creates a Dataproc cluster, processes data based on the day type, and cleans up resources after processing.

Table of Contents

  • Overview
  • Key Features
  • Key Components
  • Pipeline Structure
  • Prerequisites
  • Airflow Connections
  • Usage

Overview

The pipeline automates data ingestion and processing tasks in GCP, leveraging Airflow's scheduling capabilities and GCP services such as Dataproc and GCS. It branches based on whether the run falls on a weekday or a weekend, executing separate Spark jobs accordingly.
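
A minimal sketch of that branch decision, assuming hypothetical downstream task IDs weekday_analysis and weekend_analysis (the project's actual task names may differ):

    from airflow.operators.python import BranchPythonOperator

    def choose_branch(**context):
        # The run's logical date decides the route: weekday() is 0-4 for Mon-Fri, 5-6 for Sat-Sun.
        if context["execution_date"].weekday() < 5:
            return "weekday_analysis"   # hypothetical weekday task ID
        return "weekend_analysis"       # hypothetical weekend task ID

    # Inside the DAG definition:
    branch = BranchPythonOperator(
        task_id="branch_weekday_weekend",
        python_callable=choose_branch,
    )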

Key Features

  • Dynamic Task Scheduling: Configured to run daily, adjusting tasks based on the day of the week.
  • PySpark Workflows: Processes data with specific PySpark scripts for weekend and weekday analyses.
  • Google Cloud Integration: Utilizes GCS for storage and Dataproc for Spark job execution.

Key Components

  • Airflow DAG: Defines the workflow, with tasks including file upload, Dataproc cluster management, and job submission.
  • Branching Logic: A BranchPythonOperator determines if it’s a weekday or weekend, allowing for targeted analyses.
  • Dataproc: Dynamically created and deleted for job execution, optimized for cost efficiency (see the sketch after this list).
  • GCS: Stores input scripts and job outputs.
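
The ephemeral-cluster pattern described above might be expressed as follows; the project ID, region, cluster name, and machine sizes are illustrative assumptions rather than the project's actual settings:

    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
    )

    CLUSTER_CONFIG = {
        # Small single-master / two-worker layout; sizes are placeholders.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    }

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="your-gcp-project",       # assumption: replace with your project ID
        region="us-central1",                # assumption: any Dataproc region works
        cluster_name="vehicle-etl-cluster",  # assumption: illustrative name
        cluster_config=CLUSTER_CONFIG,
        gcp_conn_id="gcp_default",
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id="your-gcp-project",
        region="us-central1",
        cluster_name="vehicle-etl-cluster",
        gcp_conn_id="gcp_default",
        trigger_rule="all_done",  # tear the cluster down even if an upstream job fails
    )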

Pipeline Structure

  • Upload Tasks: Uploads all local PySpark scripts from the specified directory to a GCS bucket.
  • Cluster Creation: Creates a Dataproc cluster to run PySpark jobs.
  • Branching: Uses a BranchPythonOperator to identify weekday/weekend, routing to the appropriate jobs.
  • Data Processing: Executes PySpark jobs based on day type:
    • Weekday Jobs: Runs multiple scripts for metrics like average speed, temperature, and tire pressure.
    • Weekend Job: Analyzes gas composition data.
  • Cluster Deletion: Deletes the Dataproc cluster once the jobs complete, freeing resources (the sketch after this list wires these steps together).
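
Wired together, the pipeline skeleton could look roughly like the sketch below. The DAG ID, script names, and GCP project, region, and cluster name are assumptions for illustration; the bucket name and local script directory come from the sections that follow.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import BranchPythonOperator
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )
    from airflow.providers.google.cloud.transfers.local_to_gcs import (
        LocalFilesystemToGCSOperator,
    )

    BUCKET = "airflow-bigquery-project"      # bucket named under Airflow Connections
    PROJECT = "your-gcp-project"             # assumption: replace with your project ID
    REGION = "us-central1"                   # assumption
    CLUSTER = "vehicle-etl-cluster"          # assumption: illustrative cluster name

    def pyspark_job(script: str) -> dict:
        # Dataproc job spec pointing at a PySpark script uploaded to GCS.
        return {
            "reference": {"project_id": PROJECT},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/pyspark/{script}"},
        }

    def choose_branch(**context):
        # Weekdays (Mon-Fri) run the weekday metrics; weekends run gas composition.
        if context["execution_date"].weekday() < 5:
            return "avg_speed_job"
        return "gas_composition_job"

    with DAG(
        dag_id="vehicle_data_etl",           # assumption: illustrative DAG ID
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        upload_scripts = LocalFilesystemToGCSOperator(
            task_id="upload_scripts",
            src="/usr/local/airflow/include/data/pyspark/*.py",
            dst="pyspark/",                  # dst must end with "/" when src is a glob
            bucket=BUCKET,
            gcp_conn_id="gcp_default",
        )

        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            project_id=PROJECT,
            region=REGION,
            cluster_name=CLUSTER,
            cluster_config={"master_config": {"num_instances": 1},
                            "worker_config": {"num_instances": 2}},
            gcp_conn_id="gcp_default",
        )

        branch = BranchPythonOperator(
            task_id="branch_weekday_weekend",
            python_callable=choose_branch,
        )

        # The real DAG submits several weekday scripts (speed, temperature, tire
        # pressure); one is shown here for brevity. Script names are assumptions.
        avg_speed_job = DataprocSubmitJobOperator(
            task_id="avg_speed_job",
            job=pyspark_job("average_speed.py"),
            region=REGION,
            project_id=PROJECT,
            gcp_conn_id="gcp_default",
        )

        gas_composition_job = DataprocSubmitJobOperator(
            task_id="gas_composition_job",
            job=pyspark_job("gas_composition.py"),
            region=REGION,
            project_id=PROJECT,
            gcp_conn_id="gcp_default",
        )

        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            project_id=PROJECT,
            region=REGION,
            cluster_name=CLUSTER,
            gcp_conn_id="gcp_default",
            trigger_rule="all_done",  # clean up even if a branch was skipped or a job failed
        )

        upload_scripts >> create_cluster >> branch
        branch >> [avg_speed_job, gas_composition_job] >> delete_cluster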

Prerequisites

  • A Google Cloud project with Dataproc and GCS enabled and the required IAM permissions.
  • An Apache Airflow environment configured with GCP connections.
  • Python 3.7+

Airflow Connections

Set up the following connections in your Airflow environment:

  • gcp_default: Connection to GCP for Dataproc and GCS.
  • airflow-bigquery-project: GCS bucket for storing PySpark scripts and output data (referenced in the snippet below).
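
As a quick sanity check that both are set up, a short snippet can list the bucket through the connection (the pyspark/ prefix is an assumption about where the scripts land):

    from airflow.providers.google.cloud.hooks.gcs import GCSHook

    # "gcp_default" authenticates the call; "airflow-bigquery-project" is the
    # bucket that holds the uploaded scripts and job outputs.
    hook = GCSHook(gcp_conn_id="gcp_default")
    print(hook.list(bucket_name="airflow-bigquery-project", prefix="pyspark/"))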

Usage

  • Place PySpark scripts in the local directory specified (/usr/local/airflow/include/data/pyspark).
  • Deploy the DAG file to your Airflow environment.
  • Trigger the DAG or set it to run on a schedule.
  • Review logs and outputs in the Airflow UI and GCS bucket.
