This is a Docker-containerized ELT data pipeline that extracts and loads data in monthly batches into a BigQuery data warehouse, where it is transformed and prepared for BI reporting.
The dataset comes from the monthly TLC Trip Record Data for NYC Green and Yellow taxi trips from 2019 to 2021 and consists of over 200 million records.
- Cloud - Google Cloud Platform
- Infrastructure as Code - Terraform
- Containerization - Docker, Docker Compose
- Batch Processing - Apache Spark
- Transformation - dbt
- Orchestration - Airflow
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Tableau Desktop
- Language - Python
The dashboard is now live and can be accessed on Tableau Public. The SQL queries for the cleaning and transformation steps are in `BigQuery-ELT/dbt/models`.
WARNING: You will be charged for the infrastructure you set up. You can get $300 in credit by creating a new account on Google Cloud Platform (GCP).
- Google Cloud Platform account
- A GCP project with the project ID `<your-gcp-project-id>`
- The environment variable `GOOGLE_APPLICATION_CREDENTIALS` set, pointing to your Google Service Account JSON key file. The service account should have the Storage Admin, Storage Object Admin, and BigQuery Admin roles.
- `gcloud` SDK installed. To authenticate, run `gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS`
- Terraform
- Docker, Docker Compose
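If the credentials variable is not set yet, a minimal setup could look like the following; the key file path is only a placeholder for wherever your service account key lives:

```bash
# Point GOOGLE_APPLICATION_CREDENTIALS at your service account key (path is an example)
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.google/credentials/service-account.json"

# Authenticate the gcloud SDK with that key
gcloud auth activate-service-account --key-file "$GOOGLE_APPLICATION_CREDENTIALS"
```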
The following steps will help you spin up the required infrastructure, which includes one GCS bucket and two BigQuery datasets.
cd BigQuery-ELT/terraform
terraform init
You will be asked to enter two values. For the GCP Project ID, enter `<your-gcp-project-id>`. For the GCS bucket, enter any name that is unique across GCS; a good idea is to append your GCP project ID to your desired name to make it unique. We will refer to the entered bucket name as `<your-gcs-bucket-name>`.
terraform plan
Terraform plan should show the plan for creating the following services:
- `google_storage_bucket.bucket` - the data lake bucket, used to store raw parquet files in monthly batches
- `google_bigquery_dataset.stg_dataset` - the `staging` dataset for raw data
- `google_bigquery_dataset.prod_dataset` - the `prod` dataset for BI reporting
terraform apply
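To skip the interactive prompts, the same two values can be passed on the command line. This is only a sketch that assumes the Terraform variables are named `project` and `gcs_bucket_name`; check `terraform/variables.tf` for the actual names:

```bash
# Supply the two inputs non-interactively (variable names are assumptions; see variables.tf)
terraform apply \
  -var="project=<your-gcp-project-id>" \
  -var="gcs_bucket_name=<your-gcs-bucket-name>"
```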
You should now see the above bucket and datasets in your GCP project.
Once done, set the environment variables `GCP_PROJECT_ID` and `GCP_GCS_BUCKET` to `<your-gcp-project-id>` and `<your-gcs-bucket-name>` respectively.
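For example, using the placeholders from the Terraform step:

```bash
# Project and data lake bucket created in the Terraform step
export GCP_PROJECT_ID="<your-gcp-project-id>"
export GCP_GCS_BUCKET="<your-gcs-bucket-name>"
```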
cd BigQuery-ELT/airflow
echo -e "AIRFLOW_UID=$(id -u)" > .env
docker compose up
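If you prefer to keep the terminal free, the stack can also be started in detached mode; these are standard Docker Compose flags:

```bash
# Optional: start the Airflow services in the background and check container status
docker compose up -d
docker compose ps
```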
At this point, the Airflow webserver UI should be available at `localhost:8080`. Don't forget to forward port `8080` if you are running on a VM. You can log in with user `airflow` and password `airflow` and trigger the pipeline DAG.
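If Airflow is running on a remote VM, one way to reach the UI locally is an SSH tunnel; the user and host below are placeholders. The DAG can also be triggered from the CLI, though the webserver service name and DAG id shown are assumptions about this repo's setup:

```bash
# Forward the VM's port 8080 to your local machine (user/host are placeholders)
ssh -L 8080:localhost:8080 <user>@<vm-external-ip>

# Alternatively, trigger the DAG from the CLI
# (service and DAG names are assumptions; check docker-compose.yaml and the dags/ folder)
docker compose exec airflow-webserver airflow dags trigger <dag-id>
```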
docker compose down
cd ../terraform/ && terraform destroy
- Add more data quality tests in dbt
- Include CI/CD