EcoPulse: End-to-End Data Pipeline for Economic Insights

Project Overview

EcoPulse is a cloud-native data engineering project designed to analyze key economic indicators through an end-to-end batch data pipeline. The pipeline automates the ingestion, processing, and storage of economic data in a structured, query-optimized format. This solution provides seamless analysis through an interactive dashboard, enabling data-driven decision-making across industries such as finance, policy, and business strategy.

🎯 Problem Statement

In today's fast-paced world, monitoring and analyzing key economic indicators—such as stock market, inflation, housing prices, and so on— is critical for making informed decisions in finance, policymaking, and business. However, gathering this data can often be a tedious and time-consuming task, especially when relying on manually pulling information from various sources.

To address this problem, EcoPulse automates the workflow by:

Extracting economic data from the Federal Reserve Economic Data (FRED) Python API.
Storing raw data in GCS as a data lake.
Processing data with Apache Spark on Dataproc.
Storing structured, partitioned, and clustered data in BigQuery for analytics.
Visualizing insights through an interactive dashboard in Looker Studio.

Here are the key economic indicators processed:

Financial Markets: S&P 500 Index, 10-Year Treasury Yield, VIX (Volatility Index)
Interest Rates: Federal Funds Rate
Inflation & Price Levels: Consumer Price Index (CPI-U)
Labor Market: Labor Force Participation Rate
Economic Activity: Industrial Production Index
Housing Market: House Price Index (Case-Shiller National Home Price Index)

🏗️ Architecture and Highlights

Solution Highlights:

✅ Cloud-Native & Infrastructure as Code (IaC): leverages cloud-based services to ensure scalability and reliability. Terraform is used as IaC tool to automate cloud resource provisioning.

✅ Batch Data Pipeline with Workflow Orchestration: follows a structured batch processing workflow, orchestrated using Kestra to automate data ingestion and load to GCS data lake storage.

✅ Optimized Data Warehouse: The data warehouse is structured in BigQuery, where tables are partitioned and clustered to enhance query performance and minimize costs.

✅ Transformations with Spark: The data undergoes transformations using Apache Spark and Dataproc Cluster, ensuring efficient handling of large datasets and preparing them for downstream analytics.

✅ Interactive Dashboard: built using Looker Studio to provide insights into economic trends.

🚀 Infrastructure Setup with Terraform

EcoPulse leverages Terraform to provision cloud resources efficiently. This ensures infrastructure as code (IaC) best practices, making deployments reproducible, scalable, and maintainable.

🛠️ Prerequisites

Terraform
Google Cloud SDK (gcloud)
A GCP Service Account with the required permissions

🔑 Authentication

Authenticate using your GCP Service Account:

source Terraform/setup.sh

🚀 Deploying Infrastructure

cd Terraform
./deploy.sh

🔄 Ingestion with Kestra & Python

EcoPulse leverages Kestra for workflow orchestration, automating data retrieval via the FRED API and uploading it to GCS.

🛠️ Prerequisites

Docker & Docker Compose
A FRED API Key (Get one from the FRED website)
A GCP Service Account with the required permissions

📌 Start Kestra

Run the following command to start Kestra using Docker Compose:

cd Kestra
docker-compose up

After Kestra UI is loaded, we can run two following flows:

🔑 set_kv: Configures Environment Variables

The flow (set_kv.yaml) configure the following project variables:

gcp_project_id
gcp_location
gcp_bucket_name

⚡ data_load_gcs: Fetches data, saves as CSV, and uploads to GCS

The data_load_gcs.yaml flow orchestrates the entire ingestion pipeline:

Fetches data from the FRED API in Python and saves as CSV files
Uploads the CSVs to the specified GCS bucket.
Purges temporary files to keep the workflow clean.

🔑 Note that the service account creds is configured using a secret and the FRED API Key was set through the KV Store.

Kestra Topology Diagram:

⚡ Processing with Spark

EcoPulse leverages Apache Spark for scalable and efficient data transformations. Spark processes economic data stored in GCS bucket, transforms it, and then loads the transformed data into BigQuery for downstream analysis.

🛠️ Prerequisites

Depending on the Spark environment (local vs cloud), ensure you have the following installed:

Running Locally:

Python 3.12 (or your preferred version)
Apache Spark

Running through GCP Dataproc Cluster:

Google Cloud SDK (gcloud)
A service Account with the required permissions
Required JARs for Spark on Dataproc:
- gcs-connector-hadoop3-latest.jar
- spark-bigquery-with-dependencies_2.12-0.24.0.jar

🔄 Transformation Steps

The Spark transformation job performs the following steps:

Load raw data from GCS into Spark dataframe
Filter each dataframe to last 10 years of economic data
For the daily level series (SP500, DGS10, VIXCLS, EFFR), merge using the date column.
Add a categorical column that indicates the level of daily change on the SP500 index.
Add Month, Year and Year-Month columns.
Similarly, process the monthly level series (Step 3 and 5).
Merge daily and monthly data on Year-Month for final processed table.
Load transformed data into BigQuery as the data warehouse.

🚀 Running the PySpark Job through Dataproc Cluster

Step 1: Create a dataproc cluster through create_dataproc_cluster.sh:

cd Spark/Dataproc Scripts
chmod +x create_dataproc_cluster.sh
./create_dataproc_cluster.sh

Step 2: Submit Spark job to Dataproc through submit_dataproc_job.sh:

chmod +x submit_dataproc_job.sh
./submit_dataproc_job.sh

🏛️ Data Warehousing in BigQuery

EcoPulse leverages BigQuery as the cloud data warehouse, ensuring efficient storage and analytical querying of economic data. The data is structured, partitioned/clustered, and optimized to support high-performance queries while minimizing costs.

🏗️ Schema Optimization: Partitioning & Clustering

To enhance query performance and reduce costs, the main dataset is:

✅ Partitioned by date → Improves query efficiency by filtering certain range of dates
✅ Clustered (optional for further optimization) → Groups data by categorical column SP500_daily_change_category

📌 Creating the Partitioned Table

Option 1: Use the bq CLI:

cd BigQuery
bq query --use_legacy_sql=false -q < create_partitioned_table.sql

Option 2: Use Query Editor in BigQuery console and run the SQL query in create_partitioned_table.sql.

📌 Creating the Partitioned and Clustered Table

Option 1: Use the bq CLI:

cd BigQuery
bq query --use_legacy_sql=false -q < create_partitioned_clustered_table.sql

Option 2: Use Query Editor in BigQuery console and run the SQL query in create_partitioned_clustered_table.sql.

🔹How Partitioning & Clustering Work Together

Partitioning (date) → First, BigQuery divides the data into partitions based on the date column.
Clustering (SP500_daily_change_category) → Within each partition, BigQuery physically sorts and groups data by SP500_daily_change_category.
Query Optimization:
- Queries filtering by date scan only relevant partitions.
- Queries filtering by SP500_daily_change_category within a partition run even faster due to clustering.

📊 Data Vizualization in Looker Studio

EcoPulse leverages Looker Studio to create an interactive dashboard that visualizes key economic indicators. The dashboard provides near real-time insights into economic trends, enabling data-driven decision-making.

🛠️ Prerequisites

Before building the dashboard, ensure you have:

A Google Cloud account with access to BigQuery.
A BigQuery dataset (ecopulse_bq_dw) containing processed economic data that Looker Studio can connect to.

📈 Dashboard Design & Visualizations

Date Range Controls
Daily Trending of Financial Market Signals
Distribution of SP500 Daily Change Category
Monthly Trending of Inflation and Housing Signals
Table of Labor Market and Economic Activity Signals

Link to the Looker Studio Dashboard here.

🔮 Future Improvement Opportunities

Add additional economic indicators (e.g., international trade data)
Automate Spark job execution via Kestra
Implement scheduled workflows for daily updates
Expand dashboard visualizations with predictive analytics
Replicate the architecture on Azure and AWS

🎉 Acknowledgments

EcoPulse is built using:

FRED API for the source of economic data
Google Cloud Platform (GCS, BigQuery, Dataproc)
Kestra for workflow orchestration
Apache Spark for scalable data transformations
Looker Studio for visualization

This project is created as part of the Data Engineering Zoomcamp 2025 course. Special thanks to DataTalkClub for the learning opportunity.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
BigQuery		BigQuery
Kestra		Kestra
Spark		Spark
Terraform		Terraform
images		images
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

EcoPulse: End-to-End Data Pipeline for Economic Insights

Project Overview

🎯 Problem Statement

🏗️ Architecture and Highlights

Solution Highlights:

🚀 Infrastructure Setup with Terraform

🛠️ Prerequisites

🔑 Authentication

🚀 Deploying Infrastructure

🔄 Ingestion with Kestra & Python

🛠️ Prerequisites

📌 Start Kestra

🔑 set_kv: Configures Environment Variables

⚡ data_load_gcs: Fetches data, saves as CSV, and uploads to GCS

⚡ Processing with Spark

🛠️ Prerequisites

Running Locally:

Running through GCP Dataproc Cluster:

🔄 Transformation Steps

🚀 Running the PySpark Job through Dataproc Cluster

🏛️ Data Warehousing in BigQuery

🏗️ Schema Optimization: Partitioning & Clustering

📌 Creating the Partitioned Table

📌 Creating the Partitioned and Clustered Table

🔹How Partitioning & Clustering Work Together

📊 Data Vizualization in Looker Studio

🛠️ Prerequisites

📈 Dashboard Design & Visualizations

🔮 Future Improvement Opportunities

🎉 Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages