EcoPulse is a cloud-native data engineering project designed to analyze key economic indicators through an end-to-end batch data pipeline. The pipeline automates the ingestion, processing, and storage of economic data in a structured, query-optimized format. This solution provides seamless analysis through an interactive dashboard, enabling data-driven decision-making across industries such as finance, policy, and business strategy.
In today's fast-paced world, monitoring and analyzing key economic indicators, such as stock market performance, inflation, and housing prices, is critical for making informed decisions in finance, policymaking, and business. However, gathering this data is often tedious and time-consuming, especially when it involves manually pulling information from multiple sources.
To address this problem, EcoPulse automates the workflow by:

- Extracting economic data from the Federal Reserve Economic Data (FRED) Python API.
- Storing raw data in GCS as a data lake.
- Processing data with Apache Spark on Dataproc.
- Storing structured, partitioned, and clustered data in BigQuery for analytics.
- Visualizing insights through an interactive dashboard in Looker Studio.
Here are the key economic indicators processed:
- Financial Markets: S&P 500 Index, 10-Year Treasury Yield, VIX (Volatility Index)
- Interest Rates: Federal Funds Rate
- Inflation & Price Levels: Consumer Price Index (CPI-U)
- Labor Market: Labor Force Participation Rate
- Economic Activity: Industrial Production Index
- Housing Market: House Price Index (Case-Shiller National Home Price Index)
- **Cloud-Native & Infrastructure as Code (IaC):** leverages cloud-based services to ensure scalability and reliability. Terraform is used as the IaC tool to automate cloud resource provisioning.
- **Batch Data Pipeline with Workflow Orchestration:** follows a structured batch processing workflow, orchestrated with Kestra to automate data ingestion and loading into GCS data lake storage.
- **Optimized Data Warehouse:** the data warehouse is structured in BigQuery, where tables are partitioned and clustered to enhance query performance and minimize costs.
- **Transformations with Spark:** the data is transformed using Apache Spark on a Dataproc cluster, ensuring efficient handling of large datasets and preparing them for downstream analytics.
- **Interactive Dashboard:** built with Looker Studio to provide insights into economic trends.
EcoPulse leverages Terraform to provision cloud resources efficiently. This ensures infrastructure as code (IaC) best practices, making deployments reproducible, scalable, and maintainable.
- Terraform
- Google Cloud SDK (`gcloud`)
- A GCP Service Account with the required permissions
Authenticate using your GCP Service Account:

```bash
source Terraform/setup.sh
```

Then deploy the infrastructure:

```bash
cd Terraform
./deploy.sh
```

EcoPulse leverages Kestra for workflow orchestration, automating data retrieval via the FRED API and uploading it to GCS.
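As a rough illustration of what `deploy.sh` provisions, the snippet below is a minimal Terraform sketch of a data-lake bucket and a warehouse dataset. The resource names and region are assumptions for this sketch, not the project's actual configuration:

```hcl
# Minimal sketch: a GCS data-lake bucket and a BigQuery dataset.
# All names and the region are illustrative assumptions.
provider "google" {
  project = var.project_id
  region  = "us-central1"
}

resource "google_storage_bucket" "data_lake" {
  name          = "ecopulse-data-lake"   # assumed bucket name
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "warehouse" {
  dataset_id = "ecopulse_bq_dw"          # dataset name referenced later in this README
  location   = "US"
}
```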
- Docker & Docker Compose
- A FRED API Key (Get one from the FRED website)
- A GCP Service Account with the required permissions
Run the following command to start Kestra using Docker Compose:

```bash
cd Kestra
docker-compose up
```

After the Kestra UI loads, run the following two flows:
The `set_kv.yaml` flow configures the following project variables:

- `gcp_project_id`
- `gcp_location`
- `gcp_bucket_name`
The `data_load_gcs.yaml` flow orchestrates the entire ingestion pipeline:

- Fetches data from the FRED API in Python and saves it as CSV files.
- Uploads the CSVs to the specified GCS bucket.
- Purges temporary files to keep the workflow clean.
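The fetch-and-save step can be sketched in plain Python. This is a minimal illustration assuming FRED's public `fred/series/observations` JSON endpoint; the helper names are hypothetical and not the project's actual code:

```python
import csv
import io
import json
import urllib.parse
import urllib.request

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"


def fetch_observations(series_id: str, api_key: str) -> list[dict]:
    """Fetch all observations for a FRED series as a list of dicts."""
    params = urllib.parse.urlencode(
        {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    )
    with urllib.request.urlopen(f"{FRED_URL}?{params}") as resp:
        return json.load(resp)["observations"]


def observations_to_csv(series_id: str, observations: list[dict]) -> str:
    """Render observations as CSV text, skipping FRED's '.' missing-value marker."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["date", series_id])
    for obs in observations:
        if obs["value"] != ".":
            writer.writerow([obs["date"], obs["value"]])
    return buf.getvalue()
```

The resulting CSV text would then be written to a temporary file and handed to the flow's GCS upload task.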
Note: the service account credentials are configured using a secret, and the FRED API key is set through the KV Store.
Kestra Topology Diagram:
EcoPulse leverages Apache Spark for scalable and efficient data transformations. Spark processes economic data stored in GCS bucket, transforms it, and then loads the transformed data into BigQuery for downstream analysis.
Depending on the Spark environment (local vs. cloud), ensure you have the following installed:

- Python 3.12 (or your preferred version)
- Apache Spark
- Google Cloud SDK (`gcloud`)
- A Service Account with the required permissions
- Required JARs for Spark on Dataproc:
  - `gcs-connector-hadoop3-latest.jar`
  - `spark-bigquery-with-dependencies_2.12-0.24.0.jar`
The Spark transformation job performs the following steps:

1. Load raw data from GCS into Spark DataFrames.
2. Filter each DataFrame to the last 10 years of economic data.
3. For the daily-level series (`SP500`, `DGS10`, `VIXCLS`, `EFFR`), merge on the date column.
4. Add a categorical column that indicates the level of daily change in the `SP500` index.
5. Add `Month`, `Year`, and `Year-Month` columns.
6. Process the monthly-level series similarly (steps 3 and 5).
7. Merge the daily and monthly data on `Year-Month` to produce the final processed table.
8. Load the transformed data into BigQuery as the data warehouse.
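Step 4's categorical column can be sketched as a plain Python function. The real job applies equivalent logic on Spark DataFrames (e.g. via `when`/`otherwise` expressions), and the bucket thresholds below are illustrative assumptions, not the project's actual cutoffs:

```python
def sp500_daily_change_category(pct_change: float) -> str:
    """Bucket an SP500 daily percent change into a coarse category.

    Thresholds are illustrative assumptions for this sketch.
    """
    if pct_change <= -2.0:
        return "large_drop"
    if pct_change < -0.5:
        return "moderate_drop"
    if pct_change <= 0.5:
        return "stable"
    if pct_change < 2.0:
        return "moderate_gain"
    return "large_gain"
```

Materialized as a column, this category is what the warehouse later clusters on (`SP500_daily_change_category`).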
Step 1: Create a Dataproc cluster through `create_dataproc_cluster.sh`:

```bash
cd "Spark/Dataproc Scripts"
chmod +x create_dataproc_cluster.sh
./create_dataproc_cluster.sh
```

Step 2: Submit the Spark job to Dataproc through `submit_dataproc_job.sh`:

```bash
chmod +x submit_dataproc_job.sh
./submit_dataproc_job.sh
```

EcoPulse leverages BigQuery as the cloud data warehouse, ensuring efficient storage and analytical querying of economic data. The data is structured, partitioned/clustered, and optimized to support high-performance queries while minimizing costs.
To enhance query performance and reduce costs, the main dataset is:

- Partitioned by `date`: improves query efficiency when filtering on a range of dates.
- Clustered by `SP500_daily_change_category` (optional, for further optimization): groups data by this categorical column within each partition.
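The DDL in the project's SQL files likely follows BigQuery's standard `PARTITION BY` / `CLUSTER BY` syntax. A hypothetical sketch that builds such a statement (the target and source table names are assumptions; only the dataset name appears elsewhere in this README):

```python
def build_optimized_table_ddl(
    project: str,
    dataset: str = "ecopulse_bq_dw",          # dataset name used in this README
    table: str = "economic_indicators_opt",   # assumed target table name
    source: str = "economic_indicators",      # assumed source table name
) -> str:
    """Build a BigQuery DDL statement partitioned by date and clustered by category."""
    return (
        f"CREATE OR REPLACE TABLE `{project}.{dataset}.{table}`\n"
        f"PARTITION BY date\n"
        f"CLUSTER BY SP500_daily_change_category AS\n"
        f"SELECT * FROM `{project}.{dataset}.{source}`;"
    )
```

The generated statement can be pasted into the BigQuery console or piped to `bq query`, just like the project's SQL files.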
Option 1: Use the `bq` CLI:

```bash
cd BigQuery
bq query --use_legacy_sql=false -q < create_partitioned_table.sql
```

Option 2: Use the Query Editor in the BigQuery console and run the SQL query in `create_partitioned_table.sql`.
Option 1: Use the `bq` CLI:

```bash
cd BigQuery
bq query --use_legacy_sql=false -q < create_partitioned_clustered_table.sql
```

Option 2: Use the Query Editor in the BigQuery console and run the SQL query in `create_partitioned_clustered_table.sql`.
- Partitioning (`date`): BigQuery first divides the data into partitions based on the `date` column.
- Clustering (`SP500_daily_change_category`): within each partition, BigQuery physically sorts and groups data by `SP500_daily_change_category`.
- Query optimization:
  - Queries filtering by `date` scan only the relevant partitions.
  - Queries filtering by `SP500_daily_change_category` within a partition run even faster due to clustering.
EcoPulse leverages Looker Studio to create an interactive dashboard that visualizes key economic indicators. The dashboard provides near real-time insights into economic trends, enabling data-driven decision-making.
Before building the dashboard, ensure you have:
- A Google Cloud account with access to BigQuery.
- A BigQuery dataset (`ecopulse_bq_dw`) containing processed economic data that Looker Studio can connect to.
- Date Range Controls
- Daily Trending of Financial Market Signals
- Distribution of SP500 Daily Change Category
- Monthly Trending of Inflation and Housing Signals
- Table of Labor Market and Economic Activity Signals
Link to the Looker Studio Dashboard here.
- Add additional economic indicators (e.g., international trade data)
- Automate Spark job execution via Kestra
- Implement scheduled workflows for daily updates
- Expand dashboard visualizations with predictive analytics
- Replicate the architecture on Azure and AWS
EcoPulse is built using:
- FRED API for the source of economic data
- Google Cloud Platform (GCS, BigQuery, Dataproc)
- Kestra for workflow orchestration
- Apache Spark for scalable data transformations
- Looker Studio for visualization
This project was created as part of the Data Engineering Zoomcamp 2025 course. Special thanks to DataTalks.Club for the learning opportunity.



