🚀 End-to-end real-time clickstream analytics pipeline using Kafka, Spark (Delta Lake), Airflow, DuckDB, and Power BI. Cloud-ready design with Bronze-Silver-Gold architecture, easily portable to Azure (ADLS, Databricks, Synapse).
Flow:
Clickstream Simulator → Kafka → Spark Structured Streaming → Delta Lake (bronze/silver/gold) → Airflow (batch orchestration) → DuckDB (Synapse simulation) → Power BI
- Storage: Local filesystem (simulating Azure Data Lake Gen2 with bronze/silver/gold zones)
- Streaming: Apache Kafka (Docker)
- Processing: Apache Spark (PySpark + Delta Lake)
- Orchestration: Apache Airflow (optional, via pip or Docker)
- Warehouse: DuckDB (simulating Azure Synapse)
- Visualization: Power BI Desktop (free) or Metabase (open-source)
```
clickstream-pipeline/
├─ docker/
│  └─ docker-compose-kafka.yml   # Kafka + Zookeeper setup
├─ airflow/
│  └─ dags/
│     └─ etl_dag.py              # Airflow DAG (optional)
├─ src/
│  ├─ data_generator.py          # Kafka producer (simulated clickstream)
│  ├─ spark_streaming.py         # Streaming ingestion → Delta (bronze)
│  ├─ batch_transform.py         # Batch ETL (bronze → silver → gold)
│  └─ export_to_duckdb.py        # Export gold data to DuckDB/CSV
├─ data_lake/                    # Local "ADLS" (bronze/silver/gold zones)
└─ README.md
```
```bash
git clone https://github.com/yourusername/clickstream-pipeline.git
cd clickstream-pipeline
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install kafka-python pyspark delta-spark duckdb pandas
```
- Install Docker & Docker Compose
- Install Java 11 JDK (required by Spark)
- (Optional) Install Airflow for orchestration
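If you opt into Airflow, `airflow/dags/etl_dag.py` can stay small. A minimal sketch (assuming Airflow 2.x) that chains the batch transform and the DuckDB export:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch of airflow/dags/etl_dag.py: run the batch ETL and the DuckDB
# export hourly. bash_command paths assume the repo root is the working
# directory of the Airflow worker.
with DAG(
    dag_id="clickstream_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    transform = BashOperator(
        task_id="batch_transform",
        bash_command="python src/batch_transform.py",
    )
    export = BashOperator(
        task_id="export_to_duckdb",
        bash_command="python src/export_to_duckdb.py",
    )
    transform >> export  # export runs only after the transform succeeds
```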
```bash
cd docker
docker-compose -f docker-compose-kafka.yml up -d
```
Create topic:
```bash
docker exec -it $(docker ps --filter "ancestor=bitnami/kafka" -q) \
  /opt/bitnami/kafka/bin/kafka-topics.sh \
  --create --topic clickstream-events \
  --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
```
```bash
python src/data_generator.py
```
➡️ Continuously sends synthetic user clickstream events to Kafka.
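For orientation, a producer along these lines is enough to drive the pipeline. The event fields are illustrative and may differ from what `data_generator.py` actually emits:

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

# Serialize each event dict as UTF-8 JSON before sending to Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PAGES = ["/home", "/products", "/cart", "/checkout"]  # illustrative pages

while True:
    event = {
        "event_id": str(uuid.uuid4()),
        "user_id": random.randint(1, 1000),
        "page": random.choice(PAGES),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("clickstream-events", event)  # topic created above
    time.sleep(0.1)  # roughly 10 events per second
```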
```bash
python src/spark_streaming.py
```
➡️ Consumes Kafka events → stores raw data in Delta Lake (bronze) under `./data_lake/bronze`.
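A minimal sketch of the streaming job, assuming the pip-installed `delta-spark` helper; the Kafka connector coordinate below is an assumption and must match your PySpark version:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

builder = (
    SparkSession.builder.appName("clickstream-bronze")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# Adjust the Kafka connector version to match your installed PySpark.
spark = configure_spark_with_delta_pip(
    builder,
    extra_packages=["org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0"],
).getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream-events")
    .option("startingOffsets", "earliest")
    .load()
)

# Bronze keeps the payload as an opaque string; parsing happens downstream.
bronze = raw.select(
    col("value").cast("string").alias("raw_event"),
    col("timestamp").alias("ingest_time"),
)

(bronze.writeStream.format("delta")
    .option("checkpointLocation", "./data_lake/_checkpoints/bronze")
    .outputMode("append")
    .start("./data_lake/bronze")
    .awaitTermination())
```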
```bash
python src/batch_transform.py
```
➡️ Cleans and aggregates events, creating session summary tables in the gold zone.
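A sketch of what the batch step might do, reusing the hypothetical event fields from the producer sketch above. The per-user-per-day "session" is a deliberate simplification; real sessionization would split on inactivity gaps:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

builder = (
    SparkSession.builder.appName("clickstream-batch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Bronze → silver: parse the raw JSON payload and drop malformed rows.
schema = "event_id string, user_id int, page string, event_time timestamp"
silver = (
    spark.read.format("delta").load("./data_lake/bronze")
    .select(F.from_json("raw_event", schema).alias("e"))
    .select("e.*")
    .dropna(subset=["event_id", "user_id"])
)
silver.write.format("delta").mode("overwrite").save("./data_lake/silver")

# Silver → gold: one summary row per user per day (simplified sessions).
gold = (
    silver.groupBy("user_id", F.to_date("event_time").alias("day"))
    .agg(
        F.count("*").alias("page_views"),
        (F.max("event_time").cast("long") - F.min("event_time").cast("long"))
        .alias("session_seconds"),
    )
)
gold.write.format("delta").mode("overwrite").save(
    "./data_lake/gold/session_summary")
```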
```bash
python src/export_to_duckdb.py
```
➡️ Loads curated data into DuckDB and exports a sample CSV (`session_sample.csv`).
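One possible shape for the export. Globbing the Delta table's Parquet files directly is a demo shortcut: after repeated overwrites, Delta retains logically deleted files until a VACUUM, so a production setup would use a Delta-aware reader instead:

```python
import duckdb

con = duckdb.connect("./data_lake/clickstream.duckdb")

# Load the gold zone into a DuckDB table (demo shortcut: reads the Delta
# table's Parquet data files directly, ignoring the transaction log).
con.execute("""
    CREATE OR REPLACE TABLE session_summary AS
    SELECT * FROM read_parquet('./data_lake/gold/session_summary/*.parquet')
""")

# Export a sample for Power BI / Metabase.
con.execute("""
    COPY (SELECT * FROM session_summary LIMIT 1000)
    TO 'session_sample.csv' (HEADER, DELIMITER ',')
""")
```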
- Power BI Desktop (Windows):
  - Open `session_sample.csv`
  - Build dashboards: DAU, page views, session duration, funnel conversion (see the query sketch below)
- Alternative (cross-platform):
  - Run Metabase via Docker and connect to DuckDB/CSV
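To sanity-check the numbers before building visuals, a quick DuckDB query works too (column names follow the export sketch above):

```python
import duckdb

con = duckdb.connect("./data_lake/clickstream.duckdb")

# Daily active users, page views, and average session duration per day.
print(con.execute("""
    SELECT day,
           COUNT(DISTINCT user_id) AS dau,
           SUM(page_views)         AS page_views,
           AVG(session_seconds)    AS avg_session_seconds
    FROM session_summary
    GROUP BY day
    ORDER BY day
""").fetchdf())
```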
```bash
# Terminal 1 - start Kafka
docker-compose -f docker/docker-compose-kafka.yml up -d

# Terminal 2 - run Spark streaming (keep running)
python src/spark_streaming.py

# Terminal 3 - run clickstream generator
python src/data_generator.py

# Terminal 4 - batch ETL + export
python src/batch_transform.py
python src/export_to_duckdb.py
```
Then open CSV/Parquet in Power BI or Metabase to build dashboards 🎉
- Replace local `data_lake/` → Azure Data Lake Gen2 (ADLS Gen2)
- Replace DuckDB → Azure Synapse Analytics
- Deploy Kafka & Airflow → Azure Kubernetes Service (AKS)
- Run Spark jobs → Azure Databricks
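The code changes are mostly path-level. As a sketch, the Spark jobs would swap local paths for ADLS Gen2 URIs (the account and container names below are placeholders):

```python
# On Databricks, `spark` is the built-in session and the local zone paths
# become ADLS Gen2 URIs; the rest of each job is unchanged.
BRONZE = "abfss://lake@youraccount.dfs.core.windows.net/bronze"
GOLD = "abfss://lake@youraccount.dfs.core.windows.net/gold/session_summary"

bronze = spark.read.format("delta").load(BRONZE)
```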