Project for the Data Intensive Application & Big Data exam, University of Perugia.
A data streaming pipeline that simulates IoT sensors, sends data to Kafka, processes it with Spark on a Hadoop cluster, and finally saves it to MongoDB. The entire environment is containerized with Docker Compose.
Author: Omar Criacci (omar.criacci@student.unipg.it)
Version: 1.0.0
- Prerequisites
- Quick Start
- Usage Example
- Web Interfaces
- Simulator Configuration
- Architecture
- Structure
- Useful Commands

## Prerequisites

- Docker and Docker Compose
- jq (optional, for the simulation script)
- At least 8 GB of RAM (more is better)

## Quick Start

- Clone the repository:

  ```bash
  git clone https://github.com/omartrj/MASD.git
  cd MASD
  ```

- Start the entire stack (Kafka, MongoDB, Hadoop, Spark):

  ```bash
  # Minimal setup
  docker compose up -d --build

  # To also start the web UIs for Kafka and MongoDB (optional):
  docker compose --profile web-ui up -d --build

  # To scale the Hadoop cluster (optional):
  docker compose up -d --build --scale hdfs-datanode=2 --scale yarn-nodemanager=2

  # Everything
  docker compose --profile web-ui up -d --build --scale hdfs-datanode=2 --scale yarn-nodemanager=2
  ```

- Start the simulators:

  ```bash
  ./run_simulation.sh -c simulator/config.json
  ```

  Note: To stop the simulators, press `CTRL+C`. The script will automatically terminate all simulator containers.
If you don't have jq installed (you should get it, it's useful!), you can start the simulators manually:

- Build the simulator image:

  ```bash
  docker build -t masd-simulator:latest ./simulator
  ```
- Load the environment variables:

  ```bash
  source .env
  ```

- Start each station manually, replacing the values between `<>` with the desired parameters for the station (see Simulator Configuration for details):

  ```bash
  docker run -d \
    --name "simulator-<station_id>" \
    --network "masd-network" \
    -e SIM_STATION_NAME="<station_name>" \
    -e SIM_STATION_ID="<station_id>" \
    -e SIM_NUM_SENSORS="<num_sensors>" \
    -e SIM_INTERVAL_MEAN_MS="<mean_ms>" \
    -e SIM_INTERVAL_STDDEV_PCT="<stddev_pct>" \
    -e SIM_MALFORMED_PCT="<malformation_pct>" \
    -e KAFKA_BOOTSTRAP_SERVERS=$KAFKA_BOOTSTRAP_SERVERS \
    -e KAFKA_TOPIC_PREFIX=$KAFKA_TOPIC_PREFIX \
    masd-simulator:latest
  ```
For a detailed step-by-step guide on how to verify the data flow (from generation to storage), check the usage example.
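
For a quick sanity check that messages are actually flowing, you can also consume a few of them straight from Kafka. Below is a minimal sketch using the `kafka-python` package (not part of this repo); it assumes the broker is reachable from the host at `localhost:9092` (check the port mappings in `docker-compose.yml`) and a station with id `perugia`:

```python
# Peek at a few raw messages from one station topic.
# Assumptions: pip install kafka-python, broker exposed on localhost:9092,
# and a station with id "perugia" (adjust to your config.json).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensors.raw.perugia",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,  # stop iterating after 10 s without messages
)

for i, message in enumerate(consumer):
    print(message.value.decode("utf-8", errors="replace"))
    if i >= 4:  # show at most 5 messages
        break

consumer.close()
```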

## Web Interfaces

Monitor the pipeline using these dashboards (if everything started correctly 🤞).
Standard:
- Hadoop NameNode on localhost:9870: HDFS status and file browser.
- YARN ResourceManager on localhost:8088: Cluster resources and Spark job status.
Optional (requires --profile web-ui):
- Kafka UI on localhost:8080 (by Provectus): Manage topics, view messages, and monitor consumers.
- Mongo Express on localhost:8081 (by mongo-express): Admin interface for MongoDB collections.

## Simulator Configuration

The simulators are configured via the `simulator/config.json` file:

```jsonc
{
  "sensors": {
    "send_interval": {
      "mean_ms": 250,        // Average send interval in milliseconds
      "stddev_pct": 0.2      // Standard deviation of the send interval (20% of the mean)
    },
    "malformation_pct": 0.05 // Fraction of malformed messages (5%)
  },
  "stations": [
    {
      "name": "Perugia",     // Station name
      "id": "perugia",       // Unique ID (used for the Kafka topic and MongoDB collection)
      "num_sensors": 3       // Number of sensors for this station
    }
    // ... other stations
  ]
}
```

Each station runs as a separate Docker container and publishes to a dedicated Kafka topic: `sensors.raw.<station_id>`.
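
To make the configuration semantics concrete, here is an illustrative sketch of the send loop a station could implement. It is not the actual simulator code (that lives in `simulator/`), and the payload fields are placeholders:

```python
# Illustrative sketch of the send-loop behaviour implied by config.json.
# NOT the actual simulator: payload shape and value ranges are placeholders.
import json
import random
import time

MEAN_MS = 250            # sensors.send_interval.mean_ms
STDDEV_PCT = 0.2         # sensors.send_interval.stddev_pct (20% of the mean)
MALFORMATION_PCT = 0.05  # sensors.malformation_pct (5% of messages)
STATION_ID = "perugia"   # stations[].id
TOPIC = f"sensors.raw.{STATION_ID}"

def build_message(sensor_id: int) -> str:
    # Hypothetical payload; the real schema is defined by the simulator.
    body = json.dumps({"station": STATION_ID, "sensor": sensor_id,
                       "value": round(random.uniform(10.0, 30.0), 2),
                       "ts": time.time()})
    if random.random() < MALFORMATION_PCT:
        return body[: len(body) // 2]  # truncated JSON stands in for malformed data
    return body

for _ in range(10):  # a real station loops forever and produces to Kafka
    delay_ms = max(0.0, random.gauss(MEAN_MS, MEAN_MS * STDDEV_PCT))
    time.sleep(delay_ms / 1000.0)
    print(TOPIC, build_message(sensor_id=0))
```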

## Architecture

The pipeline consists of several components, orchestrated by Docker Compose, that together form a complete data streaming and processing environment.
- 🤖 Simulator: Python-based producer simulating IoT sensors.
- 📬 Kafka + ZooKeeper: Distributed message broker for data ingestion.
- ✨ Spark: Real-time data processing engine running on YARN.
- 💾 MongoDB: NoSQL database for storing aggregated results.
- 🐘 Hadoop (HDFS + YARN): Distributed storage and resource management.
For a detailed explanation of the architecture and configuration of each component, please refer to the architecture documentation.
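
To give a feel for how the pieces connect, here is a hedged sketch of a Spark Structured Streaming job for the Kafka → MongoDB path. It is not the project's actual job: the payload schema, database and collection names, and connection URIs are placeholder assumptions, and it expects the Kafka source and MongoDB Spark connector (v10+) packages on the classpath.

```python
# Sketch of a Kafka -> Spark -> MongoDB streaming flow (placeholders throughout).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("masd-sketch").getOrCreate()

# Hypothetical reading schema; the real one depends on the simulator payload.
schema = StructType([
    StructField("station", StringType()),
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # in-network address
       .option("subscribePattern", "sensors\\.raw\\..*")  # one topic per station
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("r"))
          .select("r.*")
          .where(F.col("value").isNotNull()))  # drop payloads that failed to parse

# Stream the parsed readings into MongoDB (connector v10+ option names shown).
query = (parsed.writeStream
         .format("mongodb")
         .option("spark.mongodb.connection.uri", "mongodb://mongo:27017")
         .option("spark.mongodb.database", "masd")        # placeholder name
         .option("spark.mongodb.collection", "readings")  # placeholder name
         .option("checkpointLocation", "/tmp/masd-checkpoint")
         .start())

query.awaitTermination()
```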

## Useful Commands

```bash
# View logs
docker logs -f <container_id>

# Stop everything
docker compose down

# Stop and remove volumes
docker compose down -v

# (If web UIs were started)
docker compose --profile web-ui down -v
```
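
To inspect what actually reached MongoDB without starting Mongo Express, a short `pymongo` sketch can list databases and peek at a few documents. Assumptions: `pymongo` is installed, MongoDB is mapped to `localhost:27017`, and no authentication is configured (adjust the URI if your `.env` sets credentials):

```python
# List databases/collections and print a couple of documents from each.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)

for db_name in client.list_database_names():
    db = client[db_name]
    for coll_name in db.list_collection_names():
        count = db[coll_name].estimated_document_count()
        print(f"{db_name}.{coll_name}: ~{count} documents")
        for doc in db[coll_name].find().limit(2):
            print("  ", doc)
```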
{ "sensors": { "send_interval": { "mean_ms": 250, // Average send interval in milliseconds "stddev_pct": 0.2 // Standard deviation of send interval (20%) }, "malformation_pct": 0.05 // Percentage of malformed data (5%) }, "stations": [ { "name": "Perugia", // Station name "id": "perugia", // Unique ID (used for Kafka topic and MongoDB collection) "num_sensors": 3 // Number of sensors for this station }, // ... other stations ] }