
MASD — Monitoring & Analytics of Streaming Data

Built with Python, Docker, Apache Spark, Apache Hadoop, Apache Kafka, and MongoDB.

Project for the Data Intensive Application & Big Data exam, University of Perugia.

A data streaming pipeline that simulates IoT sensors, sends data to Kafka, processes it with Spark on a Hadoop cluster, and finally saves it to MongoDB. The entire environment is containerized with Docker Compose.

Author: Omar Criacci (omar.criacci@student.unipg.it)
Version: 1.0.0

📋 Table of Contents

  • ✅ Prerequisites
  • 🚀 Quickstart
  • 📖 Usage Example
  • 🌐 Web Interfaces
  • ⚙️ Simulator Configuration
  • 🏗️ Architecture
  • 💡 Useful Commands

✅ Prerequisites

  • Docker and Docker Compose
  • jq (optional, for the simulation script)
  • At least 8 GB of RAM (more is better)

🚀 Quickstart

Automatic (Recommended)

  1. Clone the repository:

    git clone https://github.com/omartrj/MASD.git
    cd MASD
  2. Start the entire stack (Kafka, MongoDB, Hadoop, Spark):

    # Minimal setup
    docker compose up -d --build
    
    # To also start the web UIs for Kafka and MongoDB (optional):
    docker compose --profile web-ui up -d --build
    
    # To scale the Hadoop cluster (optional):
    docker compose up -d --build --scale hdfs-datanode=2 --scale yarn-nodemanager=2
    
    # Everything
    docker compose --profile web-ui up -d --build --scale hdfs-datanode=2 --scale yarn-nodemanager=2
  3. Start the simulators:

    ./run_simulation.sh -c simulator/config.json

    Note: To stop the simulators, press CTRL+C. The script will automatically terminate all simulator containers.

Manual Startup (without jq)

If you don't have jq installed (you should get it, it's useful!), you can start the simulators manually:

  1. Build the simulator image:
    docker build -t masd-simulator:latest ./simulator
  2. Load the environment variables:
    source .env
  3. Start each station manually:
    docker run -d \
      --name "simulator-<station_id>" \
      --network "masd-network" \
      -e SIM_STATION_NAME="<station_name>" \
      -e SIM_STATION_ID="<station_id>" \
      -e SIM_NUM_SENSORS="<num_sensors>" \
      -e SIM_INTERVAL_MEAN_MS="<mean_ms>" \
      -e SIM_INTERVAL_STDDEV_PCT="<stddev_pct>" \
      -e SIM_MALFORMED_PCT="<malformation_pct>" \
      -e KAFKA_BOOTSTRAP_SERVERS=$KAFKA_BOOTSTRAP_SERVERS \
      -e KAFKA_TOPIC_PREFIX=$KAFKA_TOPIC_PREFIX \
      masd-simulator:latest
    Replace the values between <> with the desired parameters for the station (see simulator configuration for details).

📖 Usage Example

For a detailed step-by-step guide on how to verify the data flow (from generation to storage), check the usage example.
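
If you just want a quick programmatic sanity check, the sketch below peeks at one station's raw Kafka topic and then counts the documents Spark has written to MongoDB. It assumes the kafka-python and pymongo packages, that Kafka and MongoDB are published to the host on their default ports (9092 and 27017), and a hypothetical database named masd with one collection per station id; adjust these to match your .env and docker-compose.yml.

# verify_flow.py -- quick end-to-end check (sketch; adjust hosts and names to your setup)
import json

from kafka import KafkaConsumer     # pip install kafka-python
from pymongo import MongoClient     # pip install pymongo

STATION_ID = "perugia"              # a station id from simulator/config.json

# 1) Peek at the raw readings the simulator publishes to Kafka.
consumer = KafkaConsumer(
    f"sensors.raw.{STATION_ID}",
    bootstrap_servers="localhost:9092",   # assumes Kafka is published on the host
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,             # give up if nothing arrives in 5 s
)
for i, record in enumerate(consumer):
    print("raw:", record.value.decode("utf-8", errors="replace"))
    if i >= 4:                            # five messages are enough for a check
        break
consumer.close()

# 2) Check that Spark has written aggregated results to MongoDB.
client = MongoClient("mongodb://localhost:27017")   # assumes the default port mapping
collection = client["masd"][STATION_ID]             # hypothetical database name
print("aggregated documents:", collection.count_documents({}))
for doc in collection.find().sort("_id", -1).limit(3):
    print(json.dumps(doc, default=str, indent=2))

If step 1 prints readings but step 2 stays at zero, the Spark job is the first thing to check (see the YARN ResourceManager UI below).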

🌐 Web Interfaces

Monitor the pipeline using these dashboards (if everything started correctly 🤞).

Standard:

  • Hadoop NameNode on localhost:9870: HDFS status and file browser.
  • YARN ResourceManager on localhost:8088: Cluster resources and Spark job status.

Optional (requires --profile web-ui):

  • Web UIs for Kafka and MongoDB (see docker-compose.yml for the exposed ports).
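
The standard dashboards can also be queried from scripts: the YARN ResourceManager exposes a REST API on the same port. A minimal sketch, assuming the port mapping above and the requests package:

# List YARN applications (e.g. the Spark job) via the ResourceManager REST API.
import requests   # pip install requests

resp = requests.get("http://localhost:8088/ws/v1/cluster/apps", timeout=5)
resp.raise_for_status()
apps = resp.json().get("apps") or {}    # "apps" is null when nothing is running
for app in apps.get("app", []):
    print(app["id"], app["name"], app["state"])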

⚙️ Simulator Configuration

The simulators are configured via the simulator/config.json file:

{
    "sensors": {
        "send_interval": {
            "mean_ms": 250,        // Average send interval in milliseconds
            "stddev_pct": 0.2      // Standard deviation of send interval (20%)
        },
        "malformation_pct": 0.05   // Percentage of malformed data (5%)
    },
    "stations": [
        {
            "name": "Perugia",     // Station name
            "id": "perugia",       // Unique ID (used for Kafka topic and MongoDB collection)
            "num_sensors": 3       // Number of sensors for this station
        },
        // ... other stations
    ]
}

Each station is started as a separate Docker container and publishes to a dedicated Kafka topic: sensors.raw.<station_id>.
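
To make these parameters concrete, here is a rough sketch of what one simulated sensor's publish loop could look like. The actual implementation lives in simulator/; the message schema and field names below are illustrative assumptions, and only the environment variable names, the topic naming, and the three config parameters come from this README.

# Illustrative per-sensor publish loop (sketch only; field names are assumptions).
import json
import os
import random
import time

from kafka import KafkaProducer   # pip install kafka-python

station_id = os.environ["SIM_STATION_ID"]                          # e.g. "perugia"
topic = f"{os.environ.get('KAFKA_TOPIC_PREFIX', 'sensors.raw')}.{station_id}"
mean_ms = float(os.environ.get("SIM_INTERVAL_MEAN_MS", 250))
stddev_pct = float(os.environ.get("SIM_INTERVAL_STDDEV_PCT", 0.2))
malformed_pct = float(os.environ.get("SIM_MALFORMED_PCT", 0.05))

producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BOOTSTRAP_SERVERS"].split(","))

while True:
    reading = {                             # hypothetical message schema
        "sensor_id": f"{station_id}-0",
        "timestamp": time.time(),
        "value": random.gauss(20.0, 2.0),   # e.g. a temperature-like signal
    }
    payload = json.dumps(reading).encode()
    if random.random() < malformed_pct:     # emit broken JSON on purpose
        payload = payload[: len(payload) // 2]
    producer.send(topic, payload)

    # Gaussian send interval, e.g. mean 250 ms with a 20% standard deviation.
    interval_ms = max(0.0, random.gauss(mean_ms, mean_ms * stddev_pct))
    time.sleep(interval_ms / 1000.0)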

🏗️ Architecture

The pipeline consists of several components orchestrated by Docker Compose to create a complete data streaming and processing environment.

  • 🤖 Simulator: Python-based producer simulating IoT sensors.
  • 📬 Kafka + ZooKeeper: Distributed message broker for data ingestion.
  • ✨ Spark: Real-time data processing engine running on YARN.
  • 💾 MongoDB: NoSQL database for storing aggregated results.
  • 🐘 Hadoop (HDFS + YARN): Distributed storage and resource management.

For a detailed explanation of the architecture and configuration of each component, please refer to the architecture documentation.
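
To give a feel for the Spark stage, the sketch below shows the general Kafka → Structured Streaming → MongoDB pattern. It is not the project's actual job: the message schema, the window size, the database/collection names, and the use of the MongoDB Spark Connector are all assumptions.

# Sketch of the Kafka -> Spark -> MongoDB pattern (not the project's actual job).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("masd-sketch").getOrCreate()

# Hypothetical message schema; the real simulator payload may differ.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("timestamp", DoubleType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")    # assumed service name
    .option("subscribePattern", "sensors\\.raw\\..*")   # one topic per station
    .load())

parsed = (raw
    .select("topic", F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .where(F.col("r").isNotNull())                      # drops malformed messages
    .select("topic", "r.*")
    .withColumn("event_time", F.col("timestamp").cast("timestamp")))

# Per-station, per-minute aggregates (the window size is an assumption).
aggregated = (parsed
    .withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "1 minute"), "topic")
    .agg(F.avg("value").alias("avg_value"), F.count("*").alias("num_readings")))

def write_to_mongo(batch_df, batch_id):
    # Requires the MongoDB Spark Connector (v10+) on the classpath.
    (batch_df.write.format("mongodb").mode("append")
        .option("connection.uri", "mongodb://mongodb:27017")   # assumed service name
        .option("database", "masd")                            # hypothetical names
        .option("collection", "aggregates")
        .save())

query = (aggregated.writeStream
    .outputMode("append")
    .foreachBatch(write_to_mongo)
    .start())
query.awaitTermination()

Reading with subscribePattern lets a single job consume every sensors.raw.* topic, so adding a station to config.json does not require changing the Spark job.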

💡 Useful Commands

# View logs
docker logs -f <container_name>

# Stop everything
docker compose down

# Stop and remove volumes
docker compose down -v

# (If web UIs were started)
docker compose --profile web-ui down -v
