FlightDelay is a real-time flight delay prediction system that uses streaming data processing and machine learning to forecast severe delays (≥60 minutes) before they occur. The system ingests live flight data through Apache Kafka, processes it with Apache Spark Structured Streaming, and applies trained Gradient Boosted Tree models to deliver actionable predictions via an interactive web dashboard.
For detailed guides, technical info, and step-by-step instructions, visit our wiki:
- Home – Project overview, motivation, and key features
- Core Features – Real-time streaming, ML models, and visualizations
- Installation & Setup – Complete local and GCP deployment guide
- Tech Stack – Big data tools, ML frameworks, and libraries
- Project Structure – File organization and architecture
- Our Team – Meet the developers behind FlightDelay
Before starting, ensure you have:
- Python 3.8+ – Download Python
- Apache Kafka 3.5.1 (Scala 2.13 build) – Download Kafka
- Apache Spark 3.5.1 – Download Spark
- Jupyter Lab – Install Jupyter
- GCP Account (for VM deployment) or local machine with 16GB+ RAM
- Dataset: `flight_data.csv` (historical flight records)
Optional:
- VS Code with Live Server Extension
- MongoDB (if using MongoDB storage)
```bash
# Clone the repository
git clone https://github.com/NolanP2003/FlightDelay.git
cd FlightDelay

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install pyspark==3.5.1 pandas numpy matplotlib seaborn kafka-python flask findspark
```

Verify installation:

```bash
python3 -c "import pyspark; print(pyspark.__version__)"
# Should output: 3.5.1
```

On Linux/Ubuntu:
```bash
# Download and extract Kafka
# (archive.apache.org hosts older releases; downloads.apache.org keeps only current ones)
wget https://archive.apache.org/dist/kafka/3.5.1/kafka_2.13-3.5.1.tgz
tar -xzf kafka_2.13-3.5.1.tgz
cd kafka_2.13-3.5.1

# Start Zookeeper (Terminal 1)
bin/zookeeper-server-start.sh config/zookeeper.properties

# Start Kafka Server (Terminal 2)
bin/kafka-server-start.sh config/server.properties
```

On macOS (with Homebrew):

```bash
brew install kafka
brew services start zookeeper
brew services start kafka
```

Verify Kafka is running:

```bash
sudo systemctl status kafka
# Should show: active (running)
# (systemctl applies only if Kafka runs as a systemd service; otherwise
# check the terminal running kafka-server-start.sh)
```

Download and configure Spark:
```bash
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark
```

Add to your PATH (in ~/.bashrc or ~/.zshrc):

```bash
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
```

Apply changes:

```bash
source ~/.bashrc  # or: source ~/.zshrc
```

- Place your `flight_data.csv` file in the project root directory
- The dataset should include columns:
  - `FL_DATE`, `AIRLINE`, `AIRLINE_CODE`, `ORIGIN`, `DEST`
  - `CRS_DEP_TIME`, `CRS_ARR_TIME`, `CRS_ELAPSED_TIME`
  - `DEP_TIME`, `ARR_TIME`, `DEP_DELAY`, `ARR_DELAY`
  - `DISTANCE`, `CANCELLED`, `DIVERTED`
  - Delay cause columns (`DELAY_DUE_CARRIER`, etc.)
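Before running the notebooks, it can help to sanity-check that the CSV header actually contains the expected columns. A minimal self-contained check (the helper name and the exact required subset below are illustrative, not part of the repo):

```python
import csv
import io

# Columns the project expects (subset of the list above, for illustration)
REQUIRED = ["FL_DATE", "AIRLINE", "AIRLINE_CODE", "ORIGIN", "DEST",
            "CRS_DEP_TIME", "CRS_ARR_TIME", "DEP_DELAY", "ARR_DELAY",
            "DISTANCE", "CANCELLED", "DIVERTED"]

def missing_columns(csv_text: str, required=REQUIRED):
    """Return the required columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [col for col in required if col not in header]

sample = ("FL_DATE,AIRLINE,AIRLINE_CODE,ORIGIN,DEST,CRS_DEP_TIME,"
          "CRS_ARR_TIME,DEP_DELAY,ARR_DELAY,DISTANCE,CANCELLED,DIVERTED\n")
print(missing_columns(sample))  # []
```

In practice you would pass the first line of `flight_data.csv` instead of the inline sample.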
Verify dataset location:
```bash
ls -lh flight_data.csv
# Should show file size (typically 100MB - 2GB)
```

Start Jupyter Lab:

```bash
jupyter lab
# Server will start at http://localhost:8888
```

Your browser should open automatically. If not, copy the URL from the terminal.
This step must be completed before streaming predictions.
Step 1: Data Visualization (Optional)
- In Jupyter Lab, open `Step1_Data_Visualization.ipynb`
- Click Run All Cells (Cell → Run All)
- This performs EDA and generates visualizations
- Time required: ~10-15 minutes
Step 2: Batch Processing (Required)
- In Jupyter Lab, open `Step2_Batch_Processing.ipynb`
- Click Run All Cells (Cell → Run All)
- This will:
- Clean and preprocess the dataset
- Engineer features
- Train GBT, Random Forest, and Logistic Regression models
- Evaluate models with confusion matrices
- Save trained models to disk
- Time required: ~20-30 minutes
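The target the models are trained on is the severe-delay flag from the project definition (arrival delay ≥ 60 minutes). As a sketch of that labeling rule (the notebook's actual column names and logic may differ):

```python
SEVERE_DELAY_MINUTES = 60  # threshold from the project definition

def severe_delay_label(arr_delay_minutes: float) -> int:
    """Binary training target: 1 if the arrival delay is severe
    (>= 60 minutes), else 0."""
    return int(arr_delay_minutes >= SEVERE_DELAY_MINUTES)

print([severe_delay_label(d) for d in (-5, 45, 60, 120)])  # [0, 0, 1, 1]
```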
Verify models were saved:
```bash
ls -lh flight_delay_gbt_pipeline_model/
ls -lh flight_delay_gbt_model/
# Both directories should contain Spark model files
```

You'll run three processes simultaneously: Kafka Producer, Spark Streaming, and Flask Dashboard.
From the project directory:
```bash
python3 kafka_producer.py --csv flight_data.csv --speed 100
```

Command-line options:
- `--csv`: Path to the CSV file (default: `flight_data.csv`)
- `--topic`: Kafka topic name (default: `flight_data_stream`)
- `--servers`: Kafka bootstrap servers (default: `localhost:9092`)
- `--speed`: Records per second (default: 100; use 0 for maximum speed)
Expected output:
```
Starting to stream data from flight_data.csv to Kafka topic 'flight_data_stream'...
CSV Headers: ['FL_DATE', 'AIRLINE', 'AIRLINE_CODE', ...]
Sent 1000 records...
Sent 2000 records...
```
Keep this terminal running. The producer continuously sends flight data to Kafka.
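As an illustration of what such a producer loop does (a sketch, not the repo's actual `kafka_producer.py`): each CSV row is serialized to JSON and sent at a throttled rate, mirroring the `--speed` option. The real script would hand each message to kafka-python's `KafkaProducer`; here a stub `send` callable stands in so the sketch is self-contained:

```python
import csv
import io
import json
import time

def row_to_message(row: dict) -> bytes:
    """Serialize one CSV row (column -> string dict) into a JSON message."""
    return json.dumps(row).encode("utf-8")

def send_rows(reader, send, speed=100):
    """Send each row via `send(bytes)`, throttled to `speed` records/second
    (0 = no throttling), mirroring the producer's --speed option."""
    delay = 1.0 / speed if speed > 0 else 0.0
    sent = 0
    for row in reader:
        send(row_to_message(row))
        sent += 1
        if delay:
            time.sleep(delay)
    return sent

# Example with an in-memory CSV and a stub `send` that collects messages
sample = "FL_DATE,AIRLINE_CODE,ORIGIN,DEST\n2024-01-15,AA,LAX,JFK\n"
messages = []
count = send_rows(csv.DictReader(io.StringIO(sample)), messages.append, speed=0)
print(count)  # 1
```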
In Jupyter Lab:
- Open `Step3_Streaming_Prediction.ipynb`
- Click Run All Cells (Cell → Run All)
- The notebook will:
  - Load the saved preprocessing pipeline and GBT model
  - Connect to the Kafka topic `flight_data_stream`
  - Process streaming data in real time
  - Generate predictions for each flight
  - Save predictions to Parquet files
Expected output:
```
-------------------------------------------
Batch: 0
-------------------------------------------
+----------+------------+------+-----+------------+--------------------+-------------------------+
|FL_DATE   |AIRLINE_CODE|ORIGIN|DEST |CRS_DEP_TIME|Prediction_Label    |Probability_Severe_Delay |
+----------+------------+------+-----+------------+--------------------+-------------------------+
|2024-01-15|AA          |LAX   |JFK  |800         |Severe Delay Pred...|0.78                     |
|2024-01-15|UA          |ORD   |SFO  |1430        |No Severe Delay P...|0.12                     |
+----------+------------+------+-----+------------+--------------------+-------------------------+
```
Keep this notebook running. It continuously processes streaming data.
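The `Prediction_Label` column is derived from the model's severe-delay probability. A sketch of that mapping (the 0.5 decision threshold is an assumption for illustration; the notebook may use a different cutoff):

```python
def prediction_label(prob_severe: float, threshold: float = 0.5) -> str:
    """Map a severe-delay probability to the label shown on the dashboard.
    The 0.5 threshold is assumed for illustration."""
    if prob_severe >= threshold:
        return "Severe Delay Predicted"
    return "No Severe Delay Predicted"

print(prediction_label(0.78))  # Severe Delay Predicted
print(prediction_label(0.12))  # No Severe Delay Predicted
```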
Open a new terminal and run:
```bash
cd FlightDelay
python3 app.py
```

Expected output:

```
🚀 Flask app started!
 * Running on http://0.0.0.0:5000
 * Debug mode: on
```
Local Machine: `http://localhost:5000`
GCP VM: `http://<YOUR_VM_EXTERNAL_IP>:5000`
To find your GCP VM's external IP:
```bash
curl ifconfig.me
```

The dashboard auto-refreshes every 10 seconds and displays:
- Top 50 most recent flight predictions
- Flight details (date, airline, origin, destination, scheduled departure)
- Prediction label (Severe Delay Predicted vs No Severe Delay)
- Probability of severe delay (percentage)
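Selecting the "top 50 most recent" rows is a simple sort-and-slice once predictions are loaded from the Parquet files. A minimal sketch (the row schema and helper name are illustrative; the actual `app.py` may differ):

```python
from datetime import date

def latest_predictions(rows, limit=50):
    """Return the most recent prediction rows first, capped at `limit`.
    Assumes each row carries a sortable 'FL_DATE' (illustrative schema)."""
    return sorted(rows, key=lambda r: r["FL_DATE"], reverse=True)[:limit]

rows = [
    {"FL_DATE": date(2024, 1, 14), "ORIGIN": "ORD"},
    {"FL_DATE": date(2024, 1, 15), "ORIGIN": "LAX"},
]
print(latest_predictions(rows)[0]["ORIGIN"])  # LAX
```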
The Dataset Application is a separate static web app for exploring pre-generated visualizations.
- On your local machine (not the GCP VM), open VS Code
- Open the `FlightDelay` folder in VS Code
- Ensure the Live Server extension by Ritwick Dey is installed:
  - Extensions → search "Live Server" → Install
- In the VS Code Explorer, navigate to `Dataset Application/index.html`
- Right-click `index.html` → select "Open with Live Server"
- Your browser will open to `http://127.0.0.1:5500/Dataset%20Application/index.html`
7×5 Grid (Dataset Visualizations):
- Columns: Select a category (Day of Week, Month, Hour, Origin Airport, Dest Airport, Airline, Cause of Delay)
- Rows: Select a metric (Average Arrival Delay, Number of Flights, Severe Delays, Proportion by Severity, Ratio of Flight Time/Delay)
- Image Updates: Click any row + column to see corresponding visualization
3D Matrix (Model Evaluations):
- Dimension 1: Confusion Matrix or Model Evaluation
- Dimension 2: Model Type (RFC, LR, GBT)
- Dimension 3: Class Balancing (Weighting or Resampling)
- Result: View confusion matrices and evaluation metrics for all model combinations
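The 3D matrix simply enumerates every combination of the three dimensions (2 views × 3 models × 2 balancing strategies = 12 variants). For illustration:

```python
from itertools import product

# Dimension values as listed above (names taken from this README)
views = ["Confusion Matrix", "Model Evaluation"]
models = ["RFC", "LR", "GBT"]
balancing = ["Weighting", "Resampling"]

# Every selectable combination in the 3D matrix
combos = list(product(views, models, balancing))
print(len(combos))  # 12
```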
```bash
# Terminal 1: Kafka Producer
python3 kafka_producer.py --csv flight_data.csv --speed 100

# Terminal 2: Jupyter Lab
jupyter lab
# Then run Step3_Streaming_Prediction.ipynb

# Terminal 3: Flask Dashboard
python3 app.py

# Open Dataset Application/index.html in VS Code
# Right-click → "Open with Live Server"
```

Kafka:

```bash
sudo systemctl status kafka
# Should show: active (running)
```

Jupyter Lab:
- Open your browser to `http://localhost:8888`
- You should see the Jupyter Lab interface
Kafka Producer:
- Terminal 1 should show "Sent X records..." messages
Spark Streaming:
- Jupyter notebook should display batch predictions
Flask Dashboard:
```bash
curl http://localhost:5000
# Should return HTML content
```

- Open your browser to `http://localhost:5000` (or `http://<VM_IP>:5000`)
- You should see the FlightDelay dashboard
- The table should populate with predictions (may take 30-60 seconds for first batch)
- Verify:
- Flight details are displayed correctly
- Prediction labels show "Severe Delay Predicted" or "No Severe Delay Predicted"
- Probability percentages are displayed
- Page auto-refreshes every 10 seconds
- Open `Dataset Application/index.html` with Live Server
- You should see the visualization matrix interface
- Click any row button and any column button
- An image should appear showing the selected visualization
- Try the 3D matrix section (model evaluations)
- Data Exploration → Run `Step1_Data_Visualization.ipynb`
- Model Training → Run `Step2_Batch_Processing.ipynb` (saves models)
- Start Streaming → Run `kafka_producer.py` (Terminal 1)
- Start Predictions → Run `Step3_Streaming_Prediction.ipynb` (Jupyter)
- Start Dashboard → Run `app.py` (Terminal 2)
- View Predictions → Access `http://<VM_IP>:5000`
- Explore Visualizations → Open `Dataset Application/index.html` (local machine)
This project is licensed under the MIT License - see the LICENSE file for details.
- Apache Kafka & Apache Spark communities for excellent documentation
- PySpark MLlib for distributed machine learning capabilities
- Flask for lightweight web framework
- Google Cloud Platform for VM infrastructure
Last Updated: December 23, 2024