FlightDelay is a real-time flight delay prediction system that uses streaming data processing and machine learning to forecast severe delays (≥60 minutes) before they occur. The system ingests live flight data through Apache Kafka, processes it with Apache Spark Structured Streaming, and applies trained Gradient Boosted Tree models to deliver actionable predictions via an interactive web dashboard.
For detailed guides, technical info, and step-by-step instructions, visit our wiki:
- Home – Project overview, motivation, and key features
- Core Features – Real-time streaming, ML models, and visualizations
- Installation & Setup – Complete local and GCP deployment guide
- Tech Stack – Big data tools, ML frameworks, and libraries
- Project Structure – File organization and architecture
- Our Team – Meet the developers behind FlightDelay
Before starting, ensure you have:
- Python 3.8+ – Download Python
- Apache Kafka 3.5.1 (Scala 2.13 build) – Download Kafka
- Apache Spark 3.5.1 – Download Spark
- Jupyter Lab – Install Jupyter
- GCP Account (for VM deployment) or local machine with 16GB+ RAM
- Dataset: `flight_data.csv` (historical flight records)
Optional:
- VS Code with Live Server Extension
- MongoDB (if using MongoDB storage)
```bash
# Clone the repository
git clone https://github.com/NolanP2003/FlightDelay.git
cd FlightDelay

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install pyspark==3.5.1 pandas numpy matplotlib seaborn kafka-python flask findspark
```

Verify installation:

```bash
python3 -c "import pyspark; print(pyspark.__version__)"
# Should output: 3.5.1
```

On Linux/Ubuntu:
```bash
# Download and extract Kafka
# (archive.apache.org hosts older releases; downloads.apache.org keeps only current ones)
wget https://archive.apache.org/dist/kafka/3.5.1/kafka_2.13-3.5.1.tgz
tar -xzf kafka_2.13-3.5.1.tgz
cd kafka_2.13-3.5.1

# Start Zookeeper (Terminal 1)
bin/zookeeper-server-start.sh config/zookeeper.properties

# Start Kafka Server (Terminal 2)
bin/kafka-server-start.sh config/server.properties
```

On macOS (with Homebrew):

```bash
brew install kafka
brew services start zookeeper
brew services start kafka
```

Verify Kafka is running:

```bash
sudo systemctl status kafka
# Should show: active (running)
# (systemctl applies only if Kafka runs as a systemd service; otherwise
# check the terminal running kafka-server-start.sh)
```

Download and configure Spark:
```bash
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark
```

Add to your PATH (in ~/.bashrc or ~/.zshrc):

```bash
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
```

Apply changes:

```bash
source ~/.bashrc  # or: source ~/.zshrc
```

- Place your `flight_data.csv` file in the project root directory
- The dataset should include columns:
  - `FL_DATE`, `AIRLINE`, `AIRLINE_CODE`, `ORIGIN`, `DEST`
  - `CRS_DEP_TIME`, `CRS_ARR_TIME`, `CRS_ELAPSED_TIME`
  - `DEP_TIME`, `ARR_TIME`, `DEP_DELAY`, `ARR_DELAY`
  - `DISTANCE`, `CANCELLED`, `DIVERTED`
  - Delay cause columns (`DELAY_DUE_CARRIER`, etc.)
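Before running the notebooks, it can help to sanity-check that the CSV header actually contains the expected columns. A minimal self-contained check (the helper name and the exact required subset below are illustrative, not part of the repo):

```python
import csv
import io

# Columns the project expects (subset of the list above, for illustration)
REQUIRED = ["FL_DATE", "AIRLINE", "AIRLINE_CODE", "ORIGIN", "DEST",
            "CRS_DEP_TIME", "CRS_ARR_TIME", "DEP_DELAY", "ARR_DELAY",
            "DISTANCE", "CANCELLED", "DIVERTED"]

def missing_columns(csv_text: str, required=REQUIRED):
    """Return the required columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [col for col in required if col not in header]

sample = ("FL_DATE,AIRLINE,AIRLINE_CODE,ORIGIN,DEST,CRS_DEP_TIME,"
          "CRS_ARR_TIME,DEP_DELAY,ARR_DELAY,DISTANCE,CANCELLED,DIVERTED\n")
print(missing_columns(sample))  # []
```

In practice you would pass the first line of `flight_data.csv` instead of the inline sample.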
Verify dataset location:
```bash
ls -lh flight_data.csv
# Should show file size (typically 100MB - 2GB)
```

Start Jupyter Lab:

```bash
jupyter lab
# Server will start at http://localhost:8888
```

Your browser should open automatically. If not, copy the URL from the terminal.
This step must be completed before streaming predictions.
Step 1: Data Visualization (Optional)
- In Jupyter Lab, open `Step1_Data_Visualization.ipynb`
- Click Run All Cells (Cell → Run All)
- This performs EDA and generates visualizations
- Time required: ~10-15 minutes
Step 2: Batch Processing (Required)
- In Jupyter Lab, open `Step2_Batch_Processing.ipynb`
- Click Run All Cells (Cell → Run All)
- This will:
- Clean and preprocess the dataset
- Engineer features
- Train GBT, Random Forest, and Logistic Regression models
- Evaluate models with confusion matrices
- Save trained models to disk
- Time required: ~20-30 minutes
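The target the models are trained on is the severe-delay flag from the project definition (arrival delay ≥ 60 minutes). As a sketch of that labeling rule (the notebook's actual column names and logic may differ):

```python
SEVERE_DELAY_MINUTES = 60  # threshold from the project definition

def severe_delay_label(arr_delay_minutes: float) -> int:
    """Binary training target: 1 if the arrival delay is severe
    (>= 60 minutes), else 0."""
    return int(arr_delay_minutes >= SEVERE_DELAY_MINUTES)

print([severe_delay_label(d) for d in (-5, 45, 60, 120)])  # [0, 0, 1, 1]
```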
Verify models were saved:
```bash
ls -lh flight_delay_gbt_pipeline_model/
ls -lh flight_delay_gbt_model/
# Both directories should contain Spark model files
```

You'll run three processes simultaneously: Kafka Producer, Spark Streaming, and Flask Dashboard.
From the project directory:
```bash
python3 kafka_producer.py --csv flight_data.csv --speed 100
```

Command-line options:
- `--csv`: Path to the CSV file (default: `flight_data.csv`)
- `--topic`: Kafka topic name (default: `flight_data_stream`)
- `--servers`: Kafka bootstrap servers (default: `localhost:9092`)
- `--speed`: Records per second (default: 100; use 0 for maximum speed)
Expected output:
```
Starting to stream data from flight_data.csv to Kafka topic 'flight_data_stream'...
CSV Headers: ['FL_DATE', 'AIRLINE', 'AIRLINE_CODE', ...]
Sent 1000 records...
Sent 2000 records...
```
Keep this terminal running. The producer continuously sends flight data to Kafka.
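As an illustration of what such a producer loop does (a sketch, not the repo's actual `kafka_producer.py`): each CSV row is serialized to JSON and sent at a throttled rate, mirroring the `--speed` option. The real script would hand each message to kafka-python's `KafkaProducer`; here a stub `send` callable stands in so the sketch is self-contained:

```python
import csv
import io
import json
import time

def row_to_message(row: dict) -> bytes:
    """Serialize one CSV row (column -> string dict) into a JSON message."""
    return json.dumps(row).encode("utf-8")

def send_rows(reader, send, speed=100):
    """Send each row via `send(bytes)`, throttled to `speed` records/second
    (0 = no throttling), mirroring the producer's --speed option."""
    delay = 1.0 / speed if speed > 0 else 0.0
    sent = 0
    for row in reader:
        send(row_to_message(row))
        sent += 1
        if delay:
            time.sleep(delay)
    return sent

# Example with an in-memory CSV and a stub `send` that collects messages
sample = "FL_DATE,AIRLINE_CODE,ORIGIN,DEST\n2024-01-15,AA,LAX,JFK\n"
messages = []
count = send_rows(csv.DictReader(io.StringIO(sample)), messages.append, speed=0)
print(count)  # 1
```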
In Jupyter Lab:
- Open `Step3_Streaming_Prediction.ipynb`
- Click Run All Cells (Cell → Run All)
- The notebook will:
  - Load the saved preprocessing pipeline and GBT model
  - Connect to the Kafka topic `flight_data_stream`
  - Process streaming data in real time
  - Generate predictions for each flight
  - Save predictions to Parquet files
Expected output:
```
-------------------------------------------
Batch: 0
-------------------------------------------
+----------+------------+------+-----+------------+--------------------+-------------------------+
|FL_DATE   |AIRLINE_CODE|ORIGIN|DEST |CRS_DEP_TIME|Prediction_Label    |Probability_Severe_Delay |
+----------+------------+------+-----+------------+--------------------+-------------------------+
|2024-01-15|AA          |LAX   |JFK  |800         |Severe Delay Pred...|0.78                     |
|2024-01-15|UA          |ORD   |SFO  |1430        |No Severe Delay P...|0.12                     |
+----------+------------+------+-----+------------+--------------------+-------------------------+
```
Keep this notebook running. It continuously processes streaming data.
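The `Prediction_Label` column is derived from the model's severe-delay probability. A sketch of that mapping (the 0.5 decision threshold is an assumption for illustration; the notebook may use a different cutoff):

```python
def prediction_label(prob_severe: float, threshold: float = 0.5) -> str:
    """Map a severe-delay probability to the label shown on the dashboard.
    The 0.5 threshold is assumed for illustration."""
    if prob_severe >= threshold:
        return "Severe Delay Predicted"
    return "No Severe Delay Predicted"

print(prediction_label(0.78))  # Severe Delay Predicted
print(prediction_label(0.12))  # No Severe Delay Predicted
```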
Open a new terminal and run:
```bash
cd FlightDelay
python3 app.py
```

Expected output:

```
🚀 Flask app started!
 * Running on http://0.0.0.0:5000
 * Debug mode: on
```
Local Machine: `http://localhost:5000`
GCP VM: `http://<YOUR_VM_EXTERNAL_IP>:5000`
To find your GCP VM's external IP:
```bash
curl ifconfig.me
```

The dashboard auto-refreshes every 10 seconds and displays:
- Top 50 most recent flight predictions
- Flight details (date, airline, origin, destination, scheduled departure)
- Prediction label (Severe Delay Predicted vs No Severe Delay)
- Probability of severe delay (percentage)
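Selecting the "top 50 most recent" rows is a simple sort-and-slice once predictions are loaded from the Parquet files. A minimal sketch (the row schema and helper name are illustrative; the actual `app.py` may differ):

```python
from datetime import date

def latest_predictions(rows, limit=50):
    """Return the most recent prediction rows first, capped at `limit`.
    Assumes each row carries a sortable 'FL_DATE' (illustrative schema)."""
    return sorted(rows, key=lambda r: r["FL_DATE"], reverse=True)[:limit]

rows = [
    {"FL_DATE": date(2024, 1, 14), "ORIGIN": "ORD"},
    {"FL_DATE": date(2024, 1, 15), "ORIGIN": "LAX"},
]
print(latest_predictions(rows)[0]["ORIGIN"])  # LAX
```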
The Dataset Application is a separate static web app for exploring pre-generated visualizations.
- On your local machine (not the GCP VM), open VS Code
- Open the `FlightDelay` folder in VS Code
- Ensure the Live Server extension by Ritwick Dey is installed:
  - Extensions → search "Live Server" → Install
- In the VS Code Explorer, navigate to `Dataset Application/index.html`
- Right-click `index.html` → select "Open with Live Server"
- Your browser will open to `http://127.0.0.1:5500/Dataset%20Application/index.html`
7×5 Grid (Dataset Visualizations):
- Columns: Select a category (Day of Week, Month, Hour, Origin Airport, Dest Airport, Airline, Cause of Delay)
- Rows: Select a metric (Average Arrival Delay, Number of Flights, Severe Delays, Proportion by Severity, Ratio of Flight Time/Delay)
- Image Updates: Click any row + column to see corresponding visualization
3D Matrix (Model Evaluations):
- Dimension 1: Confusion Matrix or Model Evaluation
- Dimension 2: Model Type (RFC, LR, GBT)
- Dimension 3: Class Balancing (Weighting or Resampling)
- Result: View confusion matrices and evaluation metrics for all model combinations
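The 3D matrix simply enumerates every combination of the three dimensions (2 views × 3 models × 2 balancing strategies = 12 variants). For illustration:

```python
from itertools import product

# Dimension values as listed above (names taken from this README)
views = ["Confusion Matrix", "Model Evaluation"]
models = ["RFC", "LR", "GBT"]
balancing = ["Weighting", "Resampling"]

# Every selectable combination in the 3D matrix
combos = list(product(views, models, balancing))
print(len(combos))  # 12
```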
```bash
# Terminal 1: Kafka Producer
python3 kafka_producer.py --csv flight_data.csv --speed 100

# Terminal 2: Jupyter Lab
jupyter lab
# Then run Step3_Streaming_Prediction.ipynb

# Terminal 3: Flask Dashboard
python3 app.py

# Open Dataset Application/index.html in VS Code
# Right-click → "Open with Live Server"
```

Kafka:

```bash
sudo systemctl status kafka
# Should show: active (running)
```

Jupyter Lab:
- Open your browser to `http://localhost:8888`
- You should see the Jupyter Lab interface
Kafka Producer:
- Terminal 1 should show "Sent X records..." messages
Spark Streaming:
- Jupyter notebook should display batch predictions
Flask Dashboard:
```bash
curl http://localhost:5000
# Should return HTML content
```

- Open your browser to `http://localhost:5000` (or `http://<VM_IP>:5000`)
- You should see the FlightDelay dashboard
- The table should populate with predictions (may take 30-60 seconds for first batch)
- Verify:
- Flight details are displayed correctly
- Prediction labels show "Severe Delay Predicted" or "No Severe Delay Predicted"
- Probability percentages are displayed
- Page auto-refreshes every 10 seconds
- Open `Dataset Application/index.html` with Live Server
- You should see the visualization matrix interface
- Click any row button and any column button
- An image should appear showing the selected visualization
- Try the 3D matrix section (model evaluations)
- Data Exploration → Run `Step1_Data_Visualization.ipynb`
- Model Training → Run `Step2_Batch_Processing.ipynb` (saves models)
- Start Streaming → Run `kafka_producer.py` (Terminal 1)
- Start Predictions → Run `Step3_Streaming_Prediction.ipynb` (Jupyter)
- Start Dashboard → Run `app.py` (Terminal 2)
- View Predictions → Access `http://<VM_IP>:5000`
- Explore Visualizations → Open `Dataset Application/index.html` (local machine)
This project is licensed under the MIT License - see the LICENSE file for details.
- Apache Kafka & Apache Spark communities for excellent documentation
- PySpark MLlib for distributed machine learning capabilities
- Flask for lightweight web framework
- Google Cloud Platform for VM infrastructure
Last Updated: December 23, 2024