Project Structure

πŸ“ File Organization

FlightDelay/
β”œβ”€β”€ Dataset Application/              # Static visualization web app
β”‚   β”œβ”€β”€ index.html                   # Main HTML interface with matrix selector
β”‚   β”œβ”€β”€ script.js                    # Interactive button logic and image loading
β”‚   β”œβ”€β”€ styles.css                   # Styling, layout, and responsive design
β”‚   └── Visualizations/              # Pre-generated PNG visualizations
β”‚       β”œβ”€β”€ 00.png - 46.png          # 7Γ—5 dataset visualization matrix
β”‚       └── 800.png - 921.png        # 2Γ—3Γ—2 model evaluation matrix
β”‚
β”œβ”€β”€ Docs/                             # Additional documentation (optional)
β”‚   └── [Project reports, presentations, notes]
β”‚
β”œβ”€β”€ Step1_Data_Visualization.ipynb    # Exploratory Data Analysis (EDA)
β”œβ”€β”€ Step2_Batch_Processing.ipynb      # Model training and evaluation
β”œβ”€β”€ Step3_Streaming_Prediction.ipynb  # Real-time inference pipeline
β”‚
β”œβ”€β”€ app.py                            # Flask web server for prediction dashboard
β”œβ”€β”€ kafka_producer.py                 # Kafka producer for streaming CSV data
β”‚
β”œβ”€β”€ flight_data.csv                   # Input dataset (NOT in repository)
β”‚
β”œβ”€β”€ .gitignore                        # Git exclusion rules
β”œβ”€β”€ LICENSE                           # MIT License
└── README.md                         # Setup and usage instructions

πŸ“„ Key Files Explained

Jupyter Notebooks

Step1_Data_Visualization.ipynb

  • Purpose: Exploratory Data Analysis (EDA) and data visualization
  • Key Tasks:
    • Load and inspect flight dataset
    • Clean data (drop nulls, handle outliers)
    • Engineer temporal and derived features
    • Generate descriptive statistics
    • Create correlation matrix heatmap
    • Visualize arrival delay distributions
    • Export 35 PNG visualizations to Dataset Application/Visualizations/
  • Dependencies: PySpark, Pandas, Matplotlib, Seaborn, findspark
  • Outputs: Statistical summaries, correlation insights, PNG images
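
A minimal sketch of this stage, assuming the pip-installed PySpark and a delay column named ARR_DELAY; the notebook's actual column names, cleaning rules, and plot set may differ:

import findspark
findspark.init()

import matplotlib
matplotlib.use("Agg")                         # headless rendering on a VM
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlightDelayEDA").getOrCreate()

df = spark.read.csv("flight_data.csv", header=True, inferSchema=True)
df = df.dropna(subset=["ARR_DELAY"])          # drop rows missing the label
df.describe("ARR_DELAY").show()               # descriptive statistics

# Sample to pandas for plotting, then export a PNG; the notebook
# writes the numbered 00.png-46.png set instead of this example name
delays = df.select("ARR_DELAY").sample(fraction=0.1, seed=42).toPandas()
plt.hist(delays["ARR_DELAY"], bins=50)
plt.xlabel("Arrival delay (minutes)")
plt.ylabel("Flight count")
plt.savefig("delay_distribution.png")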

Step2_Batch_Processing.ipynb

  • Purpose: Train and evaluate machine learning models
  • Key Tasks:
    • Load dataset with defined schema
    • Preprocess data (drop irrelevant columns, handle nulls)
    • Engineer features (temporal, categorical, numerical)
    • Build ML pipeline (StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler)
    • Train three models: GBT, Random Forest, Logistic Regression
    • Handle class imbalance with weighting or resampling
    • Evaluate models (AUC, accuracy, precision, recall, F1, confusion matrix)
    • Extract and visualize feature importance
    • Save models to disk for production use
  • Dependencies: PySpark MLlib, Pandas, NumPy, Matplotlib, Seaborn
  • Outputs:
    • flight_delay_gbt_pipeline_model/ (preprocessing pipeline)
    • flight_delay_gbt_model/ (trained GBT classifier)
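
A hedged sketch of the training stage described above; the feature lists, the 60-minute severe-delay threshold, and the hyperparameters are illustrative assumptions rather than the notebook's exact values, and the class-imbalance handling and evaluation steps are omitted for brevity:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder,
                                VectorAssembler, StandardScaler)
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("FlightDelayTraining").getOrCreate()

df = spark.read.csv("flight_data.csv", header=True, inferSchema=True)
df = df.dropna(subset=["ARR_DELAY"])
# Assumed label definition: severe delay = 60+ minutes late
df = df.withColumn("label", (col("ARR_DELAY") >= 60).cast("double"))
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

categorical = ["AIRLINE_CODE", "ORIGIN", "DEST"]
numerical = ["CRS_DEP_TIME", "DISTANCE"]      # assumed numerical features

indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx",
                          handleInvalid="keep") for c in categorical]
encoder = OneHotEncoder(inputCols=[f"{c}_idx" for c in categorical],
                        outputCols=[f"{c}_vec" for c in categorical])
num_assembler = VectorAssembler(inputCols=numerical, outputCol="num_raw")
scaler = StandardScaler(inputCol="num_raw", outputCol="num_scaled")
assembler = VectorAssembler(
    inputCols=[f"{c}_vec" for c in categorical] + ["num_scaled"],
    outputCol="features")

pipeline_model = Pipeline(stages=indexers + [encoder, num_assembler,
                                             scaler, assembler]).fit(train_df)

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
gbt_model = gbt.fit(pipeline_model.transform(train_df))

# Persist both artifacts for the streaming stage
pipeline_model.write().overwrite().save("flight_delay_gbt_pipeline_model")
gbt_model.write().overwrite().save("flight_delay_gbt_model")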

Step3_Streaming_Prediction.ipynb

  • Purpose: Real-time streaming predictions with Spark Structured Streaming
  • Key Tasks:
    • Initialize Spark session with Kafka connector
    • Load saved preprocessing pipeline and GBT model
    • Connect to Kafka topic flight_data_stream
    • Parse JSON messages into structured DataFrame
    • Apply feature engineering in real-time
    • Transform data using saved pipeline
    • Generate predictions with trained model
    • Output predictions to console, Parquet files, and MongoDB
    • Maintain checkpoints for fault tolerance
  • Dependencies: PySpark, Kafka, MongoDB connector (optional)
  • Outputs:
    • streaming_predictions_output/ (Parquet files)
    • streaming_predictions_checkpoint/ (Spark checkpoints)
    • Console logs with batch predictions
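
A condensed sketch of the streaming flow, assuming the pip-installed PySpark (Scala 2.12 build) and a simplified message schema; the notebook's full schema, derived columns such as Prediction_Label and Probability_Severe_Delay, and the console/MongoDB sinks are omitted:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.ml import PipelineModel
from pyspark.ml.classification import GBTClassificationModel

spark = (SparkSession.builder
         .appName("FlightDelayStreaming")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
         .getOrCreate())

# Assumed subset of the message schema
schema = (StructType()
          .add("FL_DATE", StringType())
          .add("AIRLINE_CODE", StringType())
          .add("ORIGIN", StringType())
          .add("DEST", StringType())
          .add("CRS_DEP_TIME", IntegerType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "flight_data_stream")
       .load())

# Kafka values arrive as bytes; decode and parse the JSON payload
flights = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), schema).alias("f"))
           .select("f.*"))

pipeline_model = PipelineModel.load("flight_delay_gbt_pipeline_model")
gbt_model = GBTClassificationModel.load("flight_delay_gbt_model")

predictions = gbt_model.transform(pipeline_model.transform(flights))
out = predictions.select("FL_DATE", "AIRLINE_CODE", "ORIGIN", "DEST",
                         "CRS_DEP_TIME", "prediction")

query = (out.writeStream
         .format("parquet")
         .option("path", "streaming_predictions_output/")
         .option("checkpointLocation", "streaming_predictions_checkpoint/")
         .start())
query.awaitTermination()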

Python Scripts

app.py

  • Purpose: Flask web server for real-time prediction dashboard
  • Key Features:
    • Reads Parquet files from ./streaming_predictions_output/
    • Aggregates and sorts predictions by flight date and departure time
    • Displays top 50 most recent predictions in HTML table
    • Auto-refreshes every 10 seconds
    • Color-codes predictions (red = severe delay, green = no delay)
    • Shows probability percentages
  • Technologies: Flask, Pandas, Jinja2 templating
  • Server: Runs on 0.0.0.0:5000 (accessible via http://<VM_IP>:5000)
  • Dependencies: Flask, Pandas, glob, os
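
A minimal sketch of a dashboard along these lines (reading the Parquet files with pandas requires pyarrow or fastparquet); the inline template and sort columns are illustrative assumptions, not app.py's exact code:

import glob
import pandas as pd
from flask import Flask, render_template_string

STREAMING_OUTPUT_PATH = "./streaming_predictions_output/"
app = Flask(__name__)

# Meta-refresh reloads the page every 10 seconds
PAGE = """
<html><head><meta http-equiv="refresh" content="10"></head>
<body>{{ table | safe }}</body></html>
"""

@app.route("/")
def dashboard():
    files = glob.glob(STREAMING_OUTPUT_PATH + "*.parquet")
    if not files:
        return "No predictions yet."
    df = pd.concat(pd.read_parquet(f) for f in files)
    # Keep the 50 most recent predictions
    df = df.sort_values(["FL_DATE", "CRS_DEP_TIME"], ascending=False).head(50)
    return render_template_string(PAGE, table=df.to_html(index=False))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)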

kafka_producer.py

  • Purpose: Stream CSV data to Kafka topic in real-time
  • Key Features:
    • Reads flight_data.csv row by row
    • Converts CSV rows to JSON format
    • Publishes messages to Kafka topic flight_data_stream
    • Configurable streaming speed (records per second)
    • Batch flushing every 1,000 records
    • Casts field values to their proper types (int, float, string)
    • Command-line interface with argparse
  • CLI Options:
    • --csv: Path to CSV file (default: flight_data.csv)
    • --topic: Kafka topic name (default: flight_data_stream)
    • --servers: Kafka bootstrap servers (default: localhost:9092)
    • --speed: Records/second (default: 100, 0 = max speed)
  • Dependencies: kafka-python, argparse, json, csv
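
A condensed sketch of the producer's core loop, using the CLI options listed above; the per-field type casting and error handling of the real script are omitted:

import argparse
import csv
import json
import time

from kafka import KafkaProducer

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--csv", default="flight_data.csv")
    p.add_argument("--topic", default="flight_data_stream")
    p.add_argument("--servers", default="localhost:9092")
    p.add_argument("--speed", type=int, default=100)  # records/sec, 0 = max
    args = p.parse_args()

    producer = KafkaProducer(
        bootstrap_servers=args.servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    with open(args.csv, newline="") as f:
        # The real script also casts numeric fields to int/float here
        for i, row in enumerate(csv.DictReader(f), start=1):
            producer.send(args.topic, row)
            if i % 1000 == 0:              # batch flush every 1,000 records
                producer.flush()
            if args.speed > 0:
                time.sleep(1 / args.speed)
    producer.flush()

if __name__ == "__main__":
    main()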

Dataset Application (Static Web App)

Dataset Application/index.html

  • Purpose: Interactive web interface for exploring visualizations
  • Features:
    • 7Γ—5 matrix selector for dataset visualizations
    • 3D matrix selector (2Γ—3Γ—2) for model evaluation visualizations
    • Button-based navigation (row + column selection)
    • Dynamic image loading based on selection
    • Active button highlighting (green)
    • Current selection display
    • Graceful handling of missing images
  • Dependencies: Embedded JavaScript (script.js), CSS (styles.css)

Dataset Application/script.js

  • Purpose: Client-side logic for interactive image selection
  • Features:
    • Event listeners for column, row, and 3D dimension buttons
    • Dynamic image path construction (./Visualizations/{row}{col}.png)
    • Active state management (green highlighting)
    • Image error handling (displays "Image not found")
    • Selection text updates
    • Two separate visualization systems (2D matrix and 3D matrix)
  • Technologies: Vanilla JavaScript (ES6), DOM manipulation

Dataset Application/styles.css

  • Purpose: Visual styling and layout for the web app
  • Features:
    • Flexbox-based responsive layout
    • Button hover effects and active states
    • Green highlighting for active buttons
    • Styled image containers with borders
    • Section dividers with gradient
    • Professional typography and spacing
    • Mobile-responsive design considerations
  • Technologies: CSS3, Flexbox

Dataset Application/Visualizations/

  • Purpose: Store pre-generated PNG visualization images
  • Contents:
    • Dataset Matrix (00.png - 46.png):
      • 7 columns Γ— 5 rows = 35 images
      • Categories: Day of Week, Month, Hour, Origin, Dest, Airline, Delay Cause
      • Metrics: Avg Delay, Flight Count, Severe Delays, Proportions, Ratios
    • Model Evaluation Matrix (800.png - 921.png):
      • 2 evaluation types Γ— 3 models Γ— 2 balancing methods = 12 images
      • Confusion matrices and evaluation metrics
      • Models: RFC, LR, GBT
      • Balancing: Weighting, Resampling
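
The numbering scheme, as inferred from the descriptions above, can be enumerated like this:

# Filename scheme inferred from the matrix descriptions: dataset images
# are {row}{col}.png, model-evaluation images are {8 or 9}{model}{balance}.png
dataset_images = [f"{row}{col}.png" for row in range(5) for col in range(7)]
# -> 00.png ... 46.png (35 files)

model_images = [f"{etype}{model}{balance}.png"
                for etype in (8, 9)        # 2 evaluation types
                for model in range(3)      # RFC, LR, GBT
                for balance in range(2)]   # weighting, resampling
# -> 800.png ... 921.png (12 files)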

Configuration & Documentation

.gitignore

  • Purpose: Specify files/folders to exclude from Git version control
  • Typical Exclusions:
  .DS_Store
  flight_data.csv
  flight_delay_gbt_pipeline_model/
  flight_delay_gbt_model/
  streaming_predictions_output/
  streaming_predictions_checkpoint/
  mongodb_checkpoint/
  *.pyc
  __pycache__/
  .ipynb_checkpoints/
  venv/
  .env
  *.log

LICENSE

  • Type: MIT License
  • Purpose: Define open-source usage terms and permissions
  • Allows: Commercial use, modification, distribution, private use
  • Requires: License and copyright notice inclusion
  • Liability: No warranty provided

README.md

  • Purpose: Primary project documentation and setup guide
  • Contents:
    • Project overview
    • Streaming service setup instructions
    • GCP VM deployment steps
    • Kafka, Spark, Flask configuration
    • Dataset application usage
    • Step-by-step workflow
    • Access URLs and commands
  • Target Audience: Developers, data scientists, instructors

πŸ—‚οΈ Generated Artifacts (Not in Git Repository)

These files and folders are created during runtime and are excluded from version control:

Model Files

flight_delay_gbt_pipeline_model/

  • Created By: Step2_Batch_Processing.ipynb
  • Purpose: Saved Spark ML preprocessing pipeline
  • Contains:
    • StringIndexer models for categorical features (AIRLINE_CODE, ORIGIN, DEST)
    • OneHotEncoder models for categorical encoding
    • VectorAssembler for numerical features
    • StandardScaler for feature normalization
    • Final VectorAssembler combining all features
  • Size: ~10-50 MB (depending on unique categorical values)
  • Format: Spark PipelineModel (metadata + binary data)

flight_delay_gbt_model/

  • Created By: Step2_Batch_Processing.ipynb
  • Purpose: Saved trained Gradient Boosted Tree classifier
  • Contains:
    • GBT model weights and structure
    • Feature importance scores
    • Model metadata (hyperparameters, training metrics)
  • Size: ~20-100 MB (depending on tree depth and iterations)
  • Format: Spark GBTClassificationModel (metadata + binary data)

Streaming Output

streaming_predictions_output/

  • Created By: Step3_Streaming_Prediction.ipynb
  • Purpose: Store real-time predictions in Parquet format
  • Contains:
    • Multiple .parquet files (one per micro-batch)
    • Columns: FL_DATE, AIRLINE_CODE, ORIGIN, DEST, CRS_DEP_TIME, Prediction_Label, Probability_Severe_Delay
    • Partitioned by batch timestamp
  • Size: Grows continuously (1-5 MB per 1,000 records)
  • Format: Apache Parquet (columnar storage)
  • Read By: app.py (Flask dashboard)

streaming_predictions_checkpoint/

  • Created By: Step3_Streaming_Prediction.ipynb (Spark Structured Streaming)
  • Purpose: Enable fault tolerance and exactly-once processing
  • Contains:
    • Offset tracking for Kafka consumer
    • State metadata for streaming query
    • Recovery information for restarts
  • Size: ~1-10 MB
  • Format: Spark checkpoint files (binary)
  • Behavior: Automatically managed by Spark

mongodb_checkpoint/

  • Created By: Step3_Streaming_Prediction.ipynb (if using MongoDB output)
  • Purpose: Checkpoint directory for MongoDB streaming sink
  • Contains:
    • MongoDB-specific offset tracking
    • Write-ahead logs for recovery
  • Size: ~1-5 MB
  • Format: Spark checkpoint files (binary)
  • Note: Only created if MongoDB output is enabled (commented out in current code)

Dataset (Not Included in Repository)

flight_data.csv

  • Source: External flight data provider (user-supplied)
  • Purpose: Input dataset for training and streaming
  • Typical Size: 100 MB - 2 GB (depending on number of records)
  • Columns: 32 fields including flight dates, airlines, airports, delays, distances
  • Excluded from Git: Too large for version control (listed in .gitignore)
  • Must Be Provided By User: Required for all notebooks to run

πŸ”§ Dependencies & Configuration Files

Not Included (User Must Create)

requirements.txt (Optional)

pyspark==3.5.1
pandas>=1.5.0
numpy>=1.23.0
matplotlib>=3.5.0
seaborn>=0.12.0
kafka-python>=2.0.2
flask>=2.3.0
findspark>=2.0.1

Environment Variables (Optional)

# For production deployment
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=python3
export KAFKA_HOME=/opt/kafka
export FLASK_ENV=production
export MONGODB_URI=mongodb://localhost:27017/flightdb

Embedded Configuration

Spark Configuration (in notebooks):

from pyspark.sql import SparkSession

# Note: the pip-installed pyspark 3.5.1 wheel is built against Scala 2.12,
# so the Kafka connector must use the matching _2.12 artifact
spark = SparkSession.builder \
    .appName("FlightDelayPrediction") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
    .getOrCreate()

Kafka Configuration (in kafka_producer.py):

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    linger_ms=10,
    batch_size=16384
)

Flask Configuration (in app.py):

STREAMING_OUTPUT_PATH = "./streaming_predictions_output/"

# debug=True is convenient during development; disable it for production
app.run(host="0.0.0.0", port=5000, debug=True)

πŸ“Š Data Flow Diagram

flight_data.csv
      ↓
kafka_producer.py β†’ Kafka Topic (flight_data_stream)
      ↓
Step3_Streaming_Prediction.ipynb
      ↓
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ Load Models  β”‚ ← flight_delay_gbt_pipeline_model/
   β”‚              β”‚ ← flight_delay_gbt_model/
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      ↓
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ Predictions  β”‚ β†’ streaming_predictions_output/ (Parquet)
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β†’ Console (stdout)
      ↓                β†’ MongoDB (optional)
   app.py (Flask)
      ↓
Web Dashboard (http://<VM_IP>:5000)

πŸ” File Size Reference

| File/Folder                      | Typical Size  | Notes                                   |
|----------------------------------|---------------|-----------------------------------------|
| Step1_Data_Visualization.ipynb   | 1-3 MB        | Contains inline images and outputs      |
| Step2_Batch_Processing.ipynb     | 2-5 MB        | Contains confusion matrices and charts  |
| Step3_Streaming_Prediction.ipynb | 500 KB - 1 MB | Minimal outputs, mostly code            |
| app.py                           | 3-5 KB        | Small Flask server script               |
| kafka_producer.py                | 4-6 KB        | Lightweight streaming script            |
| Dataset Application/             | 50-200 MB     | PNG images are the largest component    |
| flight_data.csv                  | 100 MB - 2 GB | User-provided, not in repo              |
| flight_delay_gbt_pipeline_model/ | 10-50 MB      | Saved preprocessing pipeline            |
| flight_delay_gbt_model/          | 20-100 MB     | Trained GBT classifier                  |
| streaming_predictions_output/    | Growing       | Accumulates over time                   |

πŸš€ Workflow Summary

  1. Data Exploration β†’ Run Step1_Data_Visualization.ipynb
  2. Model Training β†’ Run Step2_Batch_Processing.ipynb (saves models)
  3. Start Streaming β†’ Run kafka_producer.py (Terminal 1)
  4. Start Predictions β†’ Run Step3_Streaming_Prediction.ipynb (Jupyter)
  5. Start Dashboard β†’ Run app.py (Terminal 2)
  6. View Predictions β†’ Access http://<VM_IP>:5000
  7. Explore Visualizations β†’ Open Dataset Application/index.html (local machine)

πŸ’‘ Best Practices

Version Control

  • βœ… Commit: Notebooks, scripts, HTML/CSS/JS, README, LICENSE
  • ❌ Exclude: Dataset (CSV), models, Parquet outputs, checkpoints, logs
  • πŸ“ Update .gitignore to prevent accidental commits of large files

Data Management

  • Store flight_data.csv separately (Google Drive, S3, etc.)
  • Backup trained models before re-running Step2
  • Periodically clean streaming_predictions_output/ to save disk space (see the sketch after this list)
  • Use compression for long-term storage: tar -czf predictions.tar.gz streaming_predictions_output/
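
One possible implementation of the periodic cleanup, assuming an arbitrary seven-day retention window; stop the streaming query before sweeping, since Spark's _spark_metadata log in that folder will otherwise reference deleted files:

import glob
import os
import time

CUTOFF = time.time() - 7 * 24 * 3600   # keep the last seven days (adjust to taste)

for path in glob.glob("streaming_predictions_output/*.parquet"):
    if os.path.getmtime(path) < CUTOFF:
        os.remove(path)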

Development Workflow

  • Use virtual environments to isolate Python dependencies
  • Test notebooks on small data samples before full runs
  • Monitor disk space during streaming (Parquet files accumulate)
  • Keep separate development and production model directories

Documentation

  • Add inline comments to notebooks for clarity
  • Document custom modifications in Docs/ folder
  • Update README.md when changing deployment steps
  • Version models with timestamps (e.g., flight_delay_gbt_model_20241223/)