Project Structure

πŸ“ File Organization

FlightDelay/
β”œβ”€β”€ Dataset Application/              # Static visualization web app
β”‚   β”œβ”€β”€ index.html                   # Main HTML interface with matrix selector
β”‚   β”œβ”€β”€ script.js                    # Interactive button logic and image loading
β”‚   β”œβ”€β”€ styles.css                   # Styling, layout, and responsive design
β”‚   └── Visualizations/              # Pre-generated PNG visualizations
β”‚       β”œβ”€β”€ 00.png - 46.png          # 7Γ—5 dataset visualization matrix
β”‚       └── 800.png - 921.png        # 2Γ—3Γ—2 model evaluation matrix
β”‚
β”œβ”€β”€ Docs/                             # Additional documentation (optional)
β”‚   └── [Project reports, presentations, notes]
β”‚
β”œβ”€β”€ Step1_Data_Visualization.ipynb    # Exploratory Data Analysis (EDA)
β”œβ”€β”€ Step2_Batch_Processing.ipynb      # Model training and evaluation
β”œβ”€β”€ Step3_Streaming_Prediction.ipynb  # Real-time inference pipeline
β”‚
β”œβ”€β”€ app.py                            # Flask web server for prediction dashboard
β”œβ”€β”€ kafka_producer.py                 # Kafka producer for streaming CSV data
β”‚
β”œβ”€β”€ flight_data.csv                   # Input dataset (NOT in repository)
β”‚
β”œβ”€β”€ .gitignore                        # Git exclusion rules
β”œβ”€β”€ LICENSE                           # MIT License
└── README.md                         # Setup and usage instructions

πŸ“„ Key Files Explained

Jupyter Notebooks

Step1_Data_Visualization.ipynb

  • Purpose: Exploratory Data Analysis (EDA) and data visualization
  • Key Tasks:
    • Load and inspect flight dataset
    • Clean data (drop nulls, handle outliers)
    • Engineer temporal and derived features
    • Generate descriptive statistics
    • Create correlation matrix heatmap
    • Visualize arrival delay distributions
    • Export 35 PNG visualizations to Dataset Application/Visualizations/
  • Dependencies: PySpark, Pandas, Matplotlib, Seaborn, findspark
  • Outputs: Statistical summaries, correlation insights, PNG images
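
A minimal sketch of this stage, assuming the pip-installed PySpark and a delay column named ARR_DELAY; the notebook's actual column names, cleaning rules, and plot set may differ:

import findspark
findspark.init()

import matplotlib
matplotlib.use("Agg")                         # headless rendering on a VM
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlightDelayEDA").getOrCreate()

df = spark.read.csv("flight_data.csv", header=True, inferSchema=True)
df = df.dropna(subset=["ARR_DELAY"])          # drop rows missing the label
df.describe("ARR_DELAY").show()               # descriptive statistics

# Sample to pandas for plotting, then export a PNG; the notebook
# writes the numbered 00.png-46.png set instead of this example name
delays = df.select("ARR_DELAY").sample(fraction=0.1, seed=42).toPandas()
plt.hist(delays["ARR_DELAY"], bins=50)
plt.xlabel("Arrival delay (minutes)")
plt.ylabel("Flight count")
plt.savefig("delay_distribution.png")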

Step2_Batch_Processing.ipynb

  • Purpose: Train and evaluate machine learning models
  • Key Tasks:
    • Load dataset with defined schema
    • Preprocess data (drop irrelevant columns, handle nulls)
    • Engineer features (temporal, categorical, numerical)
    • Build ML pipeline (StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler)
    • Train three models: GBT, Random Forest, Logistic Regression
    • Handle class imbalance with weighting or resampling
    • Evaluate models (AUC, accuracy, precision, recall, F1, confusion matrix)
    • Extract and visualize feature importance
    • Save models to disk for production use
  • Dependencies: PySpark MLlib, Pandas, NumPy, Matplotlib, Seaborn
  • Outputs:
    • flight_delay_gbt_pipeline_model/ (preprocessing pipeline)
    • flight_delay_gbt_model/ (trained GBT classifier)
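
A hedged sketch of the training stage described above; the feature lists, the 60-minute severe-delay threshold, and the hyperparameters are illustrative assumptions rather than the notebook's exact values, and the class-imbalance handling and evaluation steps are omitted for brevity:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder,
                                VectorAssembler, StandardScaler)
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("FlightDelayTraining").getOrCreate()

df = spark.read.csv("flight_data.csv", header=True, inferSchema=True)
df = df.dropna(subset=["ARR_DELAY"])
# Assumed label definition: severe delay = 60+ minutes late
df = df.withColumn("label", (col("ARR_DELAY") >= 60).cast("double"))
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

categorical = ["AIRLINE_CODE", "ORIGIN", "DEST"]
numerical = ["CRS_DEP_TIME", "DISTANCE"]      # assumed numerical features

indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx",
                          handleInvalid="keep") for c in categorical]
encoder = OneHotEncoder(inputCols=[f"{c}_idx" for c in categorical],
                        outputCols=[f"{c}_vec" for c in categorical])
num_assembler = VectorAssembler(inputCols=numerical, outputCol="num_raw")
scaler = StandardScaler(inputCol="num_raw", outputCol="num_scaled")
assembler = VectorAssembler(
    inputCols=[f"{c}_vec" for c in categorical] + ["num_scaled"],
    outputCol="features")

pipeline_model = Pipeline(stages=indexers + [encoder, num_assembler,
                                             scaler, assembler]).fit(train_df)

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
gbt_model = gbt.fit(pipeline_model.transform(train_df))

# Persist both artifacts for the streaming stage
pipeline_model.write().overwrite().save("flight_delay_gbt_pipeline_model")
gbt_model.write().overwrite().save("flight_delay_gbt_model")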

Step3_Streaming_Prediction.ipynb

  • Purpose: Real-time streaming predictions with Spark Structured Streaming
  • Key Tasks:
    • Initialize Spark session with Kafka connector
    • Load saved preprocessing pipeline and GBT model
    • Connect to Kafka topic flight_data_stream
    • Parse JSON messages into structured DataFrame
    • Apply feature engineering in real-time
    • Transform data using saved pipeline
    • Generate predictions with trained model
    • Output predictions to console, Parquet files, and MongoDB
    • Maintain checkpoints for fault tolerance
  • Dependencies: PySpark, Kafka, MongoDB connector (optional)
  • Outputs:
    • streaming_predictions_output/ (Parquet files)
    • streaming_predictions_checkpoint/ (Spark checkpoints)
    • Console logs with batch predictions
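
A condensed sketch of the streaming flow, assuming the pip-installed PySpark (Scala 2.12 build) and a simplified message schema; the notebook's full schema, derived columns such as Prediction_Label and Probability_Severe_Delay, and the console/MongoDB sinks are omitted:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.ml import PipelineModel
from pyspark.ml.classification import GBTClassificationModel

spark = (SparkSession.builder
         .appName("FlightDelayStreaming")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
         .getOrCreate())

# Assumed subset of the message schema
schema = (StructType()
          .add("FL_DATE", StringType())
          .add("AIRLINE_CODE", StringType())
          .add("ORIGIN", StringType())
          .add("DEST", StringType())
          .add("CRS_DEP_TIME", IntegerType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "flight_data_stream")
       .load())

# Kafka values arrive as bytes; decode and parse the JSON payload
flights = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), schema).alias("f"))
           .select("f.*"))

pipeline_model = PipelineModel.load("flight_delay_gbt_pipeline_model")
gbt_model = GBTClassificationModel.load("flight_delay_gbt_model")

predictions = gbt_model.transform(pipeline_model.transform(flights))
out = predictions.select("FL_DATE", "AIRLINE_CODE", "ORIGIN", "DEST",
                         "CRS_DEP_TIME", "prediction")

query = (out.writeStream
         .format("parquet")
         .option("path", "streaming_predictions_output/")
         .option("checkpointLocation", "streaming_predictions_checkpoint/")
         .start())
query.awaitTermination()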

Python Scripts

app.py

  • Purpose: Flask web server for real-time prediction dashboard
  • Key Features:
    • Reads Parquet files from ./streaming_predictions_output/
    • Aggregates and sorts predictions by flight date and departure time
    • Displays top 50 most recent predictions in HTML table
    • Auto-refreshes every 10 seconds
    • Color-codes predictions (red = severe delay, green = no delay)
    • Shows probability percentages
  • Technologies: Flask, Pandas, Jinja2 templating
  • Server: Runs on 0.0.0.0:5000 (accessible via http://<VM_IP>:5000)
  • Dependencies: Flask, Pandas, glob, os
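
A minimal sketch of a dashboard along these lines (reading the Parquet files with pandas requires pyarrow or fastparquet); the inline template and sort columns are illustrative assumptions, not app.py's exact code:

import glob
import pandas as pd
from flask import Flask, render_template_string

STREAMING_OUTPUT_PATH = "./streaming_predictions_output/"
app = Flask(__name__)

# Meta-refresh reloads the page every 10 seconds
PAGE = """
<html><head><meta http-equiv="refresh" content="10"></head>
<body>{{ table | safe }}</body></html>
"""

@app.route("/")
def dashboard():
    files = glob.glob(STREAMING_OUTPUT_PATH + "*.parquet")
    if not files:
        return "No predictions yet."
    df = pd.concat(pd.read_parquet(f) for f in files)
    # Keep the 50 most recent predictions
    df = df.sort_values(["FL_DATE", "CRS_DEP_TIME"], ascending=False).head(50)
    return render_template_string(PAGE, table=df.to_html(index=False))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)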

kafka_producer.py

  • Purpose: Stream CSV data to Kafka topic in real-time
  • Key Features:
    • Reads flight_data.csv row by row
    • Converts CSV rows to JSON format
    • Publishes messages to Kafka topic flight_data_stream
    • Configurable streaming speed (records per second)
    • Batch flushing every 1,000 records
    • Casts field values to their proper types (int, float, string)
    • Command-line interface with argparse
  • CLI Options:
    • --csv: Path to CSV file (default: flight_data.csv)
    • --topic: Kafka topic name (default: flight_data_stream)
    • --servers: Kafka bootstrap servers (default: localhost:9092)
    • --speed: Records/second (default: 100, 0 = max speed)
  • Dependencies: kafka-python, argparse, json, csv
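
A condensed sketch of the producer's core loop, using the CLI options listed above; the per-field type casting and error handling of the real script are omitted:

import argparse
import csv
import json
import time

from kafka import KafkaProducer

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--csv", default="flight_data.csv")
    p.add_argument("--topic", default="flight_data_stream")
    p.add_argument("--servers", default="localhost:9092")
    p.add_argument("--speed", type=int, default=100)  # records/sec, 0 = max
    args = p.parse_args()

    producer = KafkaProducer(
        bootstrap_servers=args.servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    with open(args.csv, newline="") as f:
        # The real script also casts numeric fields to int/float here
        for i, row in enumerate(csv.DictReader(f), start=1):
            producer.send(args.topic, row)
            if i % 1000 == 0:              # batch flush every 1,000 records
                producer.flush()
            if args.speed > 0:
                time.sleep(1 / args.speed)
    producer.flush()

if __name__ == "__main__":
    main()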

Dataset Application (Static Web App)

Dataset Application/index.html

  • Purpose: Interactive web interface for exploring visualizations
  • Features:
    • 7Γ—5 matrix selector for dataset visualizations
    • 3D matrix selector (2Γ—3Γ—2) for model evaluation visualizations
    • Button-based navigation (row + column selection)
    • Dynamic image loading based on selection
    • Active button highlighting (green)
    • Current selection display
    • Graceful handling of missing images
  • Dependencies: Embedded JavaScript (script.js), CSS (styles.css)

Dataset Application/script.js

  • Purpose: Client-side logic for interactive image selection
  • Features:
    • Event listeners for column, row, and 3D dimension buttons
    • Dynamic image path construction (./Visualizations/{row}{col}.png)
    • Active state management (green highlighting)
    • Image error handling (displays "Image not found")
    • Selection text updates
    • Two separate visualization systems (2D matrix and 3D matrix)
  • Technologies: Vanilla JavaScript (ES6), DOM manipulation

Dataset Application/styles.css

  • Purpose: Visual styling and layout for the web app
  • Features:
    • Flexbox-based responsive layout
    • Button hover effects and active states
    • Green highlighting for active buttons
    • Styled image containers with borders
    • Section dividers with gradient
    • Professional typography and spacing
    • Mobile-responsive design considerations
  • Technologies: CSS3, Flexbox

Dataset Application/Visualizations/

  • Purpose: Store pre-generated PNG visualization images
  • Contents:
    • Dataset Matrix (00.png - 46.png):
      • 7 columns Γ— 5 rows = 35 images
      • Categories: Day of Week, Month, Hour, Origin, Dest, Airline, Delay Cause
      • Metrics: Avg Delay, Flight Count, Severe Delays, Proportions, Ratios
    • Model Evaluation Matrix (800.png - 921.png):
      • 2 evaluation types Γ— 3 models Γ— 2 balancing methods = 12 images
      • Confusion matrices and evaluation metrics
      • Models: RFC, LR, GBT
      • Balancing: Weighting, Resampling
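
The numbering scheme, as inferred from the descriptions above, can be enumerated like this:

# Filename scheme inferred from the matrix descriptions: dataset images
# are {row}{col}.png, model-evaluation images are {8 or 9}{model}{balance}.png
dataset_images = [f"{row}{col}.png" for row in range(5) for col in range(7)]
# -> 00.png ... 46.png (35 files)

model_images = [f"{etype}{model}{balance}.png"
                for etype in (8, 9)        # 2 evaluation types
                for model in range(3)      # RFC, LR, GBT
                for balance in range(2)]   # weighting, resampling
# -> 800.png ... 921.png (12 files)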

Configuration & Documentation

.gitignore

  • Purpose: Specify files/folders to exclude from Git version control
  • Typical Exclusions:
  .DS_Store
  flight_data.csv
  flight_delay_gbt_pipeline_model/
  flight_delay_gbt_model/
  streaming_predictions_output/
  streaming_predictions_checkpoint/
  mongodb_checkpoint/
  *.pyc
  __pycache__/
  .ipynb_checkpoints/
  venv/
  .env
  *.log

LICENSE

  • Type: MIT License
  • Purpose: Define open-source usage terms and permissions
  • Allows: Commercial use, modification, distribution, private use
  • Requires: License and copyright notice inclusion
  • Liability: No warranty provided

README.md

  • Purpose: Primary project documentation and setup guide
  • Contents:
    • Project overview
    • Streaming service setup instructions
    • GCP VM deployment steps
    • Kafka, Spark, Flask configuration
    • Dataset application usage
    • Step-by-step workflow
    • Access URLs and commands
  • Target Audience: Developers, data scientists, instructors

πŸ—‚οΈ Generated Artifacts (Not in Git Repository)

These files and folders are created during runtime and are excluded from version control:

Model Files

flight_delay_gbt_pipeline_model/

  • Created By: Step2_Batch_Processing.ipynb
  • Purpose: Saved Spark ML preprocessing pipeline
  • Contains:
    • StringIndexer models for categorical features (AIRLINE_CODE, ORIGIN, DEST)
    • OneHotEncoder models for categorical encoding
    • VectorAssembler for numerical features
    • StandardScaler for feature normalization
    • Final VectorAssembler combining all features
  • Size: ~10-50 MB (depending on unique categorical values)
  • Format: Spark PipelineModel (metadata + binary data)

flight_delay_gbt_model/

  • Created By: Step2_Batch_Processing.ipynb
  • Purpose: Saved trained Gradient Boosted Tree classifier
  • Contains:
    • GBT model weights and structure
    • Feature importance scores
    • Model metadata (hyperparameters, training metrics)
  • Size: ~20-100 MB (depending on tree depth and iterations)
  • Format: Spark GBTClassificationModel (metadata + binary data)

Streaming Output

streaming_predictions_output/

  • Created By: Step3_Streaming_Prediction.ipynb
  • Purpose: Store real-time predictions in Parquet format
  • Contains:
    • Multiple .parquet files (one per micro-batch)
    • Columns: FL_DATE, AIRLINE_CODE, ORIGIN, DEST, CRS_DEP_TIME, Prediction_Label, Probability_Severe_Delay
    • Partitioned by batch timestamp
  • Size: Grows continuously (1-5 MB per 1,000 records)
  • Format: Apache Parquet (columnar storage)
  • Read By: app.py (Flask dashboard)

streaming_predictions_checkpoint/

  • Created By: Step3_Streaming_Prediction.ipynb (Spark Structured Streaming)
  • Purpose: Enable fault tolerance and exactly-once processing
  • Contains:
    • Offset tracking for Kafka consumer
    • State metadata for streaming query
    • Recovery information for restarts
  • Size: ~1-10 MB
  • Format: Spark checkpoint files (binary)
  • Behavior: Automatically managed by Spark

mongodb_checkpoint/

  • Created By: Step3_Streaming_Prediction.ipynb (if using MongoDB output)
  • Purpose: Checkpoint directory for MongoDB streaming sink
  • Contains:
    • MongoDB-specific offset tracking
    • Write-ahead logs for recovery
  • Size: ~1-5 MB
  • Format: Spark checkpoint files (binary)
  • Note: Only created if MongoDB output is enabled (commented out in current code)

Dataset (Not Included in Repository)

flight_data.csv

  • Source: External flight data provider (user-supplied)
  • Purpose: Input dataset for training and streaming
  • Typical Size: 100 MB - 2 GB (depending on number of records)
  • Columns: 32 fields including flight dates, airlines, airports, delays, distances
  • Excluded from Git: Too large for version control (listed in .gitignore)
  • Must Be Provided By User: Required for all notebooks to run

πŸ”§ Dependencies & Configuration Files

Not Included (User Must Create)

requirements.txt (Optional)

pyspark==3.5.1
pandas>=1.5.0
numpy>=1.23.0
matplotlib>=3.5.0
seaborn>=0.12.0
kafka-python>=2.0.2
flask>=2.3.0
findspark>=2.0.1

Environment Variables (Optional)

# For production deployment
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=python3
export KAFKA_HOME=/opt/kafka
export FLASK_ENV=production
export MONGODB_URI=mongodb://localhost:27017/flightdb

Embedded Configuration

Spark Configuration (in notebooks):

from pyspark.sql import SparkSession

# Note: the pip-installed pyspark 3.5.1 wheel is built against Scala 2.12,
# so the Kafka connector must use the matching _2.12 artifact
spark = SparkSession.builder \
    .appName("FlightDelayPrediction") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
    .getOrCreate()

Kafka Configuration (in kafka_producer.py):

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    linger_ms=10,
    batch_size=16384
)

Flask Configuration (in app.py):

STREAMING_OUTPUT_PATH = "./streaming_predictions_output/"

# debug=True is convenient during development; disable it for production
app.run(host="0.0.0.0", port=5000, debug=True)

πŸ“Š Data Flow Diagram

flight_data.csv
      ↓
kafka_producer.py β†’ Kafka Topic (flight_data_stream)
      ↓
Step3_Streaming_Prediction.ipynb
      ↓
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ Load Models  β”‚ ← flight_delay_gbt_pipeline_model/
   β”‚              β”‚ ← flight_delay_gbt_model/
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      ↓
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ Predictions  β”‚ β†’ streaming_predictions_output/ (Parquet)
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β†’ Console (stdout)
      ↓                β†’ MongoDB (optional)
   app.py (Flask)
      ↓
Web Dashboard (http://<VM_IP>:5000)

πŸ” File Size Reference

| File/Folder                      | Typical Size  | Notes                                   |
|----------------------------------|---------------|-----------------------------------------|
| Step1_Data_Visualization.ipynb   | 1-3 MB        | Contains inline images and outputs      |
| Step2_Batch_Processing.ipynb     | 2-5 MB        | Contains confusion matrices and charts  |
| Step3_Streaming_Prediction.ipynb | 500 KB - 1 MB | Minimal outputs, mostly code            |
| app.py                           | 3-5 KB        | Small Flask server script               |
| kafka_producer.py                | 4-6 KB        | Lightweight streaming script            |
| Dataset Application/             | 50-200 MB     | PNG images are the largest component    |
| flight_data.csv                  | 100 MB - 2 GB | User-provided, not in repo              |
| flight_delay_gbt_pipeline_model/ | 10-50 MB      | Saved preprocessing pipeline            |
| flight_delay_gbt_model/          | 20-100 MB     | Trained GBT classifier                  |
| streaming_predictions_output/    | Growing       | Accumulates over time                   |

πŸš€ Workflow Summary

  1. Data Exploration β†’ Run Step1_Data_Visualization.ipynb
  2. Model Training β†’ Run Step2_Batch_Processing.ipynb (saves models)
  3. Start Streaming β†’ Run kafka_producer.py (Terminal 1)
  4. Start Predictions β†’ Run Step3_Streaming_Prediction.ipynb (Jupyter)
  5. Start Dashboard β†’ Run app.py (Terminal 2)
  6. View Predictions β†’ Access http://<VM_IP>:5000
  7. Explore Visualizations β†’ Open Dataset Application/index.html (local machine)

πŸ’‘ Best Practices

Version Control

  • βœ… Commit: Notebooks, scripts, HTML/CSS/JS, README, LICENSE
  • ❌ Exclude: Dataset (CSV), models, Parquet outputs, checkpoints, logs
  • πŸ“ Update .gitignore to prevent accidental commits of large files

Data Management

  • Store flight_data.csv separately (Google Drive, S3, etc.)
  • Backup trained models before re-running Step2
  • Periodically clean streaming_predictions_output/ to save disk space (see the sketch after this list)
  • Use compression for long-term storage: tar -czf predictions.tar.gz streaming_predictions_output/
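
One possible implementation of the periodic cleanup, assuming an arbitrary seven-day retention window; stop the streaming query before sweeping, since Spark's _spark_metadata log in that folder will otherwise reference deleted files:

import glob
import os
import time

CUTOFF = time.time() - 7 * 24 * 3600   # keep the last seven days (adjust to taste)

for path in glob.glob("streaming_predictions_output/*.parquet"):
    if os.path.getmtime(path) < CUTOFF:
        os.remove(path)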

Development Workflow

  • Use virtual environments to isolate Python dependencies
  • Test notebooks on small data samples before full runs
  • Monitor disk space during streaming (Parquet files accumulate)
  • Keep separate development and production model directories

Documentation

  • Add inline comments to notebooks for clarity
  • Document custom modifications in Docs/ folder
  • Update README.md when changing deployment steps
  • Version models with timestamps (e.g., flight_delay_gbt_model_20241223/)