Project Structure
Matthew Smith edited this page Dec 24, 2025 · 4 revisions
FlightDelay/
├── Dataset Application/                  # Static visualization web app
│   ├── index.html                        # Main HTML interface with matrix selector
│   ├── script.js                         # Interactive button logic and image loading
│   ├── styles.css                        # Styling, layout, and responsive design
│   └── Visualizations/                   # Pre-generated PNG visualizations
│       ├── 00.png - 46.png               # 7×5 dataset visualization matrix
│       └── 800.png - 921.png             # 2×3×2 model evaluation matrix
│
├── Docs/                                 # Additional documentation (optional)
│   └── [Project reports, presentations, notes]
│
├── Step1_Data_Visualization.ipynb        # Exploratory Data Analysis (EDA)
├── Step2_Batch_Processing.ipynb          # Model training and evaluation
├── Step3_Streaming_Prediction.ipynb      # Real-time inference pipeline
│
├── app.py                                # Flask web server for prediction dashboard
├── kafka_producer.py                     # Kafka producer for streaming CSV data
│
├── flight_data.csv                       # Input dataset (NOT in repository)
│
├── .gitignore                            # Git exclusion rules
├── LICENSE                               # MIT License
└── README.md                             # Setup and usage instructions
Step1_Data_Visualization.ipynb
- Purpose: Exploratory Data Analysis (EDA) and data visualization
- Key Tasks:
- Load and inspect flight dataset
- Clean data (drop nulls, handle outliers)
- Engineer temporal and derived features
- Generate descriptive statistics
- Create correlation matrix heatmap
- Visualize arrival delay distributions
- Export 35 PNG visualizations to `Dataset Application/Visualizations/`
- Dependencies: PySpark, Pandas, Matplotlib, Seaborn, findspark
- Outputs: Statistical summaries, correlation insights, PNG images
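To make the export step concrete, here is a minimal sketch of how one cell of the visualization matrix could be produced; the column names and delay values below are invented for illustration, not taken from the notebook:

```python
# Hypothetical sketch: render one cell of the 7x5 matrix and save it as 00.png.
# The DAY_OF_WEEK / AVG_ARR_DELAY columns and values are illustrative only.
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "DAY_OF_WEEK": [1, 2, 3, 4, 5, 6, 7],
    "AVG_ARR_DELAY": [12.1, 9.8, 10.5, 11.2, 15.7, 8.9, 13.4],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(df["DAY_OF_WEEK"], df["AVG_ARR_DELAY"], color="steelblue")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Average Arrival Delay (min)")
ax.set_title("Avg Delay by Day of Week")
fig.savefig("00.png", dpi=100)  # cell (row 0, col 0) of the matrix
```

In the real notebook each `{row}{col}.png` filename encodes the category (row) and metric (column) selected in the web app.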
Step2_Batch_Processing.ipynb
- Purpose: Train and evaluate machine learning models
- Key Tasks:
- Load dataset with defined schema
- Preprocess data (drop irrelevant columns, handle nulls)
- Engineer features (temporal, categorical, numerical)
- Build ML pipeline (StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler)
- Train three models: GBT, Random Forest, Logistic Regression
- Handle class imbalance with weighting or resampling
- Evaluate models (AUC, accuracy, precision, recall, F1, confusion matrix)
- Extract and visualize feature importance
- Save models to disk for production use
- Dependencies: PySpark MLlib, Pandas, NumPy, Matplotlib, Seaborn
- Outputs:
  - `flight_delay_gbt_pipeline_model/` (preprocessing pipeline)
  - `flight_delay_gbt_model/` (trained GBT classifier)
Step3_Streaming_Prediction.ipynb
- Purpose: Real-time streaming predictions with Spark Structured Streaming
- Key Tasks:
- Initialize Spark session with Kafka connector
- Load saved preprocessing pipeline and GBT model
- Connect to Kafka topic `flight_data_stream`
- Parse JSON messages into structured DataFrame
- Apply feature engineering in real-time
- Transform data using saved pipeline
- Generate predictions with trained model
- Output predictions to console, Parquet files, and MongoDB
- Maintain checkpoints for fault tolerance
- Dependencies: PySpark, Kafka, MongoDB connector (optional)
- Outputs:
  - `streaming_predictions_output/` (Parquet files)
  - `streaming_predictions_checkpoint/` (Spark checkpoints)
  - Console logs with batch predictions
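The wiring of these tasks can be sketched as below; it assumes a running Kafka broker and the `spark-sql-kafka` package, and the two schema fields shown are illustrative stand-ins for the full flight schema:

```python
# Hedged sketch of the Step3 streaming job: Kafka source -> saved pipeline +
# GBT model -> Parquet sink with a checkpoint for fault tolerance.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.ml import PipelineModel
from pyspark.ml.classification import GBTClassificationModel

spark = (SparkSession.builder.appName("FlightDelayStreaming")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1")
         .getOrCreate())

# Illustrative subset of the message schema
schema = (StructType()
          .add("AIRLINE_CODE", StringType())
          .add("CRS_DEP_TIME", IntegerType()))

pipeline_model = PipelineModel.load("flight_delay_gbt_pipeline_model")
gbt_model = GBTClassificationModel.load("flight_delay_gbt_model")

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "flight_data_stream")
       .load())

# Kafka values arrive as bytes; decode and parse the JSON payload
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("r"))
          .select("r.*"))
preds = gbt_model.transform(pipeline_model.transform(parsed))

query = (preds.select("AIRLINE_CODE", "CRS_DEP_TIME", "prediction")
         .writeStream.format("parquet")
         .option("path", "streaming_predictions_output/")
         .option("checkpointLocation", "streaming_predictions_checkpoint/")
         .start())
query.awaitTermination()
```

The `checkpointLocation` option is what gives the query its restart/fault-tolerance behavior described below.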
app.py
- Purpose: Flask web server for real-time prediction dashboard
- Key Features:
- Reads Parquet files from `./streaming_predictions_output/`
- Aggregates and sorts predictions by flight date and departure time
- Displays top 50 most recent predictions in HTML table
- Auto-refreshes every 10 seconds
- Color-codes predictions (red = severe delay, green = no delay)
- Shows probability percentages
- Technologies: Flask, Pandas, Jinja2 templating
- Server: Runs on `0.0.0.0:5000` (accessible via `http://<VM_IP>:5000`)
- Dependencies: Flask, Pandas, glob, os
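A minimal sketch of the dashboard's core logic is below. The Parquet column names follow the list given later for `streaming_predictions_output/`; the real `app.py` renders a Jinja2 template rather than returning a bare table:

```python
# Hedged sketch of app.py: gather micro-batch Parquet files, keep the 50 most
# recent predictions, and serve them as an HTML table.
import glob

import pandas as pd
from flask import Flask

app = Flask(__name__)
STREAMING_OUTPUT_PATH = "./streaming_predictions_output/"

def load_predictions(path: str) -> pd.DataFrame:
    """Concatenate every micro-batch Parquet file under `path`."""
    files = glob.glob(path + "*.parquet")
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

def top_recent(df: pd.DataFrame, n: int = 50) -> pd.DataFrame:
    """Sort by flight date and scheduled departure time, newest first."""
    return df.sort_values(["FL_DATE", "CRS_DEP_TIME"], ascending=False).head(n)

@app.route("/")
def dashboard():
    rows = top_recent(load_predictions(STREAMING_OUTPUT_PATH))
    return rows.to_html(index=False)

# The real server is started with app.run(host="0.0.0.0", port=5000)
```

Color coding and the 10-second auto-refresh would live in the template/HTML layer, not in this data-access logic.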
kafka_producer.py
- Purpose: Stream CSV data to Kafka topic in real-time
- Key Features:
- Reads `flight_data.csv` row by row
- Converts CSV rows to JSON format
- Publishes messages to Kafka topic `flight_data_stream`
- Configurable streaming speed (records per second)
- Batch flushing every 1,000 records
- Type conversion (int, float, string) for proper data types
- Command-line interface with argparse
- CLI Options:
  - `--csv`: Path to CSV file (default: `flight_data.csv`)
  - `--topic`: Kafka topic name (default: `flight_data_stream`)
  - `--servers`: Kafka bootstrap servers (default: `localhost:9092`)
  - `--speed`: Records/second (default: 100, 0 = max speed)
- Dependencies: kafka-python, argparse, json, csv
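The producer's row handling can be sketched as follows; the type-conversion order (int, then float, then string) mirrors the feature list above, while the function names are illustrative:

```python
# Hedged sketch of kafka_producer.py's core logic. The kafka-python import is
# kept inside stream_csv so the conversion helper works without a broker.
import csv
import json

def row_to_message(row: dict) -> dict:
    """Cast numeric-looking CSV strings so Spark receives proper JSON types."""
    out = {}
    for key, value in row.items():
        try:
            out[key] = int(value)
        except ValueError:
            try:
                out[key] = float(value)
            except ValueError:
                out[key] = value  # leave non-numeric fields as strings
    return out

def stream_csv(path: str, topic: str = "flight_data_stream",
               servers: str = "localhost:9092") -> None:
    """Publish each CSV row as a JSON message, flushing every 1,000 records."""
    from kafka import KafkaProducer  # requires kafka-python and a running broker
    producer = KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    with open(path, newline="") as fh:
        for i, row in enumerate(csv.DictReader(fh), start=1):
            producer.send(topic, row_to_message(row))
            if i % 1000 == 0:
                producer.flush()
    producer.flush()
```

The real script additionally throttles sends to the `--speed` rate and exposes these parameters through argparse.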
Dataset Application/index.html
- Purpose: Interactive web interface for exploring visualizations
- Features:
- 7×5 matrix selector for dataset visualizations
- 3D matrix selector (2×3×2) for model evaluation visualizations
- Button-based navigation (row + column selection)
- Dynamic image loading based on selection
- Active button highlighting (green)
- Current selection display
- Graceful handling of missing images
- Dependencies: Embedded JavaScript (script.js), CSS (styles.css)
Dataset Application/script.js
- Purpose: Client-side logic for interactive image selection
- Features:
- Event listeners for column, row, and 3D dimension buttons
- Dynamic image path construction (`./Visualizations/{row}{col}.png`)
- Active state management (green highlighting)
- Image error handling (displays "Image not found")
- Selection text updates
- Two separate visualization systems (2D matrix and 3D matrix)
- Technologies: Vanilla JavaScript (ES6), DOM manipulation
Dataset Application/styles.css
- Purpose: Visual styling and layout for the web app
- Features:
- Flexbox-based responsive layout
- Button hover effects and active states
- Color-coded predictions (green for active)
- Styled image containers with borders
- Section dividers with gradient
- Professional typography and spacing
- Mobile-responsive design considerations
- Technologies: CSS3, Flexbox
Dataset Application/Visualizations/
- Purpose: Store pre-generated PNG visualization images
- Contents:
  - Dataset Matrix (00.png - 46.png):
    - 7 columns × 5 rows = 35 images
    - Categories: Day of Week, Month, Hour, Origin, Dest, Airline, Delay Cause
    - Metrics: Avg Delay, Flight Count, Severe Delays, Proportions, Ratios
  - Model Evaluation Matrix (800.png - 921.png):
    - 2 evaluation types × 3 models × 2 balancing methods = 12 images
    - Confusion matrices and evaluation metrics
    - Models: RFC, LR, GBT
    - Balancing: Weighting, Resampling
.gitignore
- Purpose: Specify files/folders to exclude from Git version control
- Typical Exclusions:
.DS_Store
flight_data.csv
flight_delay_gbt_pipeline_model/
flight_delay_gbt_model/
streaming_predictions_output/
streaming_predictions_checkpoint/
mongodb_checkpoint/
*.pyc
__pycache__/
.ipynb_checkpoints/
venv/
.env
*.log
LICENSE
- Type: MIT License
- Purpose: Define open-source usage terms and permissions
- Allows: Commercial use, modification, distribution, private use
- Requires: License and copyright notice inclusion
- Liability: No warranty provided
README.md
- Purpose: Primary project documentation and setup guide
- Contents:
- Project overview
- Streaming service setup instructions
- GCP VM deployment steps
- Kafka, Spark, Flask configuration
- Dataset application usage
- Step-by-step workflow
- Access URLs and commands
- Target Audience: Developers, data scientists, instructors
These files and folders are created during runtime and are excluded from version control:
flight_delay_gbt_pipeline_model/
- Created By: Step2_Batch_Processing.ipynb
- Purpose: Saved Spark ML preprocessing pipeline
- Contains:
- StringIndexer models for categorical features (AIRLINE_CODE, ORIGIN, DEST)
- OneHotEncoder models for categorical encoding
- VectorAssembler for numerical features
- StandardScaler for feature normalization
- Final VectorAssembler combining all features
- Size: ~10-50 MB (depending on unique categorical values)
- Format: Spark PipelineModel (metadata + binary data)
flight_delay_gbt_model/
- Created By: Step2_Batch_Processing.ipynb
- Purpose: Saved trained Gradient Boosted Tree classifier
- Contains:
- GBT model weights and structure
- Feature importance scores
- Model metadata (hyperparameters, training metrics)
- Size: ~20-100 MB (depending on tree depth and iterations)
- Format: Spark GBTClassificationModel (metadata + binary data)
streaming_predictions_output/
- Created By: Step3_Streaming_Prediction.ipynb
- Purpose: Store real-time predictions in Parquet format
- Contains:
  - Multiple `.parquet` files (one per micro-batch)
  - Columns: FL_DATE, AIRLINE_CODE, ORIGIN, DEST, CRS_DEP_TIME, Prediction_Label, Probability_Severe_Delay
  - Partitioned by batch timestamp
- Size: Grows continuously (1-5 MB per 1,000 records)
- Format: Apache Parquet (columnar storage)
- Read By: app.py (Flask dashboard)
streaming_predictions_checkpoint/
- Created By: Step3_Streaming_Prediction.ipynb (Spark Structured Streaming)
- Purpose: Enable fault tolerance and exactly-once processing
- Contains:
- Offset tracking for Kafka consumer
- State metadata for streaming query
- Recovery information for restarts
- Size: ~1-10 MB
- Format: Spark checkpoint files (binary)
- Behavior: Automatically managed by Spark
mongodb_checkpoint/
- Created By: Step3_Streaming_Prediction.ipynb (if using MongoDB output)
- Purpose: Checkpoint directory for MongoDB streaming sink
- Contains:
- MongoDB-specific offset tracking
- Write-ahead logs for recovery
- Size: ~1-5 MB
- Format: Spark checkpoint files (binary)
- Note: Only created if MongoDB output is enabled (commented out in current code)
flight_data.csv
- Source: External flight data provider (user-supplied)
- Purpose: Input dataset for training and streaming
- Typical Size: 100 MB - 2 GB (depending on number of records)
- Columns: 32 fields including flight dates, airlines, airports, delays, distances
- Excluded from Git: Too large for version control (listed in `.gitignore`)
- Must Be Provided By User: Required for all notebooks to run
requirements.txt (Optional)
pyspark==3.5.1
pandas>=1.5.0
numpy>=1.23.0
matplotlib>=3.5.0
seaborn>=0.12.0
kafka-python>=2.0.2
flask>=2.3.0
findspark>=2.0.1
Environment Variables (Optional)
# For production deployment
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=python3
export KAFKA_HOME=/opt/kafka
export FLASK_ENV=production
export MONGODB_URI=mongodb://localhost:27017/flightdb

Spark Configuration (in notebooks):
spark = SparkSession.builder \
.appName("FlightDelayPrediction") \
.config("spark.driver.memory", "16g") \
.config("spark.executor.memory", "16g") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1") \
    .getOrCreate()

Kafka Configuration (in kafka_producer.py):
producer = KafkaProducer(
bootstrap_servers='localhost:9092',
value_serializer=lambda v: json.dumps(v).encode('utf-8'),
linger_ms=10,
batch_size=16384
)

Flask Configuration (in app.py):
STREAMING_OUTPUT_PATH = "./streaming_predictions_output/"
app.run(host="0.0.0.0", port=5000, debug=True)

Data Flow:

flight_data.csv
      ↓
kafka_producer.py → Kafka Topic (flight_data_stream)
      ↓
Step3_Streaming_Prediction.ipynb
      ↓
┌──────────────┐
│ Load Models  │ ← flight_delay_gbt_pipeline_model/
│              │ ← flight_delay_gbt_model/
└──────────────┘
      ↓
┌──────────────┐
│ Predictions  │ → streaming_predictions_output/ (Parquet)
└──────────────┘ → Console (stdout)
      ↓          → MongoDB (optional)
app.py (Flask)
      ↓
Web Dashboard (http://<VM_IP>:5000)
| File/Folder | Typical Size | Notes |
|---|---|---|
| Step1_Data_Visualization.ipynb | 1-3 MB | Contains inline images and outputs |
| Step2_Batch_Processing.ipynb | 2-5 MB | Contains confusion matrices and charts |
| Step3_Streaming_Prediction.ipynb | 500 KB - 1 MB | Minimal outputs, mostly code |
| app.py | 3-5 KB | Small Flask server script |
| kafka_producer.py | 4-6 KB | Lightweight streaming script |
| Dataset Application/ | 50-200 MB | PNG images are largest component |
| flight_data.csv | 100 MB - 2 GB | User-provided, not in repo |
| flight_delay_gbt_pipeline_model/ | 10-50 MB | Saved preprocessing pipeline |
| flight_delay_gbt_model/ | 20-100 MB | Trained GBT classifier |
| streaming_predictions_output/ | Growing | Accumulates over time |
- Data Exploration → Run `Step1_Data_Visualization.ipynb`
- Model Training → Run `Step2_Batch_Processing.ipynb` (saves models)
- Start Streaming → Run `kafka_producer.py` (Terminal 1)
- Start Predictions → Run `Step3_Streaming_Prediction.ipynb` (Jupyter)
- Start Dashboard → Run `app.py` (Terminal 2)
- View Predictions → Access `http://<VM_IP>:5000`
- Explore Visualizations → Open `Dataset Application/index.html` (local machine)
- Commit: Notebooks, scripts, HTML/CSS/JS, README, LICENSE
- Exclude: Dataset (CSV), models, Parquet outputs, checkpoints, logs
- Update `.gitignore` to prevent accidental commits of large files
- Store `flight_data.csv` separately (Google Drive, S3, etc.)
- Backup trained models before re-running Step2
- Periodically clean `streaming_predictions_output/` to save disk space
- Use compression for long-term storage: `tar -czf predictions.tar.gz streaming_predictions_output/`
- Use virtual environments to isolate Python dependencies
- Test notebooks on small data samples before full runs
- Monitor disk space during streaming (Parquet files accumulate)
- Keep separate development and production model directories
- Add inline comments to notebooks for clarity
- Document custom modifications in `Docs/` folder
- Update README.md when changing deployment steps
- Version models with timestamps (e.g., `flight_delay_gbt_model_20241223/`)
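The timestamped model naming above can be generated with a small helper (the function name here is illustrative):

```python
# Hypothetical helper producing directory names like flight_delay_gbt_model_20241223
from datetime import datetime

def versioned_model_dir(base: str = "flight_delay_gbt_model") -> str:
    """Append today's date (YYYYMMDD) to a model directory name."""
    return f"{base}_{datetime.now().strftime('%Y%m%d')}"
```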