- Overview
- Amazon Product Search
- Spark Implementation of Amazon Product Search
- Flight Delay Prediction
This project is divided into three main problems:
- Amazon Product Search: Download, preprocess, and analyze product data from Amazon. Build a search engine using an inverted index and cosine similarity.
- Spark Implementation of Problem 1: Re-implement the functionality of Problem 1 using Apache Spark.
- Flight Delay Prediction: Use PySpark’s MLlib to build, tune, and evaluate machine learning models predicting flight delays.
product_search_and_flight_delay_ml/
├── data/
│   └── raw/
│       ├── computer_results_default.tsv
│       ├── dictionary.html
│       └── flights_sample_3m.csv
├── files/
│   ├── description/
│   │   └── homework2.pdf
│   └── eda/
│       ├── Average Departure and Arrival Delays by Airline.png
│       ├── Distribution of Arrival Delays.png
│       ├── Correlation Heatmap for Numerical Features.png
│       └── ...
├── Problem1/
│   ├── analysis/
│   │   ├── LDAAnalyzer.py
│   │   └── WordFrequencyAnalyzer.py
│   ├── scraping/
│   │   └── AmazonScraper.py
│   ├── search/
│   │   └── SearchEngine.py
│   └── text_processing/
│       └── TextPreprocessor.py
├── Problem2/
│   ├── SparkLDAAnalyzer.py
│   ├── SparkPreprocessing.py
│   └── SparkSearchEngine.py
├── Problem3/
│   ├── analysis/
│   │   └── FlightDataAnalyzer.py
│   ├── data_preparation/
│   │   └── FlightDataLoader.py
│   ├── evaluation/
│   │   ├── ModelEvaluator.py
│   │   └── Visualizer.py
│   └── ml_models/
│       ├── LogisticRegressionModel.py
│       ├── RandomForestModel.py
│       ├── NeuralNetworkModel.py
│       └── GradientBoostedTreesModel.py
├── LICENSE
├── README.md
├── main_amazon.py
├── main_flight.py
└── requirements.txt
This section details the first part of the project: scraping product data from Amazon, preprocessing it, and building a search engine with an inverted index and cosine similarity.
The goal is to gather and process data on Amazon products for the keyword "computer". The task involves:
- Scraping product details such as description, price, prime status, URL, ratings, and reviews.
- Preprocessing textual descriptions for efficient analysis and search.
- Building a search engine to allow users to query products based on textual descriptions, ranked by relevance.
The AmazonScraper class handles the extraction of product data using Python libraries:
- Requests: To send HTTP requests to Amazon and fetch HTML pages.
- BeautifulSoup: For parsing HTML and extracting relevant data.
- Random Delays: Introduced using `time.sleep()` to avoid being blocked by Amazon.
Data Captured:
- Description: A brief overview of the product.
- Price: Extracted and converted into a float.
- Prime Status: Boolean indicating Prime eligibility.
- URL: The product's page link.
- Rating: Average star rating.
- Reviews: Number of customer reviews.
The scraped data is saved as a `.tsv` file in the `data/raw/` directory.
Code Reference: See `AmazonScraper.py` for the full implementation.
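As a rough illustration of the parsing step, here is a minimal sketch using BeautifulSoup on a static HTML snippet. The CSS class names and markup are hypothetical stand-ins for Amazon's real (and frequently changing) markup, and the real scraper additionally spaces out its requests with random `time.sleep()` delays:

```python
# Hedged sketch: parse product fields from a static HTML snippet.
# Class names are illustrative, not Amazon's actual markup.
from bs4 import BeautifulSoup

html = """
<div class="s-result-item">
  <h2>Notebook 15.6" Intel Core i7, 16GB RAM</h2>
  <span class="a-price">€549,99</span>
  <span class="a-icon-alt">4.5 out of 5 stars</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="s-result-item")
description = item.h2.get_text(strip=True)
# convert a European-formatted price string ("549,99") into a float
price = float(item.find("span", class_="a-price").get_text(strip=True)
              .lstrip("€").replace(".", "").replace(",", "."))
rating = float(item.find("span", class_="a-icon-alt").get_text().split()[0])

print(description, price, rating)
```

The same pattern scales to a full results page by iterating over every matching result container.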
The TextPreprocessor class processes product descriptions for analysis:
- Tokenization: Splitting text into individual words or tokens.
- Stopword Removal: Eliminating common, uninformative words.
- Lemmatization and Stemming: Reducing words to their base or root form.
- Multi-word Term Preservation: Retaining phrases like "Windows 10" as a single token.
Purpose:
- Ensure uniformity in data representation.
- Prepare descriptions for use in the search engine and topic modeling.
Code Reference: See `TextPreprocessor.py` for details.
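The steps above can be sketched as follows. This is a simplified, dependency-free stand-in: the actual TextPreprocessor relies on NLTK for stopwords, lemmatization, and stemming, and the stopword list, multi-word map, and suffix rules here are illustrative only:

```python
# Simplified preprocessing sketch (the real pipeline uses NLTK).
import re

STOPWORDS = {"a", "the", "with", "and", "of"}          # illustrative subset
MULTIWORD = {"windows 10": "windows_10", "full hd": "full_hd"}

def naive_stem(token):
    # crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()
    for phrase, joined in MULTIWORD.items():   # preserve multi-word terms
        text = text.replace(phrase, joined)
    tokens = re.findall(r"[a-z0-9_]+", text)
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Laptop with Windows 10 and Full HD display"))
# → ['laptop', 'windows_10', 'full_hd', 'display']
```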
The WordFrequencyAnalyzer class calculates and visualizes:
- Word Frequencies: Most common terms in product descriptions.
- Bigrams and Trigrams: Frequent two- and three-word combinations.
Visualizations include bar charts for top terms to help understand the dataset's content.
Code Reference: See `WordFrequencyAnalyzer.py` for the implementation.
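The counting itself can be sketched with nothing but the standard library (the real analyzer also produces the bar-chart visualizations; the sample documents below are made up):

```python
# N-gram frequency counting over tokenized descriptions.
from collections import Counter

docs = [
    ["notebook", "intel", "display", "full_hd"],
    ["notebook", "display", "full_hd", "ssd"],
]

def ngrams(tokens, n):
    # sliding window of n consecutive tokens
    return list(zip(*(tokens[i:] for i in range(n))))

words = Counter(t for doc in docs for t in doc)
bigrams = Counter(bg for doc in docs for bg in ngrams(doc, 2))

print(words.most_common(2))    # [('notebook', 2), ('display', 2)]
print(bigrams.most_common(1))  # [(('display', 'full_hd'), 2)]
```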
The SearchEngine class uses TF-IDF (Term Frequency–Inverse Document Frequency) to rank products by cosine similarity:
- Inverted Index: Maps terms to product descriptions containing them.
- Query Processing: Allows users to search for products using keywords.
- Ranking: Returns results ranked by relevance, considering query-document similarity.
Key Features:
- Supports unigram, bigram, and trigram matches.
- Filters results below a minimum relevance threshold.
Code Reference: See `SearchEngine.py`.
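A compact sketch of the engine's core ideas follows, with made-up documents and without the bigram/trigram handling and relevance threshold of the real SearchEngine; the IDF variant and variable names are illustrative assumptions:

```python
# Inverted index + TF-IDF vectors compared by cosine similarity.
import math
from collections import defaultdict

docs = {
    0: ["pc", "desktop", "intel", "i7"],
    1: ["notebook", "intel", "i7", "ssd"],
    2: ["monitor", "led", "hdmi"],
}

inverted = defaultdict(set)            # term -> ids of docs containing it
for doc_id, tokens in docs.items():
    for t in tokens:
        inverted[t].add(doc_id)

def idf(term):
    # smoothed inverse document frequency
    return math.log(len(docs) / (1 + len(inverted[term]))) + 1

def tfidf(tokens):
    vec = defaultdict(float)
    for t in tokens:
        vec[t] += idf(t) / len(tokens)  # tf * idf
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query_tokens, top_k=5):
    # the inverted index restricts scoring to candidate documents only
    candidates = set().union(*(inverted[t] for t in query_tokens if t in inverted))
    q = tfidf(query_tokens)
    scored = [(doc_id, cosine(q, tfidf(docs[doc_id]))) for doc_id in candidates]
    return sorted(scored, key=lambda x: -x[1])[:top_k]

print(search(["intel", "i7", "ssd"]))  # doc 1 ranks first
```

Document 2 shares no terms with the query, so the inverted index never scores it at all, which is exactly the optimization the class description refers to.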
The LDAAnalyzer class applies Latent Dirichlet Allocation (LDA) to identify themes in product descriptions:
- Groups similar products into topics.
- Displays top words in each topic and visualizes the results.
Code Reference: See `LDAAnalyzer.py`.
- Install the dependencies required for the project using the `requirements.txt` file: `pip install -r requirements.txt`
- Execute the main script `main_amazon.py` with your desired options. Below are example commands and the expected functionality:
The following commands illustrate the main functionalities of the Amazon Product Search task. Each command includes sample console output and images demonstrating the results.
To scrape Amazon products for specific keyword(s) (e.g., "laptop, pc") and save the results to a `.tsv` file:
python main_amazon.py --scrape --keyword "laptop, pc" --num_pages 3
Console Output:
Scraping Amazon products...
Scraping keyword laptop...
Scraping page 1...
Scraping page 2...
Scraping page 3...
216 products found
Scraping keyword pc...
Scraping page 1...
Scraping page 2...
Scraping page 3...
432 products found
Data saved to file: data/raw/laptop_results_2024-11-15.tsv
This command scrapes 3 pages of results for each keyword ("laptop" and "pc") and saves them in `data/raw/laptop_results_2024-11-15.tsv`.
The naming convention for this tsv file is: `<first_keyword>_results_<current_date>.tsv`
To load a pre-scraped dataset and preprocess the product descriptions, use the `--path` option:
python main_amazon.py --path data/raw/laptop_results_2024-11-15.tsv --keyword "laptop, pc"
Console Output:
Loading Amazon data from data/raw/laptop_results_2024-11-15.tsv...
Data loaded successfully.
Preprocessing data with standard processing...
This command loads and preprocesses data from the specified `.tsv` file without additional scraping.
To analyze word, bigram, and trigram frequencies and generate visualizations for the most common terms, run:
python main_amazon.py --path data/raw/laptop_results_2024-11-15.tsv --plot_frequency --top_words 10 --top_bigrams 10 --top_trigrams 10
Console Output:
Loading Amazon data from data/raw/laptop_results_2024-11-15.tsv...
Data loaded successfully.
Preprocessing data with standard processing...
Running word frequency analysis and plotting...
Top words by frequency: [('notebook', 59), ('display', 39), ('intel', 38), ...]
Top bigrams by frequency: [(('display', 'fhd'), 10), (('display', 'full_hd'), 10), ...]
Top trigrams by frequency: [(('libre_off', 'pronto', 'alluso'), 5), ('notebook', 'alluminio', 'monitor'), ...]
Visualizations:
These images illustrate the most frequent words, bigrams, and trigrams found in the product descriptions.
To perform Latent Dirichlet Allocation (LDA) for topic modeling and identify common themes within the product descriptions, run:
python main_amazon.py --path data/raw/laptop_results_2024-11-15.tsv --run_lda --num_topics 5 --passes 15
Console Output:
Loading Amazon data from data/raw/laptop_results_2024-11-15.tsv...
Data loaded successfully.
Preprocessing data with standard processing...
Running LDA topic modeling...
Topic 1: 0.037*"display" + 0.036*"garanzia" + ...
Topic 2: 0.035*"intel" + 0.027*"wifi" + ...
...
Visualization:
This image visualizes the extracted topics and their top words.
To search for products related to a specific query and retrieve the top 5 results based on cosine similarity:
python main_amazon.py --path data/raw/laptop_results_2024-11-15.tsv --run_search --query "Intel Core i7 SSD HD Ram 16Gb" --top_k 5
Console Output:
Loading Amazon data from data/raw/laptop_results_2024-11-15.tsv...
Data loaded successfully.
Preprocessing data with standard processing...
Using non-Spark (original) search engine...
Indexing complete.
Top search results:
Document ID: 5, Score: 0.2735, Description: pc fisso computer desktop intel_core_i7 intel hd masterizz wifi interno ...
Document ID: 173, Score: 0.1756, Description: jumper computer portatile hd display office_365 ...
...
The command displays the top 5 search results with their relevance scores and descriptions.
- If the `--path` argument is omitted, the program defaults to loading data from `data/raw/computer_results_default.tsv`.
- Use the `--top_k` flag to specify the number of search results returned.
This project was implemented on Windows, and the Spark configuration on Windows was done based on How to Install Apache Spark on Windows.
Assume we have scraped the products using the commands described in the Scraping Data section, so we have a `.tsv` file in the `data/raw` directory; we call it our "default" scraped data.
To incorporate Spark into the previous task, we can do two main things:
Spark's distributed capabilities allow us to preprocess product descriptions efficiently, even for large datasets. The preprocessing steps include tokenization, removal of stopwords, lemmatization, and more, as defined in the `preprocess_with_pyspark` function.
Implementation Steps:
- Convert the product dataset into a Spark DataFrame.
- Apply text preprocessing using Spark's UDFs combined with NLTK.
- Export the processed descriptions for downstream tasks.
Code Reference: `SparkPreprocessing.py` provides the implementation of these preprocessing functions using PySpark and NLTK.
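The UDF approach can be sketched as follows. The pure-Python cleaning function is what a UDF would wrap; the Spark wiring is shown in comments so the sketch runs without a cluster, and names such as `clean_description` and the stopword subset are illustrative:

```python
# Sketch of a cleaning function suitable for wrapping in a Spark UDF.
import re

STOPWORDS = {"the", "a", "and", "with", "per", "con"}  # illustrative subset

def clean_description(text):
    # lowercase, tokenize, and drop stopwords; the real pipeline also
    # applies NLTK lemmatization/stemming at this point
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# With PySpark this function would be registered as a UDF, roughly:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#   clean_udf = udf(clean_description, ArrayType(StringType()))
#   df = df.withColumn("tokens", clean_udf(df["description"]))

print(clean_description("Notebook with Intel Core i7 and 16GB RAM"))
```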
Using Spark for building a search engine involves indexing product descriptions and efficiently handling queries. This is done using:
- TF-IDF for feature extraction.
- Cosine similarity for query matching.
- An inverted index for optimizing search performance.
Implementation Steps:
- Tokenize and preprocess product descriptions.
- Use Spark MLlib's TF-IDF implementation to transform tokens into numerical vectors.
- Calculate cosine similarity between query vectors and product vectors to rank search results.
Code Reference: The class SparkSearchEngine in `SparkSearchEngine.py` provides an implementation of these features.
Below are some example commands for various tasks, along with sample output and images of results.
To preprocess and search the data using Spark, use the following command:
python main_amazon.py --use_pyspark --run_search --keyword "laptop, pc" --query "HP Notebook G9 Intel i3-1215u 6 Core 4,4 Ghz 15,6 Full Hd, Ram 16Gb Ddr4, Ssd Nvme 756Gb M2, Hdmi, Usb 3.0, Wifi, Lan,Bluetooth, Webcam, Windows 11 Professional,Libre Office" --top_k 5
Output:
Loading Amazon data from data/raw/computer_results_default.tsv...
Data loaded successfully.
Preprocessing data with PySpark...
Using Spark for search and indexing...
Top search results:
Document ID: 371, Score: 0.5037001371383667, Description: hp notebook g9 intel i31215u 6 core 44 ghz 156 full hd ram 16gb ddr4 ssd nvme 756gb m2 hdmi usb 30 wifi lan bluetooth webcam window 11 professional libre office
Document ID: 6, Score: 0.5037001371383667, Description: hp notebook g9 intel i31215u 6 core 44 ghz 156 full hd ram 16gb ddr4 ssd nvme 756gb m2 hdmi usb 30 wifi lan bluetooth webcam window 11 professional libre office
Document ID: 655, Score: 0.5037001371383667, Description: hp notebook g9 intel i31215u 6 core 44 ghz 156 full hd ram 16gb ddr4 ssd nvme 756gb m2 hdmi usb 30 wifi lan bluetooth webcam window 11 professional libre office
Document ID: 147, Score: 0.4100857973098755, Description: notebook hp g9 intel i31215u 6 core 44 ghz 156 full hd ram 8gb ddr4 ssd nvme 256gb m2 hdmi usb 30 wifi lan bluetooth webcam window 11 professional libre office
Document ID: 436, Score: 0.4100857973098755, Description: notebook hp g9 intel i31215u 6 core 44 ghz 156 full hd ram 8gb ddr4 ssd nvme 256gb m2 hdmi usb 30 wifi lan bluetooth webcam window 11 professional libre office
To generate the most common words, bigrams, and trigrams, use the following command:
python main_amazon.py --use_pyspark --keyword "laptop, pc" --plot_frequency --top_words 10 --top_bigrams 10 --top_trigrams 10
Output:
Top 10 Most Common Words: [('pc', 561), ('ssd', 557), ('ram', 496), ('11', 427), ('pro', 414), ('computer', 397), ('window', 366), ('amd', 342), ('intel', 330), ('core', 319)]
Top 10 Most Common Bigrams: [(('window', '11'), 276), (('intel', 'core'), 239), (('11', 'pro'), 237), (('pc', 'portatile'), 229), (('amd', 'ryzen'), 222), (('display', '156'), 183), (('core', 'i5'), 167), (('1', 'tb'), 157), (('win', '11'), 151), (('ryzen', '5'), 146)]
Top 10 Most Common Trigrams: [(('intel', 'core', 'i5'), 163), (('amd', 'ryzen', '5'), 146), (('156', 'full', 'hd'), 137), (('window', '11', 'pro'), 134), (('processore', 'amd', 'ryzen'), 118), (('core', 'i5', '12th'), 103), (('display', '156', 'full'), 103), (('win', '11', 'pro'), 103), (('da', '1', 'tb'), 103), (('window', '11', 'home'), 102)]
Images:
To perform topic modeling on the data using LDA, run the following command:
python main_amazon.py --use_pyspark --keyword "laptop, pc" --run_lda --num_topics 5 --passes 15
Output:
Loading Amazon data from data/raw/computer_results_default.tsv...
Data loaded successfully.
Preprocessing data with PySpark...
Running LDA topic modeling...
Topic 1: 0.043*"da" + 0.036*"pro" + 0.035*"wifi" + 0.033*"gb" + 0.033*"1" + 0.033*"tb" + 0.031*"intel" + 0.027*"hdmi" + 0.024*"core" + 0.022*"fisso"
Topic 2: 0.038*"1tb" + 0.036*"mini" + 0.030*"mouse" + 0.027*"tastiera" + 0.024*"portatile" + 0.023*"wifi" + 0.023*"pollici" + 0.022*"win" + 0.022*"notebook" + 0.022*"14"
Topic 3: 0.058*"pro" + 0.043*"intel" + 0.037*"hp" + 0.035*"i5" + 0.035*"core" + 0.034*"250" + 0.034*"portatile" + 0.028*"16gb" + 0.026*"office" + 0.026*"g9"
Topic 4: 0.063*"amd" + 0.042*"gb" + 0.037*"processore" + 0.032*"radeon" + 0.032*"ddr5" + 0.032*"ryzen" + 0.031*"8" + 0.027*"home" + 0.027*"display" + 0.021*"mini"
Topic 5: 0.124*"scrivania" + 0.075*"con" + 0.050*"per" + 0.050*"di" + 0.050*"cm" + 0.026*"mouse" + 0.025*"156" + 0.025*"led" + 0.025*"desktop" + 0.025*"gaming"
Image:
This section describes the steps and methodology for predicting flight delays using PySpark, starting from data preprocessing, exploratory data analysis (EDA), feature engineering, and finally building machine learning models for binary classification.
Given a dataset of flights, create a model that predicts whether a flight is delayed by more than 15 minutes.
More precisely, we will create a model predicting whether a flight departs with more than 15 minutes of delay.
To work with the flight delay dataset, you first need to download and load it into your environment. Follow these steps:
To download the dataset, execute the following command:
python main_flight.py download
When you pass the `download` action to the `main_flight.py` script, the flight dataset is fetched using the Kaggle API and stored in the `data/raw` directory. After a successful download, the files are extracted, and you will find the following in the `data/raw` directory:
- `flights_sample_3m.csv`: The actual dataset containing flight delay and cancellation information.
- `dictionary.html`: A metadata file with general information about the dataset, including column descriptions and data types.
After executing the command, you will see this output in the console:
Executing action: download
Starting download from Kaggle...
Dataset URL: https://www.kaggle.com/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023
Dataset downloaded and extracted to data/
To load the dataset into your program, execute:
python main_flight.py load
By default, this command loads the file located at `data/raw/flights_sample_3m.csv`. Once downloaded, there is no need to download the dataset again unless the file is deleted or a new version is required.
The Kaggle API is used to download the dataset. To enable its usage:
- Create an account on Kaggle and generate an API Token (`kaggle.json`).
- Place the `kaggle.json` file in the following location:
  - Linux/Mac: `~/.kaggle/kaggle.json`
  - Windows: `%HOMEPATH%\.kaggle\kaggle.json`

If the `kaggle.json` file is not set up correctly, you will encounter the following error:
OSError: Could not find kaggle.json. Make sure it's located in ~/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/
- Once downloaded, you do not need to re-download the dataset unless explicitly required.
- The `load` command always loads the `data/raw/flights_sample_3m.csv` file for subsequent analysis or processing.
The first step when dealing with an ML problem is analyzing the data. In this section we perform a thorough investigation of the data and its structure.
- Departure Delays:
Observations:
- Most flights have minimal or no departure delays, as evident from the sharp peak near zero.
- A small proportion of flights exhibit significant delays exceeding 500 minutes.
- The mean departure delay is marked in the plot, providing a central tendency for delays.
- Arrival Delays:
Observations:
- The arrival delay distribution mirrors the departure delay distribution, with most flights clustered around minimal or no delay.
- A small number of flights experience extreme arrival delays.
- The mean arrival delay is also highlighted for better interpretability.
- Average Delays by Airline:
Observations:
- Airlines such as Allegiant Air and Frontier Airlines have the highest average delays, while others like Endeavor Air exhibit minimal delays.
- Departure delays tend to be slightly higher than arrival delays for most airlines.
- Cancellation Rates by Airline:
Observations:
- Airlines such as Frontier Airlines and Southwest Airlines have relatively high cancellation rates compared to others.
- Airlines with low delays may still exhibit notable cancellation rates, highlighting different operational challenges.
- Delays by Hour of the Day:
Observations:
- Delays are more frequent during the early morning and late evening hours, likely due to congestion and operational inefficiencies.
- Mid-day flights tend to have shorter delays, reflecting smoother operations during these hours.
- Delays by Day of the Week:
Observations:
- Delays are higher towards the end of the week, peaking on Thursdays and Fridays.
- This could be attributed to higher travel volumes and operational strain during these days.
- Correlation Heatmap for Numerical Features:
Observations:
- Departure and arrival delays are highly correlated, indicating that departure delays directly contribute to arrival delays.
- Features like distance and weather-related delays show moderate correlations with the delay metrics, providing insights into contributing factors.
- Cancellation Reasons:
Observations:
- The most common cancellation reasons include carrier-related issues and weather conditions.
- Security-related cancellations are comparatively rare, reflecting their infrequent occurrence.
After splitting the dataset into train (80%) and test (20%) sets, it's important to verify the distribution of labels in both subsets to ensure consistency and fairness in model evaluation. Note that due to limited computational resources, only 50% of the dataset was used for this analysis.
+-----+------+
|label| count|
+-----+------+
| 1|206406|
| 0|996014|
+-----+------+
- The training set contains 206,406 flights labeled as delayed (label = 1) and 996,014 flights labeled as non-delayed (label = 0).
- This results in a 17.2% delay rate, with a roughly consistent proportion of delayed to non-delayed flights.
+-----+------+
|label| count|
+-----+------+
| 1| 51390|
| 0|249059|
+-----+------+
- The test set contains 51,390 delayed flights (label = 1) and 249,059 non-delayed flights (label = 0).
- The delay rate is similar to that of the training set, ensuring consistency in the distribution.
- Consistent Proportions: Both train and test sets have nearly identical proportions of delayed to non-delayed flights, ensuring that the model is trained and evaluated under comparable conditions.
- Impact of Imbalance: Although the dataset is imbalanced, the consistent proportions across subsets reduce the potential impact of this imbalance on model training and evaluation. Many machine learning algorithms, particularly those using cross-validation and metrics like AUC, are resilient to such imbalance. Moreover, this mirrors what we expect (and hope) to see in the real world: most flights depart on time, and only a small portion of them are delayed.
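The reported rates can be verified directly from the label counts above:

```python
# Verifying the delay rates from the train/test label counts.
train_delayed, train_on_time = 206_406, 996_014
test_delayed, test_on_time = 51_390, 249_059

train_rate = train_delayed / (train_delayed + train_on_time)
test_rate = test_delayed / (test_delayed + test_on_time)

print(f"train: {train_rate:.1%}, test: {test_rate:.1%}")  # train: 17.2%, test: 17.1%
```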
To ensure data integrity, the script performs the following actions for identifying and managing missing values:
You can identify missing data by using the `check_missing` command:
python main_flight.py load check_missing
This command computes and displays the number of `NaN` or `null` values for each column in the dataset. For instance, the output below indicates the presence of missing values in several columns such as `DEP_TIME`, `DEP_DELAY`, `ARR_TIME`, and specific delay-related columns like `DELAY_DUE_CARRIER`:
Executing action: check_missing
+-------+-------+-----------+------------+--------+---------+------+-----------+----+---------+------------+--------+---------+--------+----------+---------+-------+------------+--------+---------+---------+-----------------+--------+----------------+------------+--------+--------+-----------------+-----------------+-------------+------------------+-----------------------+
|FL_DATE|AIRLINE|AIRLINE_DOT|AIRLINE_CODE|DOT_CODE|FL_NUMBER|ORIGIN|ORIGIN_CITY|DEST|DEST_CITY|CRS_DEP_TIME|DEP_TIME|DEP_DELAY|TAXI_OUT|WHEELS_OFF|WHEELS_ON|TAXI_IN|CRS_ARR_TIME|ARR_TIME|ARR_DELAY|CANCELLED|CANCELLATION_CODE|DIVERTED|CRS_ELAPSED_TIME|ELAPSED_TIME|AIR_TIME|DISTANCE|DELAY_DUE_CARRIER|DELAY_DUE_WEATHER|DELAY_DUE_NAS|DELAY_DUE_SECURITY|DELAY_DUE_LATE_AIRCRAFT|
+-------+-------+-----------+------------+--------+---------+------+-----------+----+---------+------------+--------+---------+--------+----------+---------+-------+------------+--------+---------+---------+-----------------+--------+----------------+------------+--------+--------+-----------------+-----------------+-------------+------------------+-----------------------+
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 77615| 77644| 78806| 78806| 79944| 79944| 0| 79942| 86198| 0| 2920860| 0| 14| 86198| 86198| 0| 2466137| 2466137| 2466137| 2466137| 2466137|
+-------+-------+-----------+------------+--------+---------+------+-----------+----+---------+------------+--------+---------+--------+----------+---------+-------+------------+--------+---------+---------+-----------------+--------+----------------+------------+--------+--------+-----------------+-----------------+-------------+------------------+-----------------------+
After identifying missing data, the `handle_missing_values` function is executed to clean and preprocess the dataset. This function applies the following strategies:
- Imputation for Delays:
  - Missing values in `DEP_DELAY` and `ARR_DELAY` are imputed with `0.0`, assuming that missing delay information indicates no delay.
  - Example: If a row lacks departure delay data, it will be filled with `0.0`.
- Dropping Rows with Critical Missing Information:
  - Rows with missing values in essential columns such as `AIRLINE`, `ORIGIN`, `DEST`, `CRS_DEP_TIME`, `DISTANCE`, etc., are dropped.
  - These columns are fundamental for modeling delays; missing values here could compromise the quality of predictions.
- Imputation for Delay-Related Columns:
  - Columns representing specific types of delays (`DELAY_DUE_CARRIER`, `DELAY_DUE_WEATHER`, `DELAY_DUE_NAS`, `DELAY_DUE_SECURITY`, `DELAY_DUE_LATE_AIRCRAFT`) are imputed with `0.0`, assuming no delay occurred if the value is missing.
  - Example: If `DELAY_DUE_CARRIER` is missing, it will be set to `0.0`.
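The three strategies can be sketched in plain Python; the project applies the same logic with PySpark's `fillna` and `dropna` on the full DataFrame, and the column subsets and sample rows below are illustrative:

```python
# Stdlib illustration of the missing-value strategies described above.
DELAY_COLS = ["DEP_DELAY", "ARR_DELAY", "DELAY_DUE_CARRIER",
              "DELAY_DUE_WEATHER", "DELAY_DUE_NAS",
              "DELAY_DUE_SECURITY", "DELAY_DUE_LATE_AIRCRAFT"]
CRITICAL_COLS = ["AIRLINE", "ORIGIN", "DEST", "CRS_DEP_TIME", "DISTANCE"]

def handle_missing(rows):
    cleaned = []
    for row in rows:
        # drop rows missing any critical column
        if any(row.get(c) is None for c in CRITICAL_COLS):
            continue
        # impute missing delay values with 0.0 (assume no delay was recorded)
        for c in DELAY_COLS:
            if row.get(c) is None:
                row[c] = 0.0
        cleaned.append(row)
    return cleaned

rows = [
    {"AIRLINE": "AA", "ORIGIN": "JFK", "DEST": "LAX",
     "CRS_DEP_TIME": 900, "DISTANCE": 2475.0, "DEP_DELAY": None},
    {"AIRLINE": None, "ORIGIN": "ORD", "DEST": "SFO",
     "CRS_DEP_TIME": 1200, "DISTANCE": 1846.0, "DEP_DELAY": 12.0},
]
print(handle_missing(rows))  # keeps the first row (imputed), drops the second
```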
After addressing missing values, the next steps involve preparing the dataset for machine learning by performing *feature engineering* and binary label creation:
- Feature Selection:
  - Relevant features for delay prediction are selected, including:
    - Time-related features: `CRS_DEP_TIME` (scheduled departure time) and `DISTANCE` (flight distance).
    - Delay-related causes: `DELAY_DUE_CARRIER`, `DELAY_DUE_WEATHER`, `DELAY_DUE_NAS`, `DELAY_DUE_SECURITY`, and `DELAY_DUE_LATE_AIRCRAFT`.
    - Categorical features: `AIRLINE`, `ORIGIN`, and `DEST`.
- Categorical Feature Encoding:
  - Categorical columns (`AIRLINE`, `ORIGIN`, `DEST`) are processed using:
    - StringIndexer: Converts categorical string values to numerical indices.
    - OneHotEncoder: Converts the indexed values into binary vectors for use in machine learning.
- Feature Assembly:
  - The selected numerical features and the encoded categorical vectors are combined into a single feature vector using `VectorAssembler`.
  - This unified representation (`features`) is essential for model training.
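The StringIndexer → OneHotEncoder → VectorAssembler chain can be sketched in plain Python: index a categorical value, one-hot encode it, then concatenate it with the numerical features into one flat vector. Spark MLlib does the same with sparse vectors and fitted pipeline stages; the data below is illustrative:

```python
# Stdlib sketch of categorical encoding and feature assembly.
def string_index(values):
    # map each distinct category to an integer index (like StringIndexer)
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(index, size):
    # binary indicator vector (like OneHotEncoder, minus sparsity)
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

airlines = ["AA", "DL", "UA", "AA"]
airline_idx = string_index(airlines)            # {'AA': 0, 'DL': 1, 'UA': 2}

row = {"CRS_DEP_TIME": 900.0, "DISTANCE": 2475.0, "AIRLINE": "DL"}
numerical = [row["CRS_DEP_TIME"], row["DISTANCE"]]
encoded = one_hot(airline_idx[row["AIRLINE"]], len(airline_idx))

features = numerical + encoded                  # assembled feature vector
print(features)  # [900.0, 2475.0, 0.0, 1.0, 0.0]
```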
But why are these features (columns) selected while others are rejected?
The reason is that these features provide a balance of temporal, spatial, and causal information, which are crucial for predicting delays.
- Time and distance account for operational and logistical factors affecting delays.
- Categorical features capture unique characteristics tied to airlines and airports.
- The delay-related columns directly explain known factors contributing to delays, making them predictive.
Here is why, from my point of view, each of these features was selected:
- Time-related Features:
  - `CRS_DEP_TIME` (scheduled departure time): Helps capture patterns in delays based on time of day.
  - `DISTANCE` (flight distance): Longer flights may face different delay dynamics compared to shorter flights.
- Delay-related Causes:
  - `DELAY_DUE_CARRIER`, `DELAY_DUE_WEATHER`, `DELAY_DUE_NAS`, `DELAY_DUE_SECURITY`, and `DELAY_DUE_LATE_AIRCRAFT`: These columns provide detailed reasons for delays, directly informing the prediction model.
- Categorical Features:
  - `AIRLINE`, `ORIGIN`, and `DEST`: These features capture airline-specific, origin-specific, and destination-specific patterns in delays.
Because there is no column in the dataset indicating "more than 15 minutes of departure delay", we need to create one. How?
We create a new column named `label` for each row, which is 1 (True) if the departure was delayed by more than 15 minutes and 0 (False) otherwise.
This binary label enables the model to focus on identifying significant delays, simplifying the classification task into a binary decision-making problem.
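The labeling rule itself is tiny. Here is a sketch; the PySpark one-liner in the comment is an assumed equivalent, not a quote from the project's code:

```python
# Sketch of the binary labeling rule. With PySpark it is roughly:
#   from pyspark.sql.functions import col
#   df = df.withColumn("label", (col("DEP_DELAY") > 15).cast("integer"))
def make_label(dep_delay, threshold=15.0):
    # 1 = departed more than `threshold` minutes late, 0 = on time or minor delay
    return 1 if dep_delay is not None and dep_delay > threshold else 0

print([make_label(d) for d in (0.0, 14.9, 15.0, 47.5, None)])  # [0, 0, 0, 1, 0]
```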
To train and evaluate the models run the following commands:
Logistic Regression:
python main_flight.py load train_evaluate_logistic_regression
Random Forest:
python main_flight.py load train_evaluate_random_forest
Neural Network:
python main_flight.py load train_evaluate_neural_network
Gradient Boosted Trees:
python main_flight.py load train_evaluate_gradient_boosted_trees
This section describes the training, tuning, and evaluation process for the Logistic Regression, Random Forest, Neural Network, and Gradient Boosted Tree models used in this project to predict flight delays.
When any of the `train_evaluate` commands is executed, the following steps occur:
- Hyperparameter Tuning:
  - Logistic Regression:
    - Regularization parameter (`regParam`) is tuned to balance overfitting and underfitting.
    - Elastic Net mixing ratio (`elasticNetParam`) is tuned to combine L1 and L2 regularization for optimal performance.
  - Random Forest:
    - Number of trees (`numTrees`) and maximum depth of trees (`maxDepth`) are tuned to improve performance while avoiding overfitting.
  - Neural Network:
    - Learning rate (`lr`) and number of epochs are set to ensure convergence during training.
    - Network architecture (number of layers and neurons) is predefined to balance complexity and computational requirements.
  - Gradient Boosted Trees:
    - Number of boosting iterations, learning rate, and maximum depth of individual trees are tuned to optimize accuracy and generalization.
- Cross-Validation:
  - Five-fold cross-validation is applied during model training to select the best hyperparameter settings based on the Area Under the ROC Curve (AUC).
- Final Training:
  - The best-performing model configuration from cross-validation is trained on the entire training dataset.
- Prediction on Test Data:
  - The trained model is used to make predictions on the test dataset for evaluation.
- Metrics Computed:
  - The following key metrics are calculated to evaluate model performance:
    - AUC (Area Under the Curve): Measures the ability to distinguish between delayed and non-delayed flights.
    - Accuracy: Represents the overall percentage of correct predictions.
    - Precision: Indicates how many predicted delays are actually delayed flights.
    - Recall: Measures the proportion of actual delayed flights correctly predicted.
    - F1-Score: A harmonic mean of precision and recall, providing a balanced measure.
  - A confusion matrix is generated to provide detailed insights into true positives, true negatives, false positives, and false negatives.
- Model-Specific Analysis:
  - Logistic Regression: Evaluated on its ability to provide linear insights and a balance between precision and recall.
  - Random Forest: Feature importance is computed to identify key factors contributing to delays.
  - Neural Network: Assessed on its ability to capture complex non-linear patterns in data, as indicated by high AUC and accuracy.
  - Gradient Boosted Trees: Emphasized for its robustness and resilience to overfitting, achieving exceptional metrics across the board.
- ROC Curve:
  - The ROC curve is plotted for each model to visualize its trade-off between true positive rate (TPR) and false positive rate (FPR) across various thresholds.
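For intuition about the cross-validation step, here is a stdlib sketch of how five-fold splits are formed; the project itself uses Spark MLlib's CrossValidator with a parameter grid over the hyperparameters listed above, selecting the grid point with the highest average AUC:

```python
# Stdlib sketch of k-fold splitting (the project uses MLlib's CrossValidator).
def kfold(indices, k=5):
    folds = [indices[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        valid = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, valid

splits = list(kfold(list(range(10)), k=5))
print(len(splits))   # 5 train/validation splits
print(splits[0][1])  # first validation fold: [0, 5]
```

Each candidate hyperparameter setting would be trained on every `train` partition and scored on the matching `valid` partition, and the average score decides the winner.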
A neural network model was introduced for flight delay prediction, providing a more flexible and powerful approach to capture complex, non-linear relationships between features and labels.
This model training and evaluation can be executed with the following command:
python main_flight.py load train_evaluate_neural_network
- Input Layer: Takes in the feature vector (`input_dim`).
- Hidden Layers:
  - First layer: 128 neurons with ReLU activation.
  - Second layer: 64 neurons with ReLU activation.
- Output Layer: A single neuron with a Sigmoid activation function to produce probabilities for binary classification.
- Loss Function: Binary Cross-Entropy Loss (BCELoss), which is suitable for binary classification tasks.
- Optimizer: Adam optimizer with a learning rate of `0.001`.
- The model was trained for 10 epochs with a batch size of 32.
- Loss values per epoch:
Epoch 1/10, Loss: 0.21436312794685364
Epoch 2/10, Loss: 0.12203751504421234
Epoch 3/10, Loss: 0.1563258320093155
Epoch 4/10, Loss: 0.203612819314003
Epoch 5/10, Loss: 0.14350281655788422
Epoch 6/10, Loss: 0.14558719098567963
Epoch 7/10, Loss: 0.13873544335365295
Epoch 8/10, Loss: 0.1565149873495102
Epoch 9/10, Loss: 0.29420629143714905
Epoch 10/10, Loss: 0.16712436079978943
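For intuition, the forward pass of this 128 → 64 → 1 architecture can be sketched without any ML framework. The weights below are random stand-ins for trained parameters; the real model is trained with BCELoss and Adam as described above:

```python
# Dependency-free sketch of the network's forward pass (untrained weights).
import math
import random

random.seed(0)

def linear(inputs, n_out):
    # a dense layer with random weights (illustration only)
    return [sum(x * random.uniform(-0.1, 0.1) for x in inputs)
            for _ in range(n_out)]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features):
    h1 = relu(linear(features, 128))   # first hidden layer
    h2 = relu(linear(h1, 64))          # second hidden layer
    return sigmoid(linear(h2, 1)[0])   # delay probability in (0, 1)

p = forward([0.5] * 20)
print(0.0 < p < 1.0)  # True: the output is always a valid probability
```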
A little literature about the models used in this project: what are they used for, and how should they be evaluated?
- Logistic Regression:
- Suitable for linear relationships between features and the label.
- Highly interpretable and computationally efficient, making it ideal for simpler datasets or use cases where explainability is crucial.
- Evaluated using AUC, accuracy, precision, recall, and F1-score, reflecting its ability to classify delayed and non-delayed flights accurately.
- Random Forest:
- Effective for capturing non-linear relationships and interactions between features.
- Offers interpretability through feature importance scores, helping to identify key factors influencing flight delays.
- Robust to overfitting and useful for complex datasets.
- Evaluated using the same metrics as Logistic Regression, with additional insights from feature importance analysis.
- Neural Network:
- Handles non-linear relationships more effectively than Logistic Regression or Random Forest.
- Flexible architecture allows for complex patterns to be captured and enables future extensions, such as incorporating new features or tuning hyperparameters.
- Requires more computational resources and is less interpretable than Logistic Regression or Random Forest.
- Evaluated using AUC, accuracy, precision, recall, and F1-score, with emphasis on balancing precision and recall due to its higher flexibility.
- Gradient Boosted Trees:
- Combines the strengths of multiple decision trees through an iterative boosting process, resulting in high accuracy and resilience to overfitting.
- Excels at capturing complex patterns in data while maintaining interpretability through feature importance analysis.
- Evaluated using the same metrics, with particular strength in achieving high AUC and F1-scores.
Finally, it is time to evaluate and compare the trained models, starting with Logistic Regression and Random Forest.
- Performance Metrics (Logistic Regression):
    AUC: 0.9235
    Accuracy: 91.60%
    Precision: 92.17%
    Recall: 91.60%
    F1-Score: 90.50%
- The high AUC (0.9235) indicates that the Logistic Regression model is effective at distinguishing between delayed and non-delayed flights.
- The Accuracy (91.60%) indicates that most flights are classified correctly, though this figure is helped by the large majority of non-delayed flights.
- Precision (92.17%) shows that the model is good at minimizing false positives: most flights predicted as delayed are actually delayed.
- The Recall (91.60%, weighted) looks strong, but the confusion matrix below shows that nearly half of the delayed flights are missed; the weighted average is dominated by the majority class.
- The F1-Score (90.50%) balances precision and recall, confirming solid overall performance.
- Confusion Matrix:
    +-----+----------+------+
    |label|prediction| count|
    +-----+----------+------+
    |    0|       0.0|248525|
    |    0|       1.0|   534|
    |    1|       0.0| 24714|
    |    1|       1.0| 26676|
    +-----+----------+------+
- True Negatives (248,525): Non-delayed flights correctly identified.
- False Positives (534): Flights incorrectly predicted as delayed.
- True Positives (26,676): Delayed flights correctly identified.
- False Negatives (24,714): Delayed flights incorrectly predicted as non-delayed.
- Interpretation:
- The model produces very few false positives but struggles with false negatives: nearly half of the delayed flights (24,714 of 51,390) are missed, which limits its ability to catch delays.
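The reported figures can be reproduced directly from the confusion matrix counts. The sketch below (pure Python) assumes the precision, recall, and F1 values are the weighted averages produced by Spark's `MulticlassClassificationEvaluator`, i.e. per-class metrics weighted by class support; note that under this convention weighted recall coincides with accuracy:

```python
# Reproduce the weighted metrics from the Logistic Regression confusion
# matrix above. "Weighted" means per-class metrics averaged by class
# support, which is assumed to match how the project's evaluator works.

def weighted_metrics(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    support = {0: tn + fp, 1: fn + tp}      # true counts per class
    predicted = {0: tn + fn, 1: fp + tp}    # predicted counts per class
    correct = {0: tn, 1: tp}

    accuracy = (tn + tp) / total
    precision = recall = f1 = 0.0
    for c in (0, 1):
        p = correct[c] / predicted[c] if predicted[c] else 0.0
        r = correct[c] / support[c] if support[c] else 0.0
        w = support[c] / total
        precision += w * p
        recall += w * r
        f1 += w * (2 * p * r / (p + r) if p + r else 0.0)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = weighted_metrics(tn=248525, fp=534, fn=24714, tp=26676)
print(f"Accuracy {acc:.2%}  Precision {prec:.2%}  Recall {rec:.2%}  F1 {f1:.2%}")
# Matches the reported 91.60% / 92.17% / 91.60% / 90.50%
```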
- ROC Curve Visualization:
- Performance Metrics (Random Forest):
    AUC: 0.9229
    Accuracy: 90.45%
    Precision: 91.04%
    Recall: 90.45%
    F1-Score: 88.99%
- The AUC (0.9229) is nearly identical to Logistic Regression, indicating similar capability in distinguishing delays.
- The Accuracy (90.45%) is slightly lower than Logistic Regression.
- Precision (91.04%) remains high, but lower than Logistic Regression, showing more false positives.
- Recall (90.45%) is also slightly lower, meaning some delayed flights are missed.
- The F1-Score (88.99%) suggests a slight drop in overall balance compared to Logistic Regression.
- Confusion Matrix:
    +-----+----------+------+
    |label|prediction| count|
    +-----+----------+------+
    |    0|       0.0|248201|
    |    0|       1.0|   858|
    |    1|       0.0| 27829|
    |    1|       1.0| 23561|
    +-----+----------+------+
- True Negatives (248,201): Non-delayed flights correctly identified.
- False Positives (858): More flights incorrectly predicted as delayed compared to Logistic Regression.
- True Positives (23,561): Fewer delayed flights correctly identified compared to Logistic Regression.
- False Negatives (27,829): Higher number of delayed flights missed compared to Logistic Regression.
- Interpretation:
- Random Forest is slightly less precise and has a higher false negative rate than Logistic Regression, leading to fewer delayed flights being captured.
- Feature Importance Visualization:
- The Random Forest model allows interpretation of feature importance, which can guide understanding of the factors most predictive of delays.
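Turning the fitted model's importance vector into a ranked list is straightforward; in the project this vector would come from a fitted `RandomForestClassificationModel`'s `featureImportances` attribute, but the feature names and values in this sketch are hypothetical placeholders:

```python
# Rank features by importance score, highest first. The names and values
# below are illustrative placeholders, not the project's actual results.

def rank_features(names, importances):
    """Pair feature names with importance scores, sorted descending."""
    return sorted(zip(names, importances), key=lambda kv: kv[1], reverse=True)

names = ["DEP_DELAY", "DISTANCE", "CRS_DEP_TIME", "AIRLINE"]  # hypothetical
importances = [0.55, 0.10, 0.25, 0.10]                        # hypothetical
for name, score in rank_features(names, importances):
    print(f"{name:>12}: {score:.2f}")
```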
The neural network was evaluated on the test dataset, and the following metrics were observed:
- AUC: 0.926
- Accuracy: 95.51%
- Precision: 95.24%
- Recall: 77.46%
- F1-Score: 85.43%
- Performance Metrics (Gradient Boosted Trees):
    AUC: 0.9305
    Accuracy: 95.67%
    Precision: 95.74%
    Recall: 95.67%
    F1-Score: 95.46%
- The AUC (0.9305) is the highest among all models, showing exceptional distinction between delayed and non-delayed flights.
- Accuracy (95.67%) surpasses all other models.
- Precision (95.74%) indicates a strong ability to avoid false positives.
- Recall (95.67%) demonstrates the best ability to capture delayed flights.
- F1-Score (95.46%) shows the best balance between precision and recall.
- Confusion Matrix:
    +-----+----------+------+
    |label|prediction| count|
    +-----+----------+------+
    |    0|       0.0|247966|
    |    0|       1.0|  1093|
    |    1|       0.0| 11922|
    |    1|       1.0| 39468|
    +-----+----------+------+
- True Negatives (247,966): Non-delayed flights correctly identified.
- False Positives (1,093): Flights incorrectly predicted as delayed.
- True Positives (39,468): Delayed flights correctly identified.
- False Negatives (11,922): Delayed flights incorrectly predicted as non-delayed.
Interpreting the neural network metrics reported above:
- AUC (0.926): indicates excellent ability to distinguish between delayed and non-delayed flights; the model ranks predictions well across thresholds.
- Accuracy (95.51%): the model predicts delay status correctly in the large majority of cases.
- Precision (95.24%): high precision shows the model minimizes false positives, seldom misclassifying non-delayed flights as delayed.
- Recall (77.46%): noticeably lower than precision; the model captures most delayed flights but misses a meaningful fraction of them.
- F1-Score (85.43%): balances precision and recall, showing the model is reliable overall.
- Performance Leader:
- The Gradient Boosted Trees model outperforms all other models across all reported metrics, achieving the highest AUC, accuracy, precision, recall, and F1-score.
- Balanced Performance:
- Logistic Regression demonstrates a strong balance across its weighted metrics and has by far the lowest false positive count (534), though its confusion matrix shows it misses more delayed flights than the Neural Network does.
- Interpretability:
- Random Forest provides feature importance insights, which are valuable for understanding delay factors, even though its performance is slightly below Gradient Boosted Trees and Logistic Regression.
- Non-linear Flexibility:
- The Neural Network and Gradient Boosted Trees excel at capturing non-linear relationships in the data, making them suitable for complex datasets.