Predictive Modelling: Insights from a Kaggle Competition

Abstract

This project provides a comprehensive exploration of the predictive modelling workflow using the Kaggle competition "Titanic - Machine Learning from Disaster" as a case study. It delves into key stages such as exploratory data analysis and model evaluation, aiming to identify the most effective model for the competition. By treating the competition as a real-world predictive modelling task, this project enhances our understanding of the workflow while emphasizing the importance of rigorous model evaluation.

Introduction

Data permeates every aspect of our world, and harnessing its power through data-driven decision-making is crucial for progress. Predictive modelling is a powerful tool that enables organizations to streamline these decision-making processes, uncovering insights and making accurate predictions based on available data. Its applications span diverse industries, including healthcare and finance. For instance, in the medical industry, predictive modelling can help hospitals predict patient survival rates based on medical and health conditions. In the financial industry, credit card companies leverage predictive models to determine whether a transaction is fraudulent. These real-life applications highlight the practical value of predictive modelling.

The workflow of predictive modelling involves several key stages. It begins with data acquisition, exploratory data analysis, and data cleaning to ensure the data is suitable for analysis. Subsequently, the data is tailored and fed into computers, which employ advanced algorithms to learn intricate patterns and relationships within the data. Once the learning process is complete, predictions are generated, enabling decision-makers to make informed choices based on the insights provided by the predictive models. This workflow forms the foundation of predictive modelling.

In this work, we simulate the predictive modelling workflow by participating in the Kaggle Getting Started competition Titanic - Machine Learning from Disaster (Kaggle link). By drawing an analogy between the competition's subtasks and the stages of real-life predictive modelling, we aim to provide an educational and insightful experience. The analogy is established as follows:

| Kaggle Competition Subtask | Analogy to Real-Life Predictive Modelling |
|---|---|
| Training the Model | Model Training Phase |
| Testing the Model | Offline Evaluation Phase |
| Submitting Predictions to Kaggle | Online Evaluation Phase |

Our workflow encompasses three main stages. The first stage is data preparation, which involves downloading the data, performing exploratory data analysis, and cleaning the dataset. The outcome of this stage is a cleaned dataset that will be used in the subsequent stage, model training. During the model training phase, we trained nine different machine learning models using various algorithms, including ensemble methods like random forest, deep learning methods like feedforward neural networks, and probabilistic models like Bayesian classifiers. These models were further scrutinized for overfitting, with decision tree classifiers, random forest classifiers, and neural networks exhibiting signs of overfitting. Through rigorous analysis, the support vector classifier emerged as the best model, demonstrating high accuracy and superior performance in minimizing both false positives and false negatives in classification tasks.

Next, we submitted our predictions to Kaggle to validate our findings on overfitting and model ranking. This process mirrors the online evaluation phase, allowing for an extensive comparison between the online and offline evaluation results. The Kaggle submission aligned with our conclusions on overfitting models and accurately predicted the best-performing model in the Kaggle competition based solely on testing performance.

This document primarily focuses on demonstrating the following workflows in predictive modelling:

  1. Data Acquisition
  2. Exploratory Data Analysis
  3. Data Cleaning
  4. Data Preprocessing
  5. Model Training
  6. Model Evaluation

with a particular emphasis on model evaluation. It includes three Jupyter notebook files that delve into the technical details and discussions of the presented work:

| Jupyter Notebook File | Workflow | Description |
|---|---|---|
| demo_titanic_data_cleaning.ipynb | Data Preparation | Includes data acquisition, exploratory data analysis, data cleaning, and data preprocessing. |
| demo_ML_models.ipynb | Model Training | Trains machine learning models to predict Titanic passenger survival. |
| demo_model_evaluation.ipynb | Model Evaluation | Focuses on overfitting detection and analysis of confusion matrices. |

Kaggle provides an excellent platform for accessing fascinating datasets and showcasing data-related talents. The work presented here aims to educate and document our workflow and results for reference within the data science and machine learning community.

Data Preparation

After downloading the data from Kaggle, we conducted exploratory data analysis to identify areas for data cleaning and preprocessing. Here are the key findings:

  1. The Embarked column has missing values, which can be filled using reliable information from a third-party source.
  2. The Cabin column requires cleaning by replacing the values with the corresponding cabin codes.
  3. The Age column has missing values that can be filled with the mean age of the respective age groups.
  4. To simplify the data, we replaced the SibSp and Parch columns with a new column called group_size, representing the total number of people traveling with each passenger.
  5. We removed less relevant or redundant features by excluding the Fare, Name, and Ticket columns from further analysis.

To streamline data preprocessing, we developed our own titanic_data_cleaning library. Using this library, we performed the necessary data cleaning steps and obtained the cleaned datasets, stored as the csv/train_cleaned.csv and csv/test_cleaned.csv files. The csv/train_cleaned.csv file was used for model training, where we trained nine distinct machine learning models to predict Titanic passenger survival. Subsequently, these trained models generated predictions on the csv/test_cleaned.csv file, and the predictions were submitted to the Kaggle competition for evaluation.
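As an illustration, the steps above can be sketched in pandas roughly as follows. This is only an approximation of what the titanic_data_cleaning library does: the Embarked fill value, the Age imputation groups, and the exact group_size definition are assumptions made for the example.

```python
import pandas as pd

train = pd.read_csv("csv/train.csv")  # assumed location of the raw Kaggle file

# 1. Fill the missing Embarked values (assumed here to be "S", per third-party sources).
train["Embarked"] = train["Embarked"].fillna("S")

# 2. Reduce Cabin to its cabin code (the leading deck letter), keeping NaN where unknown.
train["Cabin"] = train["Cabin"].str[0]

# 3. Fill missing ages with the mean age of an assumed grouping (here Pclass and Sex).
train["Age"] = train.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda age: age.fillna(age.mean())
)

# 4. Replace SibSp and Parch with a single group_size feature.
train["group_size"] = train["SibSp"] + train["Parch"]
train = train.drop(columns=["SibSp", "Parch"])

# 5. Drop the less relevant or redundant features.
train = train.drop(columns=["Fare", "Name", "Ticket"])

train.to_csv("csv/train_cleaned.csv", index=False)
```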

For a detailed walkthrough of our data preparation procedure, please refer to the demo_titanic_data_cleaning.ipynb file.

Model Training

The following machine learning models have been trained with the goal of predicting the survival of Titanic passengers.

| Name | Notation | Type |
|---|---|---|
| Dummy Classifier | dummy | Baseline Model |
| Decision Tree Classifier | tree | Base Model |
| Random Forest Classifier | forest | Ensemble Method |
| Support Vector Classifier | support_vector | Support Vector Machine |
| Neural Network | neural_network | Deep Learning Method |
| Logistic Regression | logistic | Linear Model |
| Gaussian Naive Bayes Classifier | gaussian_NB | Probabilistic Model |
| Bernoulli Naive Bayes Classifier | bernoulli_NB | Probabilistic Model |
| AdaBoost Classifier | adaboost | Ensemble Method |

To establish a baseline for model comparison, we include the Dummy Classifier. This allows us to assess the reliability of other classifiers, as any model that falls short of outperforming the Dummy Classifier lacks credibility. While Logistic Regression serves as the simplest non-trivial model for classification tasks, we recognize its limitations in capturing the complex data structures associated with human survivability. To address this, we introduce the Neural Network and Support Vector Machine models, which offer enhanced reliability.

Furthermore, we acknowledge that the features in our training set are categorical. To facilitate practical model training, we convert these categorical features into binary features, treating them as a series of yes-or-no questions. This conversion aligns naturally with the Decision Tree Classifier, Bernoulli Naive Bayes Classifier, and, arguably, the Random Forest Classifier. Additionally, we include the Gaussian Naive Bayes Classifier and AdaBoost Classifier to diversify our range of models.
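A minimal scikit-learn sketch of this setup is shown below. The feature names and hyperparameters are assumptions made for illustration; the actual settings live in demo_ML_models.ipynb, and the remaining models follow the same fit/score pattern.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("csv/train_cleaned.csv")
y = train["Survived"]

# Treat every feature as categorical and expand it into binary (yes-or-no) columns.
feature_cols = ["Pclass", "Sex", "Embarked", "Cabin", "group_size"]  # assumed subset
X = pd.get_dummies(train[feature_cols].astype(str))

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "tree": DecisionTreeClassifier(random_state=0),
    "support_vector": SVC(),
    "logistic": LogisticRegression(max_iter=1000),
}

for notation, model in models.items():
    model.fit(X, y)
    print(f"{notation}: training accuracy = {model.score(X, y):.4f}")
```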

For detailed information on each model and its specific parameters, please refer to the demo_ML_models.ipynb file.

Model Evaluation

Having trained nine distinct machine learning models to predict the survival of Titanic passengers, we now evaluate them thoroughly. The primary focus of this evaluation phase is to identify the most suitable model for the upcoming Kaggle competition, using the provided training data.

The foundation of our evaluation phase lies in the accuracy scores of each prediction submitted to Kaggle (Kaggle leaderboard link). In the preceding section, we trained nine different models on the csv/train_cleaned.csv dataset and collected the corresponding training and testing predictions in the csv/data/predictions.csv dataset. Our aim is to identify the model with the highest Kaggle submission accuracy based solely on its training and testing performance in predictions.csv; relying on these performances alone still allows a reliable assessment of each model's predictive capabilities and its potential for success in the Kaggle competition. The evaluation procedure unfolds as follows:

  1. First, we examine the training and testing accuracies of each model to detect signs of overfitting.
  2. Next, we select the models that pass the overfitting test and exhibit the highest testing accuracies.
  3. In the event of a tie, we employ post-hoc analyses, such as analysis of confusion matrices, to identify the better model.
  4. Finally, we compare our findings with the Kaggle submission accuracies, validating if our recommended choice indeed emerges as the Kaggle champion among all evaluated models, or if it simply represents another instance of overfitting.

We begin this section with a brief examination of the training and testing accuracies, providing insights into the performance of each model. Next, we perform hypothesis testing to detect potential cases of overfitting across the models. Finally, confusion matrices enable a comprehensive assessment of the remaining candidates, ultimately leading to the identification of the optimal model for the Kaggle competition.

Supplementing this section are the demo_model_evaluation.ipynb and titanic_ml_classes/titanic_evaluation_helpers.py files. The former serves as a comprehensive guide to the evaluation phase, including the source code for each visualization shown in this section. The latter is a collection of custom Python functions written to support the analysis. Together, they provide the practical implementation and the technical tools behind our evaluation methodology.

A First Inspection of the Training and Testing Accuracies

The training and testing accuracies for each model are shown in the bar chart below.

The neural_network model achieves the highest testing accuracy of 0.8320, showing a marginal improvement of only 0.91% over the support_vector and logistic models, which obtained a score of 0.8246. Following closely in the third position is the forest model with an accuracy of 0.8209, followed by adaboost (0.8097), gaussian_NB (0.7799), and bernoulli_NB (0.7724). It is noteworthy that the dummy model exhibits the lowest performance, scoring only 0.6157.
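A chart along these lines can be reproduced with a few lines of matplotlib; the accuracies below are copied from the hypothesis-testing table later in this section, and the exact styling in demo_model_evaluation.ipynb will differ.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Training/testing accuracies as reported in the hypothesis-testing table below.
accuracies = pd.DataFrame(
    {
        "training": [0.6164, 0.8941, 0.8941, 0.8636, 0.8828, 0.8283, 0.7721, 0.7849, 0.8170],
        "testing":  [0.6157, 0.7948, 0.8209, 0.8246, 0.8321, 0.8246, 0.7799, 0.7724, 0.8097],
    },
    index=["dummy", "tree", "forest", "support_vector", "neural_network",
           "logistic", "gaussian_NB", "bernoulli_NB", "adaboost"],
)

ax = accuracies.plot.bar(figsize=(9, 4), rot=45, title="Training vs. testing accuracy by model")
ax.set_ylabel("Accuracy")
plt.tight_layout()
plt.show()
```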

Can we infer from these results that the neural_network model will surpass the others in the final submission stage? It is worth noting that there is a substantial disparity between the training and testing accuracies for the neural_network, as well as the tree and forest models. Therefore, it is crucial to determine if a candidate is overfit before considering it as the champion. In fact, evaluating for overfitting should be the first step in the model assessment process.

Hypothesis Testing and Confidence Intervals in Overfitting Detection

The general principle states that a reliable machine learning model should exhibit consistent performance in both the training and testing phases. Therefore, our primary goal is to identify any notable disparities between the training and testing performances, which would indicate an overfitted model.

In an analytical context, the comparison involves analyzing data obtained from the training and testing phases. This is where hypothesis testing comes into play. We formulate the null hypothesis as "the training accuracy is equal to the testing accuracy" and the alternative hypothesis as "the training accuracy is not equal to the testing accuracy" for each model:
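In symbols, writing $p_{\text{train}}$ and $p_{\text{test}}$ for a model's training and testing accuracies:

$$H_0 : p_{\text{train}} = p_{\text{test}} \qquad \text{versus} \qquad H_1 : p_{\text{train}} \neq p_{\text{test}}.$$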

To facilitate this analysis, we create a new dataframe called correct_predictions, where the column correct_predictions[model] is a binary array that indicates whether a sample is correctly classified. Please refer to the screenshot below for a glimpse of correct_predictions.head().

This construction is crucial because the column correct_predictions[model] represents a series of n = (number of rows) independent Bernoulli trials, each with success probability p equal to the model's accuracy. Under this description, the de Moivre-Laplace Theorem (Wikipedia link) tells us that the accuracy estimate is approximately normally distributed, allowing us to utilize either the Z-test (if n > 30) or the Student's t-test (if n <= 30).

By applying the Two-Proportion Z-Test to the columns correct_predictions[model] for train = 0 and train = 1, we can compare the accuracies and quantify the detection of overfitting. We set the significance level alpha = 0.05. Additionally, to further analyze the results, we also compute the 95% confidence intervals for the training and testing accuracies. The results are displayed in the following table; a corresponding image file is available here.

| Model | Training Accuracy | Testing Accuracy | 95% CI (Training) | 95% CI (Testing) | z-score | p-value | Reject Null Hypothesis? |
|---|---|---|---|---|---|---|---|
| dummy | 0.6164 | 0.6157 | (0.5782, 0.6546) | (0.5574, 0.6739) | 0.0197 | 0.9843 | False |
| tree | 0.8941 | 0.7948 | (0.8699, 0.9182) | (0.7464, 0.8431) | 3.9673 | 0.0001 | True |
| forest | 0.8941 | 0.8209 | (0.8699, 0.9182) | (0.7750, 0.8668) | 2.9984 | 0.0027 | True |
| support_vector | 0.8636 | 0.8246 | (0.8366, 0.8905) | (0.7791, 0.8702) | 1.5004 | 0.1335 | False |
| neural_network | 0.8828 | 0.8321 | (0.8576, 0.9081) | (0.7873, 0.8768) | 2.0490 | 0.0405 | True |
| logistic | 0.8283 | 0.8246 | (0.7986, 0.8579) | (0.7791, 0.8702) | 0.1312 | 0.8956 | False |
| gaussian_NB | 0.7721 | 0.7799 | (0.7391, 0.8050) | (0.7302, 0.8295) | -0.2548 | 0.7989 | False |
| bernoulli_NB | 0.7849 | 0.7724 | (0.7526, 0.8172) | (0.7222, 0.8226) | 0.4146 | 0.6784 | False |
| adaboost | 0.8170 | 0.8097 | (0.7867, 0.8474) | (0.7627, 0.8567) | 0.2577 | 0.7966 | False |
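The z-scores, p-values, and confidence intervals above can be reproduced roughly as follows. The column layout of csv/data/predictions.csv assumed here (a train flag, the true label Survived, and one prediction column per model) is an illustration, not a description of the actual file format.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

pred = pd.read_csv("csv/data/predictions.csv")  # assumed layout, see note above
model = "support_vector"

# Binary series of correct predictions: a sequence of independent Bernoulli trials.
correct = (pred[model] == pred["Survived"]).astype(int)
is_train = pred["train"] == 1

n_train, n_test = is_train.sum(), (~is_train).sum()
p_train, p_test = correct[is_train].mean(), correct[~is_train].mean()

# Two-proportion z-test with a pooled success probability.
p_pool = correct.mean()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_train + 1 / n_test))
z = (p_train - p_test) / se
p_value = 2 * norm.sf(abs(z))

# 95% (Wald) confidence interval for an accuracy estimated from n trials.
def confidence_interval(p, n, alpha=0.05):
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(f"z = {z:.4f}, p-value = {p_value:.4f}")
print("training CI:", confidence_interval(p_train, n_train))
print("testing CI:", confidence_interval(p_test, n_test))
```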

The hypothesis test reveals statistically significant differences between the training and testing accuracies for the tree, forest, and neural_network models. Moreover, the Jaccard index (Wikipedia link) of the training and testing confidence intervals demonstrates a substantial dissimilarity between them. These findings strongly indicate the presence of overfitting. As a result, we have excluded the tree, forest, and neural_network models from the selection pool, even though the neural_network model exhibits the highest testing accuracy.
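For two intervals, the Jaccard index is the length of their intersection divided by the length of their union, so a value near zero means the training and testing confidence intervals barely overlap. Below is a minimal sketch of one way to compute it; the helper actually used in titanic_evaluation_helpers.py may be implemented differently.

```python
def interval_jaccard(a, b):
    """Jaccard index of two closed intervals a = (lo, hi) and b = (lo, hi)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 1.0

# The forest model's confidence intervals from the table above are disjoint.
print(interval_jaccard((0.8699, 0.9182), (0.7750, 0.8668)))  # 0.0
```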

Our next candidates are support_vector and logistic, which share a testing accuracy of 0.8246. How should one break the tie in a situation like this?

Analysis of Model Performance using Confusion Matrix

The accuracy of a model is defined as the ratio of correctly classified instances to the total number of instances:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP, TN, FP, and FN represent the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively. It is important to note that maximizing accuracy involves minimizing both False Positives and False Negatives. Therefore, it becomes crucial to thoroughly understand a model's capability in minimizing these errors.

To facilitate this understanding, we have included visual representations of six confusion matrices in the form of four bar graphs, which can be seen below:

Analyzing these visualizations, we observe that the support_vector model appears to demonstrate greater control over FN and FP. We use the following evaluation metrics to quantify each model's ability to suppress incorrectly classified instances:

| Metric | Formula |
|---|---|
| True Positive Rate (Sensitivity) | $\mathrm{TPR} = \frac{TP}{TP + FN}$ |
| Negative Predictive Value | $\mathrm{NPV} = \frac{TN}{TN + FN}$ |
| True Negative Rate (Specificity) | $\mathrm{TNR} = \frac{TN}{TN + FP}$ |
| Positive Predictive Value | $\mathrm{PPV} = \frac{TP}{TP + FP}$ |
| Balanced Accuracy | $\frac{\mathrm{TPR} + \mathrm{TNR}}{2}$ |
| F-score | $\frac{2\,TP}{2\,TP + FP + FN}$ |

Take the True Negative Rate (TNR), for instance. Having a TNR closer to 1 is equivalent to having a lower value for FP. Therefore, a higher TNR score indicates a stronger ability to minimize FP.
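The formulas in the table translate directly into code. The sketch below derives all six metrics from the binary confusion-matrix counts; it is an illustrative helper, not the implementation used in titanic_evaluation_helpers.py.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluation_metrics(y_true, y_pred):
    """Compute the six metrics above from the binary confusion-matrix counts."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)                 # True Positive Rate (sensitivity)
    npv = tn / (tn + fn)                 # Negative Predictive Value
    tnr = tn / (tn + fp)                 # True Negative Rate (specificity)
    ppv = tp / (tp + fp)                 # Positive Predictive Value
    return {
        "TPR": tpr,
        "NPV": npv,
        "TNR": tnr,
        "PPV": ppv,
        "Balanced Accuracy": (tpr + tnr) / 2,
        "F-score": 2 * tp / (2 * tp + fp + fn),
    }

# Toy example: five passengers, predictions from a hypothetical model.
print(evaluation_metrics(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 1])))
```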

For a more detailed and technical discussion on the evaluation metrics, including the rationale behind their selection, we recommend referring to the demo_model_evaluation.ipynb file. Within the notebook, you will find in-depth explanations and insights into the metrics, providing a deeper understanding of their significance in our analysis.

We have also provided visual representations of these metrics in the form of bar graphs:

In terms of FN-minimizing metrics, such as the True Positive Rate and the Negative Predictive Value, the support_vector model outperforms the logistic model. Specifically, the True Positive Rate in testing is 0.7573 for the support_vector model compared to 0.7379 for the logistic model, while the Negative Predictive Value is 0.8512 for the support_vector model and 0.8430 for the logistic model. Consequently, the support_vector model demonstrates at least a 0.96% improvement in minimizing False Negatives compared to the logistic model.

Conversely, the FP-minimizing metrics, such as the Positive Predictive Value and the True Negative Rate, show the opposite trend. The logistic model exhibits more favorable values of 0.7917 and 0.8788, respectively, compared to 0.7800 and 0.8667 for the support_vector model. This explains the identical accuracy observed between the support_vector and logistic models: the former excels in controlling False Negatives, while the latter shows superior performance in suppressing False Positives. Overall, the logistic model outperforms the support_vector model by at least 1.38% in terms of minimizing False Positives.

To make a definitive decision between the two models, it becomes imperative to employ metrics that effectively minimize both False Positives and False Negatives simultaneously. Notably, the support_vector model outperforms the logistic model in metrics like the Balanced Accuracy (0.8120 for support_vector versus 0.8083 for logistic) and the F-score (0.7685 for support_vector versus 0.7638 for logistic) based on testing results. Consequently, the support_vector model surpasses the logistic model by at least 0.46% in effectively minimizing both False Positives and False Negatives.

In summary, the support_vector model demonstrates superior control over the occurrence of False Positives and False Negatives compared to the logistic model, making it the recommended choice for the Kaggle competition. Our comprehensive analysis provides valuable insights into the model's performance, allowing us to make an informed comparison with Kaggle submissions.

Evaluating Model Performance: Kaggle Submission Results

Based on the previous analysis, two key conclusions can be drawn:

  1. At the 95% confidence level, the tree, forest, and neural_network models exhibit signs of overfitting.
  2. Despite sharing the same testing accuracy, the support_vector model is expected to outperform the logistic model in the Kaggle competition.

To validate these conclusions, predictions from each model were submitted to the Kaggle competition. The bar chart below illustrates the Kaggle submission results, and a screenshot of the Kaggle submission result can be viewed here.

Among the models evaluated, the support_vector model emerged as the top-performing model with an impressive accuracy score of 0.7895 in the Kaggle submission. Following closely behind is the logistic model, which attained a commendable accuracy score of 0.7751, securing the second position. The forest model occupied the third position with a respectable accuracy score of 0.7656, followed by adaboost (0.7632), neural_network (0.7584), tree (0.7536), bernoulli_NB (0.7344), and gaussian_NB (0.7153).

It is important to note that the accuracy values displayed in the bar chart may differ slightly due to a known bug related to rounding in the matplotlib library. However, this discrepancy does not affect the rankings and relative performance of the models.

Among the evaluated models, the dummy model exhibited the lowest performance, achieving an accuracy score of only 0.6220.

Furthermore, analyzing the Kaggle submission results reinforces our observations regarding the overfitting behavior of the models. All Kaggle submission accuracies, except for the dummy model, are lower than the testing accuracies. This indicates that the tree, forest, and neural_network models indeed suffer from overfitting, as their performance declines when applied to unseen data. Notably, the submission accuracies of the forest and neural_network models lie outside the 95% confidence interval of their testing accuracies, highlighting a more severe overfitting condition.

In contrast, the submission accuracy for the support_vector model falls within the confidence interval of the testing accuracy. This consistency between the testing phase and the final submission phase reaffirms the reliability of the support_vector model, which demonstrates the best performance among all evaluated models in the Kaggle competition.

Conclusion

The Kaggle submission results reinforce our conclusions regarding the overfitting behavior of the tree, forest, and neural_network models. Additionally, the support_vector model, with the highest submission accuracy among all evaluated models, proves to be a reliable choice for the Kaggle competition.

Overall, these findings emphasize the importance of addressing overfitting issues and selecting appropriate models to achieve optimal performance in machine learning competitions.
