yvesdylane/data_processing


🚢 LAB 3: Titanic Data Processing & Modeling 🧑‍💻

Welcome to the Titanic Data Preprocessing & Modeling project! 🌊 This project shows how to preprocess data (handling missing values and outliers, engineering features) and build models such as Logistic Regression and Random Forest! 🔥

🛠️ Tools & Libraries Used

This project is built using Python 🐍 and the following libraries:

Pandas 🐼: For data manipulation.

NumPy 🔢: For numerical operations.

Matplotlib 📊 & Seaborn 🎨: For visualizations.

Scikit-learn 🤖: For machine learning and model evaluation.

🔍 Project Overview

The main objective of this project is to preprocess the Titanic dataset and build models to predict passenger survival 🛳️. Here's a quick breakdown of the steps taken:

Data Collection 📦

Used the Titanic dataset from seaborn 🎯

Data Cleaning 🧹

Handled missing values, outliers, and invalid data 🚮
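As a rough idea of what the cleaning step can look like, here is a minimal sketch. It is not the project's actual code: the tiny DataFrame only stands in for `sns.load_dataset("titanic")`, and median imputation for age, mode imputation for embarked, and treating negative fares as invalid are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for sns.load_dataset("titanic") (assumption, not real rows).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "fare": [7.25, 71.28, -1.0, 8.05, 53.10],
    "embarked": ["S", "C", "Q", None, "S"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with an invalid fare, then fill missing age and embarked."""
    # Assumption: a negative fare counts as invalid data.
    out = frame[frame["fare"] >= 0].copy()
    # Assumption: median for the numeric column, most frequent value for the categorical one.
    out["age"] = out["age"].fillna(out["age"].median())
    out["embarked"] = out["embarked"].fillna(out["embarked"].mode()[0])
    return out.reset_index(drop=True)


cleaned = clean(df)
print(cleaned)
```

The median is a common pick for age because it is less sensitive to extreme values than the mean, but other imputation strategies would fit the same pattern.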

Outliers Handling 🚫

Capped the extreme values in age and fare columns ✂️
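The capping rule is not spelled out above, so the sketch below assumes the common 1.5 × IQR fences; percentile-based capping would be an equally reasonable choice. The helper name `cap_outliers` and the sample fares are hypothetical.

```python
import pandas as pd


def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, k=1.5 by convention)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up fares with one extreme value.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="fare")
capped = cap_outliers(fares)
print(capped)
```

Capping keeps every row and only pulls the extreme value back to the fence, which is why it is often preferred over dropping outliers on a small dataset like this one.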

Normalization 📏

Scaled the numerical features using Min-Max scaling or Z-score normalization 🎚️
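Both scaling options can be sketched with scikit-learn's `MinMaxScaler` and `StandardScaler`. The small age/fare array is made up for illustration; which scaler the project ends up using per column is not stated above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up rows; columns are age and fare.
X = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 7.93],
    [35.0, 53.10],
])

# Min-Max scaling: (x - min) / (max - min), so each column ends up in [0, 1].
minmax = MinMaxScaler().fit_transform(X)

# Z-score normalization: (x - mean) / std, so each column has mean 0 and unit variance.
zscore = StandardScaler().fit_transform(X)

print(minmax)
print(zscore)
```

When a train/test split is involved, the usual practice is to fit the scaler on the training rows only and reuse it to transform the test rows.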

Feature Engineering 🛠️

Created new features like family_size and extracted title from names 📛
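Here is one way these two features could be derived. It is a sketch under assumptions: `family_size` is taken to be `sibsp + parch + 1` (the passenger counts too), and the title is pulled from a Kaggle-style `name` column with a regular expression. As far as I can tell, the copy of the dataset that ships with seaborn has no `name` column, so the title part presumably needs a version of the data that includes names.

```python
import pandas as pd

# Names follow the Kaggle-style "Surname, Title. Given names" layout (assumed format).
df = pd.DataFrame({
    "name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "sibsp": [1, 1, 0],
    "parch": [0, 0, 0],
})

# Assumption: siblings/spouses + parents/children + the passenger themselves.
df["family_size"] = df["sibsp"] + df["parch"] + 1

# Capture the text between the comma and the first period, e.g. "Mr", "Mrs", "Miss".
df["title"] = df["name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(df[["family_size", "title"]])
```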

Feature Selection 📑

Selected important features using correlation and feature importance analysis 🔍
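The two selection signals mentioned here can be sketched as follows. The data is synthetic (one informative column, one pure-noise column, both hypothetical), so the ranking is known in advance; on the real dataset the same calls would simply be run on the engineered features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: "signal" drives the target, "noise" does not (illustration only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["survived"] = (df["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

features = ["signal", "noise"]

# 1) Absolute Pearson correlation of each feature with the target.
corr = df[features].corrwith(df["survived"]).abs().sort_values(ascending=False)

# 2) Impurity-based feature importances from a Random Forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(df[features], df["survived"])
importance = pd.Series(forest.feature_importances_, index=features)
importance = importance.sort_values(ascending=False)

print(corr)
print(importance)
```

Correlation only picks up linear relationships, while forest importances can also reflect non-linear effects, which is one reason to look at both.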

Model Building 🧑‍🔬

Built Logistic Regression & Random Forest models for classification 🤖
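A minimal sketch of fitting both models side by side. Everything specific here is an assumption: the data is synthetic, and the 80/20 stratified split, `max_iter`, `n_estimators`, and random seeds are illustrative choices, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix and survival labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Hold out 20% for evaluation, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```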

🚀 Steps to Run the Project

Clone this repository:

📊 Results

After training our models, here's what we found:

Logistic Regression Results 🚀

Accuracy: 0.79

Precision: 0.72

Recall: 0.72

F1 Score: 0.72

Random Forest Results 🌲

Accuracy: 0.78

Precision: 0.74

Recall: 0.65

F1 Score: 0.69

👀 Logistic Regression came out ahead on recall (0.72 vs. 0.65), F1 score (0.72 vs. 0.69), and, marginally, accuracy (0.79 vs. 0.78), while Random Forest had slightly higher precision (0.74 vs. 0.72).
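For reference, the four metrics above can be computed with `sklearn.metrics`, and the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below checks the reported F1 values against the reported precision and recall, then computes all four metrics on a small made-up set of labels (the labels are purely illustrative, not the project's predictions).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the results above.
reported = {"Logistic Regression": (0.72, 0.72), "Random Forest": (0.74, 0.65)}
f1_check = {name: round(f1_from(p, r), 2) for name, (p, r) in reported.items()}
print(f1_check)  # {'Logistic Regression': 0.72, 'Random Forest': 0.69}

# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
print(metrics)
```

Both reported F1 scores are consistent with their precision and recall figures.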

🤔 Why Titanic Dataset?

The Titanic dataset is famous for demonstrating basic machine learning tasks such as classification 📚. It's easy to understand yet provides a challenging problem with both categorical and numerical data 🧠.

🔥 Features of this Project

Easy-to-follow data preprocessing pipeline 📋

Intuitive visualizations 📈 to understand the data and outliers

Fun feature engineering to get the most out of the dataset ⚙️

Two powerful models: Logistic Regression and Random Forest 🎯

🙌 How to Contribute

Fork this repository 🍴

Create your feature branch: git checkout -b my-new-feature 🌵

Commit your changes: git commit -am 'Add some feature' 💾

Push to the branch: git push origin my-new-feature 🚀

Submit a pull request 🎉

💡 Fun Facts

Did you know? The Titanic was built in Belfast, in what is now Northern Ireland.

The real story of the Titanic is both tragic and heroic, inspiring numerous books and movies 🎬.

🤩 Meet the Team

Yves Dylane 💻 Project Lead

📬 Contact

Feel free to contact us with any questions or suggestions! 📧

Enjoy coding and have fun! 🎉
