Skip to content

Implementation of six different classification models on Stroke Prediction, complete with data analysis and performance evaluation

Notifications You must be signed in to change notification settings

timid-angel/stroke-prediction-classifiers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Stroke Prediction Models

This project uses six different models to predict the occurrence of strokes based on various input features. The model is implemented in the notebook file, with a detailed report.

The dataset is imported from Kaggle, a complete overview of the dataset can be found on their website.

Getting Started

To get started with this project, follow the steps below to set up your environment and run the application.

Prerequisites

Ensure you have the following installed:

  • Python 3.7 or higher
  • Jupyter Notebook

Although the Linear Regression algorithm is implemented manually using Batch Gradient Descent, the following libraries are necessary to parse the data, display graphs and test the assumptions of Linear Regression.

The following packages are required to run the jupyter file in its entirety.

  • matplotlib - for plotting graphs

  • pandas - for parsing, filtering and cleaning the dataset

  • scikit-learn - for the classifiers and data set splitting function

  • seaborn - for plotting categorical features

  • imblearn - for the SMOTE function

These dependencies can be found in requirements.txt. To install all of them, simply run the command:

make install

Usage

The notebook provides step-by-step instructions and code cells for training a machine learning model to predict the occurrence of strokes based on individual health parameters. Below is a brief overview of the notebook's contents. It is important to note that the purpose of this project is not to choose one model that works best for the particular problem, but rather to analyze each model's performance when applied to the problem domain and report any implications based on the results of each metric. Hence, no validation set has been used and 30% of the dataset has been allocated to the testing set.

This project involves data preprocessing, data visualization and analysis, model training, and evaluation. It uses logistic regression and other classification techniques to predict stroke occurrences.

Data Preprocessing

This section performs the following operations:

  • Handling missing values by imputing median values for certain features (e.g., bmi).
  • Encoding categorical variables using one-hot encoding for gender, work_type, Residence_type, and other non-numeric columns.
  • Scaling numerical features (e.g., age, avg_glucose_level) using standard scaling to improve model performance.

Data Analysis

This section visualizes the relationships between the features and the target variable (stroke) using:

  • Histograms and bar charts for categorical features like gender and smoking_status.
  • Boxplots to identify potential outliers in key features such as bmi and avg_glucose_level.

Model Training

This section involves training a logistic regression model using the processed dataset. It includes:

  • Splitting the data into training (70%) and testing (30%) sets.
  • Importing the following algorithms from scikit-learn:
    • Logistic Regression
    • Naive Bayes
    • Gaussian Discriminant Analysis (GDA)
    • Support Vector Machine
    • Decision Tree
    • Random Forest
  • Training each model on the training set.

Model Testing and Performance Metrics

The final model is evaluated on the testing data subset. Performance metrics are calculated to ensure the model generalizes well to unseen data.

The performance of the classification model is evaluated using the following metrics:

  • Accuracy: Measures the proportion of correctly predicted instances.
  • Precision, Recall, and F1-Score: Provide insights into the model's balance between positive and negative class predictions.

About

Implementation of six different classification models on Stroke Prediction, complete with data analysis and performance evaluation

Topics

Resources

Stars

Watchers

Forks