Welcome to the Diabetes Health Prediction and Analysis project! This repository contains a comprehensive pipeline for predicting diabetes diagnosis using various machine learning and deep learning models, along with in-depth exploratory data analysis and feature engineering.
This project aims to provide a thorough analysis of diabetes-related health data, develop predictive models, and evaluate their performance. The key components of the project include:
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Training
- Model Evaluation
- Comprehensive Reports
Here's an overview of the project directory structure:
```
Diabetes_Health_Prediction_and_Analysis/
├── data/
│   ├── raw/
│   │   └── diabetes_data.csv
│   └── processed/
│       ├── X_train.csv
│       ├── X_train_engineered.csv
│       ├── X_test.csv
│       ├── X_test_engineered.csv
│       ├── y_train.csv
│       └── y_test.csv
├── app/
│   ├── app.py
│   ├── templates/
│   │   └── index.html
│   └── static/
│       └── styles.css
├── models/
│   ├── logistic_regression.pkl
│   ├── random_forest.pkl
│   └── xgboost.pkl
├── notebooks/
│   └── exploratory_data_analysis.ipynb
├── scripts/
│   ├── plots/
│   ├── reports/
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   └── model_performance_report.py
├── tests/
│   ├── models/
│   ├── test_data_preprocessing.py
│   ├── test_feature_engineering.py
│   └── test_model_training.py
├── requirements.txt
└── README.md
```
To get started with this project, follow the steps below:
1. Clone the repository:

   ```bash
   git clone https://github.com/ThecoderPinar/Diabetes_Health_Prediction_and_Analysis.git
   cd Diabetes_Health_Prediction_and_Analysis
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the data preprocessing script:

   ```bash
   python scripts/data_preprocessing.py
   ```

5. Run the feature engineering script:

   ```bash
   python scripts/feature_engineering.py
   ```

6. Train the models:

   ```bash
   python scripts/model_training.py
   ```

7. Evaluate the models:

   ```bash
   python scripts/model_evaluation.py
   ```

8. Generate comprehensive model performance reports:

   ```bash
   python scripts/model_performance_report.py
   ```
- Exploratory Data Analysis: Check the `notebooks/exploratory_data_analysis.ipynb` notebook for detailed data analysis and visualizations.
- Scripts: All scripts for data preprocessing, feature engineering, model training, and evaluation are located in the `scripts/` directory.
- Tests: To ensure code quality and correctness, tests are included in the `tests/` directory. Run them with `pytest`, for example as shown below.
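From the project root, the whole suite or a single test file can be run like this:

```bash
# Run the full test suite from the project root
pytest tests/

# Or run a single test file with verbose output
pytest tests/test_data_preprocessing.py -v
```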
The following models are trained and evaluated in this project:

- Logistic Regression
- Random Forest
- XGBoost
The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.
The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
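As an illustration, here is a minimal sketch of how such a ROC curve and confusion matrix could be produced with scikit-learn. It assumes the saved models were serialized with the standard `pickle` module and trained on the engineered feature CSVs shown in the directory tree; adjust paths and loader if the project used joblib or the plain splits instead.

```python
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay

# Load a trained model and the processed test split (paths from the tree above).
# Assumption: the model was pickled and trained on the engineered features.
with open("models/random_forest.pkl", "rb") as f:
    model = pickle.load(f)
X_test = pd.read_csv("data/processed/X_test_engineered.csv")
y_test = pd.read_csv("data/processed/y_test.csv").squeeze()

y_scores = model.predict_proba(X_test)[:, 1]  # positive-class probabilities
y_pred = model.predict(X_test)

# ROC curve: sensitivity vs. 1 - specificity across decision thresholds.
fpr, tpr, _ = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (sensitivity)")
plt.legend()

# Confusion matrix: TN/FP on the first row, FN/TP on the second (sklearn layout,
# which matches the matrices reported below).
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
plt.show()
```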
The performance of the models is evaluated using the following metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC AUC Score
- Confusion Matrix
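Continuing from the snippet above (reusing its `y_test`, `y_pred`, and `y_scores` arrays), each metric in this list maps directly onto a scikit-learn helper:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# y_test, y_pred, and y_scores come from the previous snippet.
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_scores))  # needs scores, not labels
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```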
Logistic Regression:

- Accuracy: 78.99%
- Precision: 73.19%
- Recall: 70.63%
- F1 Score: 71.89%
- ROC AUC: 83.86%

Confusion Matrix:

```
[[196  37]
 [ 42 101]]
```

Model file: `models/logistic_regression.pkl`
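For instance, this saved model can be loaded and queried for a single patient. This sketch again assumes `pickle` serialization and the engineered test features; both are assumptions, not confirmed by the README:

```python
import pickle
import pandas as pd

# Load the saved logistic regression model (use joblib.load instead
# if the project saved it with joblib rather than pickle).
with open("models/logistic_regression.pkl", "rb") as f:
    model = pickle.load(f)

# Predict the diabetes probability for the first patient in the test split.
X_test = pd.read_csv("data/processed/X_test_engineered.csv")
patient = X_test.iloc[[0]]            # double brackets keep the 2-D shape
prob = model.predict_proba(patient)[0, 1]
print(f"Predicted probability of diabetes: {prob:.2%}")
```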
Random Forest:

- Accuracy: 91.22%
- Precision: 94.35%
- Recall: 81.82%
- F1 Score: 87.64%
- ROC AUC: 97.69%

Confusion Matrix:

```
[[226   7]
 [ 26 117]]
```

Model file: `models/random_forest.pkl`
- Accuracy: The ratio of correctly predicted instances to the total instances.
- Precision: The ratio of true positive predictions to the total predicted positives. It measures the accuracy of positive predictions.
- Recall: The ratio of true positive predictions to the actual positives. It measures the model's ability to identify positive instances.
- F1 Score: The harmonic mean of precision and recall. It provides a balance between precision and recall.
- ROC AUC: The area under the ROC curve. It summarizes the model's ability to distinguish between classes.
Confusion Matrix:
- True Positive (TP): 117 - The number of actual positive cases correctly identified by the model.
- True Negative (TN): 226 - The number of actual negative cases correctly identified by the model.
- False Positive (FP): 7 - The number of actual negative cases incorrectly identified as positive by the model.
- False Negative (FN): 26 - The number of actual positive cases incorrectly identified as negative by the model.
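As a sanity check, these four counts reproduce the reported Random Forest metrics (ROC AUC is the exception, since it depends on the predicted scores rather than on these counts):

```python
# Recompute the Random Forest metrics from its confusion-matrix counts.
TP, TN, FP, FN = 117, 226, 7, 26

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 343/376 ≈ 0.9122
precision = TP / (TP + FP)                          # 117/124 ≈ 0.9435
recall = TP / (TP + FN)                             # 117/143 ≈ 0.8182
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.8764

print(f"Accuracy={accuracy:.4f}  Precision={precision:.4f}  "
      f"Recall={recall:.4f}  F1={f1:.4f}")
# Matches the reported 91.22%, 94.35%, 81.82%, and 87.64%.
```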
XGBoost:

- Accuracy: 91.76%
- Precision: 93.08%
- Recall: 84.62%
- F1 Score: 88.64%
- ROC AUC: 98.41%

Confusion Matrix:

```
[[224   9]
 [ 22 121]]
```

Model file: `models/xgboost.pkl`
Confusion Matrix:
- True Positive (TP): 121 - The number of actual positive cases correctly identified by the model.
- True Negative (TN): 224 - The number of actual negative cases correctly identified by the model.
- False Positive (FP): 9 - The number of actual negative cases incorrectly identified as positive by the model.
- False Negative (FN): 22 - The number of actual positive cases incorrectly identified as negative by the model.
Model performance reports and evaluation metrics are saved and displayed by the `scripts/model_performance_report.py` script.
- Implement more advanced deep learning models (e.g., Neural Networks, LSTM).
- Perform hyperparameter tuning to optimize model performance (see the sketch after this list).
- Explore feature selection techniques to improve model accuracy.
- Integrate additional health datasets for broader analysis.
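As a starting point for the hyperparameter-tuning item above, here is a minimal sketch using scikit-learn's GridSearchCV. The parameter grid is purely illustrative, not the project's actual search space:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Load the processed training split produced by the pipeline scripts.
X_train = pd.read_csv("data/processed/X_train_engineered.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

# Illustrative grid only; the right search space would need experimentation.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # the project reports ROC AUC, so optimize for it
    cv=5,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV ROC AUC:", search.best_score_)
```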
Contributions are welcome! Please feel free to submit a Pull Request.
Whether it's improving the documentation, adding new features, or fixing bugs, your contributions are highly appreciated. Let's make this project better together!
1. Fork the Repository: Click the 'Fork' button at the top right corner of this page to create a copy of this repository in your GitHub account.

2. Clone the Forked Repository:

   ```bash
   git clone https://github.com/your-username/Diabetes_Health_Prediction_and_Analysis.git
   ```

3. Create a New Branch:

   ```bash
   git checkout -b feature/your-feature-name
   ```

4. Make Your Changes: Implement your feature, bug fix, or improvement.

5. Commit Your Changes:

   ```bash
   git commit -m "Add your commit message here"
   ```

6. Push to Your Forked Repository:

   ```bash
   git push origin feature/your-feature-name
   ```

7. Open a Pull Request: Go to the original repository on GitHub and click the 'New Pull Request' button. Compare changes from your forked repository and submit the pull request.
Thank you for your contributions! Together, we can build a more robust and efficient Diabetes Health Prediction and Analysis tool.
This project is licensed under the MIT License.
If you have any questions or suggestions, feel free to open an issue or contact me directly. I am always open to feedback and would love to hear from you!
- Email: piinartp@gmail.com
- GitHub Issues: Open an Issue
- LinkedIn: Your LinkedIn Profile
Thank you for your interest in the Diabetes Health Prediction and Analysis project! Your feedback and suggestions are invaluable in making this project better and more useful for everyone.

⭐️ Don't forget to give this project a star if you found it useful! ⭐️