This project is an end-to-end machine learning system that predicts the risk of heart disease using clinical features. It is built using the UCI Cleveland Heart Disease dataset and deployed as an interactive Streamlit web application.
Early detection of heart disease is critical for preventive healthcare. This project aims to estimate the probability of heart disease based on patient clinical attributes using machine learning.
- Source: UCI Cleveland Heart Disease Dataset
- Samples: 304 patients
- Target:
  - 0 → No heart disease
  - 1 → Heart disease present
Only the Cleveland subset is used, which avoids the data leakage and corrupted labels present in other variants of the dataset.
- Data filtering (Cleveland-only)
- Target binarization
- Feature selection & cleanup
- Encoding:
  - Binary: sex, fbs, exang
  - One-hot: chest pain (cp), restecg
- Train/test split (stratified)
- Models:
  - Logistic Regression (baseline)
  - XGBoost (final model)
- Probability calibration (Isotonic Regression)
- Model explainability using SHAP
- Deployment using Streamlit
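
The preprocessing and modeling steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's actual code: the data is randomly generated, the raw label column name `num` and the string/boolean raw encodings are guesses, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Cleveland rows: random values, NOT real patient data.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "sex": rng.choice(["Male", "Female"], n),       # assumed raw encoding
    "cp": rng.choice(["typical", "atypical", "non-anginal", "asymptomatic"], n),
    "fbs": rng.choice([True, False], n),            # assumed raw encoding
    "restecg": rng.choice(["normal", "st-t", "lvh"], n),
    "thalach": rng.integers(90, 200, n),
    "exang": rng.choice([True, False], n),          # assumed raw encoding
    "oldpeak": rng.uniform(0, 4, n).round(1),
    "num": rng.integers(0, 5, n),                   # assumed raw 0-4 label column
})

# Target binarization: 0 stays 0, any positive severity becomes 1.
df["target"] = (df["num"] > 0).astype(int)

# Binary encoding for sex, fbs, exang.
df["sex"] = (df["sex"] == "Male").astype(int)
df["fbs"] = df["fbs"].astype(int)
df["exang"] = df["exang"].astype(int)

# One-hot encoding for chest pain type and resting ECG.
df = pd.get_dummies(df, columns=["cp", "restecg"], dtype=int)

X = df.drop(columns=["num", "target"])
y = df["target"]

# Stratified split keeps the class ratio similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: logistic regression on standardized features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

# Boosted trees (stand-in for XGBoost) wrapped in isotonic calibration.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
)
calibrated.fit(X_train, y_train)

# Calibrated probability of the positive class for each test patient.
proba = calibrated.predict_proba(X_test)[:, 1]
```

Because the features and labels here are random, the resulting scores carry no meaning; the sketch only shows how the steps fit together.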
| Model | Accuracy | F1 Score | ROC-AUC |
|---|---|---|---|
| Logistic Regression | 0.87 | 0.85 | 0.92 |
| XGBoost (Calibrated) | 0.89 | 0.88 | 0.92 |
Isotonic calibration improved the reliability of the predicted probabilities (Brier score: 0.094; lower is better).
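
For reference, the Brier score is the mean squared difference between the predicted probability and the actual 0/1 outcome. The sketch below computes it by hand on a few made-up predictions; the function name `brier_score` and the sample values are illustrative only. scikit-learn provides the same metric as `sklearn.metrics.brier_score_loss`.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels and predicted probabilities (not project results).
labels = [0, 1, 1, 0]
probs = [0.1, 0.9, 0.8, 0.3]

score = brier_score(labels, probs)  # (0.01 + 0.01 + 0.04 + 0.09) / 4 = 0.0375
```

A model that always predicts 0.5 scores 0.25, so values well below that indicate probabilities that track the outcomes.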
Top predictive features:
- Sex
- Oldpeak (ST depression)
- Maximum heart rate (thalach)
- Age
- Chest pain type
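
A ranking like the one above is commonly derived from SHAP values by averaging their absolute magnitude per feature. The sketch below shows that aggregation step with NumPy only. The `shap_values` matrix is hypothetical (in practice it would come from a SHAP explainer run on the trained model), and chest pain is shown as a single column for brevity even though it is one-hot encoded in the pipeline.

```python
import numpy as np

feature_names = ["sex", "oldpeak", "thalach", "age", "cp"]

# Hypothetical SHAP values, shape (n_samples, n_features); not project output.
shap_values = np.array([
    [0.40, -0.30, 0.10, 0.05, -0.02],
    [-0.50, 0.20, -0.20, 0.10, 0.03],
    [0.30, -0.40, 0.15, -0.05, 0.01],
])

# Global importance: mean absolute contribution of each feature across samples.
importance = np.abs(shap_values).mean(axis=0)

# Sort feature names from most to least important.
order = np.argsort(importance)[::-1]
ranking = [feature_names[i] for i in order]
```

Taking the absolute value matters: a feature that pushes some predictions up and others down would otherwise average out to near zero.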
- Python
- scikit-learn
- XGBoost
- SHAP
- Streamlit
- pandas, numpy
- joblib
pip install -r requirements.txt
streamlit run app/app.py