A single notebook that applies XGBoost binary classification to two independent medical datasets — Cardiovascular Disease and Diabetes — demonstrating the algorithm's versatility across different feature spaces, class distributions, and hyperparameter regimes.
- Why Two Datasets in One Project
- Repository Layout
- Datasets
- 3.1 Cardiovascular Disease
- 3.2 Diabetes
- Methodology
- Model Configuration
- Results
- Key Takeaways
- Reproducing the Analysis
- Tech Stack
- Roadmap
- License
Running the same algorithm on two structurally different problems in a single notebook is a deliberate choice, not a limitation. It lets a reader — or a hiring manager — see three things at once:
| What it shows | How |
|---|---|
| Algorithm mastery | The same XGBoost API is correctly wired for both tasks without copy-paste errors |
| Hyperparameter awareness | `max_depth` is set to 10 for the large cardiovascular dataset and deliberately reduced to 1 for the small diabetes dataset — a conscious trade-off, not a random pick |
| Generalisation thinking | The notebook is not over-fit to one domain; the pipeline transfers cleanly to a completely different feature space |
This is standard practice in ML portfolios. A single-algorithm, multi-domain notebook is more informative than two half-finished notebooks sitting in separate repos.
```text
xgboost-classification/
│
├── XGBoost_Classification_Problem.ipynb   ← End-to-end notebook (both models)
├── dataset.csv                            ← Combined or primary dataset file
├── requirements.txt                       ← Pinned Python dependencies
├── LICENSE                                ← MIT License
└── README.md                              ← This file
```
| Attribute | Value |
|---|---|
| Rows | 70,000 |
| Raw columns | 13 |
| Columns after preprocessing | 12 (id dropped) |
| Target column | cardio (0 = healthy, 1 = cardiovascular disease) |
| Target balance | 49.97 % positive — nearly perfectly balanced |
| Age encoding | Originally in days; converted to years (age / 365) |
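The day-to-year conversion is a one-liner in pandas. A minimal sketch, assuming the raw `age` column holds days as in the original CSV (the toy values below are illustrative):

```python
import pandas as pd

# Toy frame standing in for the cardiovascular CSV (ages in days)
df = pd.DataFrame({"age": [18250, 21900, 14600]})

# Convert days to whole years, matching the age / 365 rule above
df["age"] = (df["age"] / 365).astype(int)

print(df["age"].tolist())  # [50, 60, 40]
```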
Features used for training (11):
| Feature | Description |
|---|---|
| `age` | Patient age in years (converted from days) |
| `gender` | 1 = female, 2 = male |
| `height` | Height in cm |
| `weight` | Weight in kg |
| `ap_hi` | Systolic blood pressure |
| `ap_lo` | Diastolic blood pressure |
| `cholesterol` | 1 = normal, 2 = above normal, 3 = well above normal |
| `gluc` | 1 = normal, 2 = above normal, 3 = well above normal |
| `smoke` | 0 = non-smoker, 1 = smoker |
| `alco` | 0 = no alcohol, 1 = alcohol use |
| `active` | 0 = physically inactive, 1 = active |
Top correlations with the cardio target:
| Feature | Pearson r |
|---|---|
| age | 0.238 |
| cholesterol | 0.221 |
| weight | 0.182 |
| gluc | 0.089 |
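Values like these come straight out of `pandas.DataFrame.corr()`. A minimal sketch on synthetic data (the real numbers above were computed on the full 70 k-row CSV, not this toy frame):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
age = rng.normal(53, 7, n)
# Synthetic target loosely tied to age so a positive correlation appears
cardio = (age + rng.normal(0, 20, n) > 53).astype(int)

df = pd.DataFrame({"age": age, "cardio": cardio})

# Pearson r of every feature against the target, sorted descending
corrs = df.corr()["cardio"].drop("cardio").sort_values(ascending=False)
print(corrs)
```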
| Attribute | Value |
|---|---|
| Rows | 768 |
| Columns | 9 |
| Target column | Outcome (0 = no diabetes, 1 = diabetes) |
| Target balance | 35.0 % positive — moderately imbalanced |
Features used for training (8):
| Feature | Description |
|---|---|
| `Pregnancies` | Number of pregnancies |
| `Glucose` | Plasma glucose concentration (2-hour oral glucose tolerance test) |
| `BloodPressure` | Diastolic blood pressure (mm Hg) |
| `SkinThickness` | Triceps skin fold thickness (mm) |
| `Insulin` | 2-hour serum insulin (mu U/ml) |
| `BMI` | Body mass index (kg/m²) |
| `DiabetesPedigreeFunction` | Genetic likelihood score based on family history |
| `Age` | Age in years |
Both models follow the same reproducible pipeline. The notebook is divided into clearly labelled sections so each stage is easy to locate.
```text
┌──────────────────────┐
│ 1. Import & Load     │  Read CSV into pandas DataFrame
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ 2. EDA               │  .describe() · .info() · missing values · duplicates
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ 3. Visualisation     │  Histograms · correlation heatmap (Seaborn)
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ 4. Preprocessing     │  Drop irrelevant columns · feature/target split
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ 5. Train/Test Split  │  80 / 20 split via sklearn
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ 6. Model Training    │  XGBClassifier with tuned hyperparameters
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ 7. Evaluation        │  Accuracy · Classification Report · Confusion Matrix
└──────────────────────┘
```
Both models use `binary:logistic` as the objective and `error` as the evaluation metric. The key difference is `max_depth`, intentionally tuned for each dataset's size and complexity.
| Hyperparameter | Cardiovascular | Diabetes | Rationale |
|---|---|---|---|
| `objective` | `binary:logistic` | `binary:logistic` | Binary classification in both cases |
| `eval_metric` | `error` | `error` | Tracks misclassification rate during boosting |
| `learning_rate` | 0.1 | 0.1 | Standard starting point; works well with low `n_estimators` |
| `max_depth` | 10 | 1 | 70 k rows can support deep trees; 768 rows cannot — deeper trees would overfit immediately |
| `n_estimators` | 10 | 10 | Kept low for both to keep training fast; increasing this is the first tuning lever |
Note: The diabetes cell originally passed `use_label_encoder=False`. This parameter was deprecated (and later removed) in recent XGBoost versions; it triggers a harmless warning but has no effect on results. Removing it cleans up the output.
Accuracy: 73.38 % on 14,000 test samples.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 — Healthy | 0.71 | 0.77 | 0.74 | 6,939 |
| 1 — Disease | 0.75 | 0.69 | 0.72 | 7,061 |
| Macro Avg | 0.73 | 0.73 | 0.73 | 14,000 |
| Weighted Avg | 0.73 | 0.73 | 0.73 | 14,000 |
The model performs evenly across both classes. Precision and recall are balanced, which is expected given the near-perfect 50/50 target split in the training data. No class dominates the predictions.
Accuracy: 73.38 % on 154 test samples.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 — No Diabetes | 0.74 | 0.94 | 0.83 | 104 |
| 1 — Diabetes | 0.71 | 0.30 | 0.42 | 50 |
| Macro Avg | 0.73 | 0.62 | 0.62 | 154 |
| Weighted Avg | 0.73 | 0.73 | 0.70 | 154 |
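The macro and weighted averages in the table follow directly from the per-class scores; a quick arithmetic check using the F1 column:

```python
# Per-class F1 and support from the diabetes report above
f1_neg, f1_pos = 0.83, 0.42
n_neg, n_pos = 104, 50

# Macro: unweighted mean; weighted: mean weighted by class support
macro_f1 = (f1_neg + f1_pos) / 2
weighted_f1 = (f1_neg * n_neg + f1_pos * n_pos) / (n_neg + n_pos)

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.62 0.7
```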
The accuracy headline is identical to the cardiovascular model, but the
per-class breakdown tells a very different story. The model correctly identifies
94 % of healthy patients but only catches 30 % of actual diabetes cases. This
is a direct consequence of max_depth = 1: a single-split tree cannot capture
the interaction effects that distinguish diabetic patients. This is discussed
further in the Takeaways section.
| Metric | Cardiovascular | Diabetes |
|---|---|---|
| Dataset size | 70,000 | 768 |
| Test set size | 14,000 | 154 |
| Target balance | 50 / 50 | 65 / 35 |
| `max_depth` | 10 | 1 |
| Accuracy | 73.38 % | 73.38 % |
| Macro F1 | 0.73 | 0.62 |
| Recall — Positive class | 0.69 | 0.30 |
| Precision — Positive class | 0.75 | 0.71 |
1. Accuracy alone does not tell the full story. Both models report 73.38 % accuracy, but their macro F1 scores diverge sharply (0.73 vs 0.62). The cardiovascular model is genuinely balanced; the diabetes model is heavily biased toward predicting the majority class. Always inspect per-class metrics, especially when class imbalance is present.
2. max_depth is the most consequential hyperparameter here.
Reducing max_depth from 10 to 1 collapses the diabetes model's ability to
learn complex decision boundaries. With only 768 rows, deeper trees risk
overfitting — but depth 1 is too aggressive a correction. A value in the
3–5 range, combined with cross-validation, would likely recover significant
recall on the positive class without memorising the training set.
3. Dataset size dictates model complexity. The cardiovascular dataset (70 k rows) can comfortably support 10 levels of tree depth. The diabetes dataset (768 rows) cannot. This project makes that trade-off explicit and visible, which is more educational than hiding it behind a single pre-tuned configuration.
4. Class imbalance changes the evaluation game. The cardiovascular target is nearly 50/50, so accuracy is a reasonable summary metric. The diabetes target is 65/35 — not extreme, but enough that a model can achieve 65 % accuracy by simply predicting "no diabetes" every single time. Metrics like F1 and recall on the minority class become essential guards against that trap.
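The trap in point 4 is easy to demonstrate with scikit-learn: on a 65/35 split, a model that always predicts the majority class scores 65 % accuracy with zero recall on the positive class.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 65 + [1] * 35   # a 65/35 split like the diabetes target
y_pred = [0] * 100             # always predict "no diabetes"

print(f"Accuracy:         {accuracy_score(y_true, y_pred):.2f}")  # 0.65
print(f"Recall (class 1): {recall_score(y_true, y_pred):.2f}")    # 0.00
```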
| Software | Minimum Version |
|---|---|
| Python | 3.10+ |
| Jupyter Notebook / JupyterLab / VS Code with Jupyter extension | Latest |
```bash
# 1. Clone the repository
git clone https://github.com/khadijja1/xgboost-classification.git
cd xgboost-classification

# 2. (Recommended) Create and activate a virtual environment
python -m venv venv
# Windows PowerShell: venv\Scripts\activate
# macOS / Linux:      source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Launch the notebook
jupyter notebook XGBoost_Classification_Problem.ipynb
```

Execute all cells top-to-bottom using Kernel → Restart & Run All.
The notebook reads datasets via relative paths. Keep the CSV files in the same directory as the `.ipynb` file.
| Library | Role |
|---|---|
| `pandas` | DataFrame I/O and manipulation |
| `numpy` | Numerical operations |
| `scikit-learn` | Train/test split, classification report, confusion matrix |
| `xgboost` | `XGBClassifier` — the core model |
| `matplotlib` | Histogram plots |
| `seaborn` | Correlation heatmaps, confusion matrix heatmaps |
- Cross-validation — Replace the single 80/20 split with 5-fold stratified CV to get more reliable accuracy and F1 estimates, especially for the small diabetes dataset.
- Hyperparameter tuning — Use `GridSearchCV` or `RandomizedSearchCV` to sweep `max_depth`, `n_estimators`, and `learning_rate` systematically.
- Class-weight handling — Apply `scale_pos_weight` in XGBoost or SMOTE resampling to address the diabetes class imbalance and recover recall on the positive class.
- Feature importance — Plot `feature_importances_` from both trained models to identify which clinical indicators drive predictions most.
- Remove deprecated parameter — Delete `use_label_encoder=False` from the diabetes `XGBClassifier` call to eliminate the warning.
This project is licensed under the MIT License — see the LICENSE file for details.
Khadija Faisal
GitHub: khadijja1
Email: khadijafaysal444@gmail.com