ML Benchmark: Supervised & Unsupervised Learning on 22 Datasets

A comprehensive machine learning benchmark comparing 6 models (3 supervised, 3 unsupervised) across 22 real-world datasets, with rigorous cross-validation, hyperparameter tuning, and comparison against published baselines.

Overview

This project systematically evaluates and compares classic machine learning algorithms — both supervised and unsupervised — across a diverse collection of UCI datasets. It includes two handmade implementations (MLP and SOM) built from scratch using only NumPy, alongside scikit-learn models.

Results are benchmarked against the reference paper:

"Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?" — Fernández-Delgado et al.

Models

Supervised

Model	Library	Tuned Hyperparameters
Logistic Regression	scikit-learn	`C` ∈ {0.01, 0.1, 1, 10}
Random Forest	scikit-learn	`n_estimators`, `max_depth`
MLP (from scratch)	NumPy only	`hidden_size`, `learning_rate`

Unsupervised

Model	Library	Tuned Hyperparameters
K-Means	scikit-learn	`n_clusters` ∈ {2, 3, 4}
Gaussian Mixture Model	scikit-learn	`n_components` ∈ {2, 3, 4, 5}
SOM (from scratch)	NumPy only	`grid_size`, `learning_rate`

Datasets

22 datasets from the UCI Machine Learning Repository:

Dataset	Samples	Features	Classes
Iris	150	4	3
Wine	178	13	3
Waveform	5,000	21	3
Car Evaluation	1,728	6	4
Titanic	2,201	3–4	2
Ionosphere	351	34	2
Dermatology	366	34	6
Ecoli	336	7	8
Zoo	101	16	7
Page Blocks	5,473	10	5
Spambase	4,601	57	2
Acute Inflammation	120	6	2
Vertebral Column (2 & 3 classes)	310	6	2/3
Tic-Tac-Toe	958	9	2
Synthetic Control	600	60	6
Statlog Vehicle	846	18	4
Statlog Heart	270	13	2
Seeds	210	7	3
Balance Scale	625	4	3
Acute Nephritis	120	6	2
Heart (Hungarian)	294	13	2

Methodology

Preprocessing (leak-free pipeline)

Missing/infinite values → imputed with column mean (fit on train only)
Categorical features → LabelEncoder
Feature scaling → StandardScaler (fit on train, applied to val/test)
Constant columns and ID-like columns are automatically removed

Validation Strategy

80/20 train-test split (stratified)
5-fold Stratified Cross-Validation for hyperparameter tuning
Fallback strategy for rare classes (< 5 samples per class)
All preprocessing is re-fit inside each CV fold to prevent data leakage

Metrics

Supervised: Accuracy, F1-macro, Cohen's Kappa
Unsupervised: Accuracy (via majority-vote cluster mapping), F1-macro, Cohen's Kappa, Silhouette Score, Davies-Bouldin Index
For imbalanced datasets (ratio > 2): F1-macro is used as the selection criterion

Results

Average performance across all 22 datasets

Model	Accuracy	F1-macro	Cohen's κ
Logistic Regression	90.7%	87.6%	81.8%
Random Forest	90.4%	87.9%	81.7%
MLP (from scratch)	87.0%	80.0%	76.4%
GMM	78.2%	65.4%	56.8%
K-Means	75.4%	62.7%	53.6%
SOM (from scratch)	71.8%	59.1%	46.4%

Comparison with published baseline (Fernández-Delgado et al.)

Model	Paper Accuracy	Our Accuracy	Paper κ	Our κ
Logistic Regression	76.5%	90.7%	53.7%	81.8%
Random Forest	80.0%	90.4%	57.4%	81.7%
MLP	80.3%	87.0%	62.0%	76.4%

Note: The paper averages over 121 datasets. Our subset of 22 was curated, which partially explains the performance gap.

Project Structure

.
├── data/                          # Dataset folders (one per dataset)
│   ├── iris/
│   │   └── iris.arff
│   ├── wine/
│   │   └── wine.data
│   └── ...
├── mini_project_notebook.ipynb    # Main notebook
├── final_results_with_tuning.csv  # Full results table
├── best_hyperparameters.csv       # Best hyperparams per dataset/model
└── README.md

Installation

git clone https://github.com/YOUR_USERNAME/ml-benchmark
cd ml-benchmark
pip install numpy pandas scikit-learn scipy

Python 3.10+ recommended. No deep learning framework required — the MLP and SOM are implemented from scratch.

Usage

Open and run the notebook:

jupyter notebook mini_project_notebook.ipynb

The notebook is self-contained and will:

Load all 22 datasets
Run cross-validated hyperparameter tuning for all 6 models
Evaluate on held-out test sets
Save results to CSV

Key Findings

Random Forest and Logistic Regression are the strongest overall performers, exceeding published baselines by ~10 percentage points in accuracy.
MLP from scratch is competitive with scikit-learn baselines despite being a pure NumPy implementation.
Clustering models (KMeans, GMM) perform well only on datasets with clear geometric structure (Wine, Iris, Acute datasets). They fail on datasets with overlapping classes.
SOM is the weakest model overall, especially on high-dimensional or imbalanced datasets (negative silhouette scores on many datasets).
Dataset characteristics — number of classes, dimensionality, class balance — are the dominant factors in model performance differences.

Author

Yassine TAHARASTE
INSA Lyon Student

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
ProjectNotebook.ipynb		ProjectNotebook.ipynb
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ML Benchmark: Supervised & Unsupervised Learning on 22 Datasets

Table of Contents

Overview

Models

Supervised

Unsupervised

Datasets

Methodology

Preprocessing (leak-free pipeline)

Validation Strategy

Metrics

Results

Average performance across all 22 datasets

Comparison with published baseline (Fernández-Delgado et al.)

Project Structure

Installation

Usage

Key Findings

Author

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ML Benchmark: Supervised & Unsupervised Learning on 22 Datasets

Table of Contents

Overview

Models

Supervised

Unsupervised

Datasets

Methodology

Preprocessing (leak-free pipeline)

Validation Strategy

Metrics

Results

Average performance across all 22 datasets

Comparison with published baseline (Fernández-Delgado et al.)

Project Structure

Installation

Usage

Key Findings

Author

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages