A fun project exploring different machine learning models' ability to classify data in a Yin-Yang shape pattern. This project implements and compares various classification models to understand their performance on this geometrically interesting dataset.
The Yin-Yang dataset presents a visually striking, non-linear classification challenge with intertwined class regions. Its nested, curved boundaries make it an excellent benchmark for comparing model expressiveness.
I also came across an interesting paper titled "The Yin-Yang Dataset", which introduces a compact and balanced dataset designed to support research in biologically plausible error backpropagation and deep learning within spiking neural networks.
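For intuition, here is a rough numpy sketch of how a three-class yin-yang-style dataset (yin, yang, dots) can be sampled. This is an illustrative approximation only; the geometry and class balancing in the paper's reference implementation differ in details.

```python
import numpy as np

def yin_yang(n=1000, r=1.0, seed=0):
    """Sample n points uniformly in a disc of radius r and label them
    yin (0), yang (1) or dot (2) by a simple yin-yang geometry.
    Rough sketch only -- not the paper's reference implementation."""
    rng = np.random.default_rng(seed)
    pts, labels = [], []
    while len(pts) < n:
        x, y = rng.uniform(-r, r, size=2)
        if x * x + y * y > r * r:
            continue  # reject points outside the big circle
        d_up = np.hypot(x, y - r / 2)   # distance to upper small circle
        d_low = np.hypot(x, y + r / 2)  # distance to lower small circle
        if d_up < r / 6 or d_low < r / 6:
            label = 2                    # one of the two dots
        elif d_low < r / 2:
            label = 1                    # lower lobe of the swirl
        elif d_up < r / 2:
            label = 0                    # upper lobe of the swirl
        else:
            label = 1 if x > 0 else 0    # left/right halves
        pts.append((x, y))
        labels.append(label)
    return np.array(pts), np.array(labels)
```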
it ain't much but it's honest work
- Configuration: 50 trees, varying `max_depth` from 1 to 9
- Performance: Captures most points of the major classes at lower depths, but learns the complete decision boundaries only at depth 9
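A minimal scikit-learn sketch of this depth sweep. `make_circles` stands in for the Yin-Yang points here, since the project's own generator isn't shown:

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier

# Nonlinear two-class stand-in for the Yin-Yang data (illustrative only).
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

scores = {}
for depth in range(1, 10):  # sweep max_depth from 1 to 9
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                 random_state=0).fit(X, y)
    scores[depth] = clf.score(X, y)  # training accuracy per depth
```

Deeper trees carve out the curved regions; at depth 1 the forest can only combine axis-aligned stumps.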
This bad boy can fit so many f**king classes in it.
- Configuration: 50 estimators, varying `max_depth` from 1 to 3
- Performance: Shows solid performance even at low depths due to gradient boosting’s ability to combine weak learners.
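The same sweep for gradient boosting, again with `make_circles` as an illustrative stand-in dataset:

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

gb_scores = {}
for depth in (1, 2, 3):  # shallow trees as weak learners
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=depth,
                                     random_state=0).fit(X, y)
    gb_scores[depth] = clf.score(X, y)
```

Because boosting fits each new tree to the residual errors of the ensemble so far, even depth-1 stumps accumulate into a usable non-linear boundary.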
MLP with Single Hidden Layer
- Hidden Units: 3 to 18
- Performance: Starts with baseline performance at low hidden units but improves with more neurons. Still, single-layer MLPs struggle to perfectly model the Yin-Yang’s nested, twisting structure.
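A sketch of the hidden-unit sweep (endpoints only), using `make_circles` as a stand-in for the Yin-Yang points:

```python
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

mlp_scores = {}
for units in (3, 18):  # endpoints of the 3..18 sweep
    clf = MLPClassifier(hidden_layer_sizes=(units,), max_iter=3000,
                        random_state=0).fit(X, y)
    mlp_scores[units] = clf.score(X, y)
```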
MLP with Two Hidden Layers
- Hidden Layers: (2,2) to (12,12)
- Performance: Learns the boundaries with fewer units per layer than the single-hidden-layer MLP. The second hidden layer allows the network to approximate more complex decision boundaries.
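The two-layer variant only changes `hidden_layer_sizes`; the largest configuration from the sweep looks like this (stand-in data again):

```python
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# (12, 12) is the upper end of the (2,2)..(12,12) sweep described above.
clf = MLPClassifier(hidden_layer_sizes=(12, 12), max_iter=3000,
                    random_state=0).fit(X, y)
two_layer_score = clf.score(X, y)
```

Stacking layers lets the second layer combine the first layer's half-plane features into curved, nested regions.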
- Configuration: SVM with `rbf`, `linear`, `poly` and `sigmoid` kernels, varying the inverse regularization parameter `C` between 0.1, 1 and 10.
- Performance:
- RBF kernel captures the curved boundaries best.
- Linear and poly kernels underperform due to their limited flexibility.
- Sigmoid kernel gives unstable results in this context.
Ew ... brother ew...what's that brother
- Very high training time for kernels like RBF.
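The kernel/`C` grid can be sketched as below; `make_circles` again stands in for the Yin-Yang points, and neatly shows why the linear kernel struggles on concentric class regions:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

svm_scores = {}
for kernel in ("rbf", "linear", "poly", "sigmoid"):
    for C in (0.1, 1, 10):  # inverse regularization strength
        clf = SVC(kernel=kernel, C=C).fit(X, y)
        svm_scores[(kernel, C)] = clf.score(X, y)
```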
- Neighbors: 1 to 3
- Performance: Despite being simple, KNN performs surprisingly well on this dataset due to its instance-based nature. It handles the swirls of the Yin-Yang reasonably well.
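The neighbor sweep is a one-liner per setting (stand-in data as before); note that 1-NN always scores perfectly on its own training set, since each point is its own nearest neighbor:

```python
from sklearn.datasets import make_circles
from sklearn.neighbors import KNeighborsClassifier

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

knn_scores = {}
for k in (1, 2, 3):  # number of neighbors
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    knn_scores[k] = clf.score(X, y)
```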
- Performance: The assumption that x and y contribute independently to the probability of a class breaks down on this dataset.
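The failure mode is easy to demonstrate on XOR-style data, where each coordinate is uninformative on its own but their interaction determines the class (illustrative stand-in data, not the Yin-Yang points):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Class depends on the sign of x*y: neither feature alone says anything,
# which is exactly where the naive independence assumption collapses.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

nb_acc = GaussianNB().fit(X, y).score(X, y)  # hovers near chance level
```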
Clustering algorithms are not suited for this task/dataset, but we still visualize their behaviour below. Clusters are assigned to labels using the Hungarian algorithm.
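The cluster-to-label matching can be done with `scipy.optimize.linear_sum_assignment`, which solves the Hungarian assignment problem; a minimal sketch (helper name `match_clusters` is our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(y_true, y_pred):
    """Relabel cluster ids so the confusion-matrix diagonal (i.e. the
    number of agreements with y_true) is maximised."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1            # rows: cluster ids, cols: true labels
    rows, cols = linear_sum_assignment(-cost)  # negate to maximise
    mapping = dict(zip(rows, cols))
    return np.array([mapping[p] for p in y_pred])
```

Without this step, cluster accuracy would depend on the arbitrary order in which the algorithm numbers its clusters.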
- Configuration: Varying the number of clusters from 3 to 7 with `k-means++` initialization.
- Performance: A clustering algorithm is not suitable for a highly non-linear classification problem.
Why am I even here?
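The cluster-count sweep, sketched on the usual stand-in data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

for k in range(3, 8):  # number of clusters from 3 to 7
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=0).fit(X)
    # km.labels_ can then be mapped to class labels via the Hungarian step
```

K-means partitions space into convex cells, so no choice of k recovers intertwined, curved class regions.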
- Epsilon: 0.1 to 0.3
- Performance: The density of points is similar throughout the dataset, making this algorithm highly unsuitable here.
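The epsilon sweep, on stand-in data as before:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

for eps in (0.1, 0.2, 0.3):  # neighborhood radius
    db = DBSCAN(eps=eps, min_samples=5).fit(X)
    # label -1 marks noise points; the rest are cluster ids
    n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    print(eps, n_clusters)
```

Because DBSCAN separates clusters by density gaps, a dataset with near-uniform density gives it nothing to latch onto: small eps shatters the data, large eps merges everything.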
See INSTALL.md for detailed installation and usage instructions.