In this assignment, you will be implementing the K-Means algorithm from scratch. K-Means is a fundamental unsupervised learning algorithm that partitions data into K distinct clusters based on a distance metric.
You will be working with the classic Iris dataset to train your implementation and an extended Iris dataset to test it.
Additionally, you will be responsible for finding out how many clusters there actually are (no googling the answer of course ]: )!
For this assignment, we will be:
- Implementing KMeans clustering from scratch
- Using the algorithm to cluster the classic Iris dataset
- Creating visualizations to understand cluster performance
- Using the elbow method to determine optimal cluster numbers
The original Iris dataset contains 4 recorded features of the iris flower:
- Sepal length
- Sepal width
- Petal length
- Petal width
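If you want to take a quick look at these features yourself, one option (assuming scikit-learn is available; your starter files may load the data differently) is:

```python
from sklearn.datasets import load_iris

# Load the classic Iris dataset bundled with scikit-learn
iris = load_iris()
print(iris.feature_names)  # sepal length/width and petal length/width, in cm
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
```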
KMeans clustering works by:
- Randomly initializing K centroids
- Assigning points to the nearest centroid
- Updating centroid positions based on the mean of assigned points
- Repeating the assignment and update steps until convergence or until a maximum number of iterations is reached
The algorithm uses distance metrics (typically Euclidean) to measure the similarity between points and centroids.
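As a rough illustration of a single assignment-and-update step (a minimal NumPy sketch, not the required structure of your class; the function and variable names are just for illustration):

```python
import numpy as np

def assign_and_update(X, centroids):
    """One KMeans iteration: assign every point to its nearest centroid
    (Euclidean distance), then move each centroid to the mean of its
    assigned points."""
    # Pairwise Euclidean distances between points and centroids: shape (n, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # index of the nearest centroid per point
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids
```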
Once you've successfully created your KMeans class, initialize it, fit and predict on the extended Iris dataset, choose a scoring method, and plot the results!
Note: Make sure you import the scoring method you chose.
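Your exact class interface comes from the starter code, but the overall flow might look roughly like this (the class name `KMeans`, its constructor arguments, the variable `X_extended`, and the choice of `silhouette_score` are assumptions for illustration, not requirements):

```python
from sklearn.metrics import silhouette_score  # one possible scoring method

kmeans = KMeans(k=3, max_iterations=100)  # your from-scratch class
kmeans.fit(X_extended)                    # extended Iris feature matrix
labels = kmeans.predict(X_extended)

score = silhouette_score(X_extended, labels)
print(f"Silhouette score: {score:.3f}")
# Then visualize the clusters, e.g. with the provided plot_3d_cluster helper
# (see visualization.py for its exact arguments).
```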
One method you may remember from class is the elbow technique, which helps you determine the optimal number of clusters (K) for KMeans clustering. As sketched below this list, it works by:
- Running KMeans with different values of K
- Calculating the inertia for each K
- Plotting K vs. inertia
- Finding the elbow point where increasing K yields diminishing returns
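A rough sketch of that loop (assuming a constructor argument `k`, a `fit` method, and a `get_error` helper that returns the inertia of the fitted model; adjust the names and calls to your actual interface):

```python
import matplotlib.pyplot as plt

ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(k=k)                 # your from-scratch class
    model.fit(X_extended)
    inertias.append(model.get_error())  # total within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow plot")
plt.show()
```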
Creating KMeans:
- KMeans Algorithm Explanation
- Randomizing my centroid position
- Assigning data to centroids
- What is cdist? (see the example after this list)
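For reference, `cdist` from `scipy.spatial.distance` computes all pairwise distances between two sets of points in a single call, which is convenient for the assignment step. A small example:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

# Euclidean distance from every point to every centroid: shape (3, 2)
distances = cdist(points, centroids, metric="euclidean")
nearest = distances.argmin(axis=1)  # index of the closest centroid per point
print(nearest)  # [0 1 1]
```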
Plotting your data and Finding K
Note: You do not need to modify anything in visualization.py.
As a reminder, try not to use ChatGPT to generate code; instead, have it suggest tools that may be helpful.
- Init method (5 points):
  - Correctly initializes all parameters (5)
- Fit method (25 points):
  - Correct random centroid initialization (5)
  - Correct implementation of distance (5)
  - Correct cluster assignment (5)
  - Correct centroid update mechanism (5)
  - Correct convergence checking (5)
- Predict method (5 points):
  - Correct assignment of new points to clusters (5)
- Helper Methods (10 points):
  - get_error implementation (5)
  - get_centroid implementation (5)
- Evaluation (1 point):
  - Model predicts centroids on the new dataset (1)
- Visualization (20 points):
  - Picked the proper scoring method to evaluate the KMeans model (5)
  - Utilized plot_3d_cluster to view clusters (3)
  - Generated Elbow Plot (12)
- Analysis (4 points):
  - Correct K prediction (1)
  - Valid K prediction reasoning (3)