This notebook covers the initialization of four common classification models, training on the iris dataset, evaluation of the models, and hyperparameter tuning to optimise three of those models.
The iris dataset is a small, classic classification dataset, perfect for beginner-friendly classification tasks. It contains 150 samples of iris flowers belonging to three species:
- Setosa
- Versicolor
- Virginica
Each sample has 4 features (all numerical and continuous):
| Feature | Description |
|---|---|
| sepal length (cm) | Length of the sepal |
| sepal width (cm) | Width of the sepal |
| petal length (cm) | Length of the petal |
| petal width (cm) | Width of the petal |
Each flower (row) is labeled with a target value:
0 = setosa, 1 = versicolor, 2 = virginica
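As a minimal sketch of loading the data (assuming scikit-learn is installed), the dataset ships with the library:

```python
from sklearn.datasets import load_iris

# Load the built-in iris dataset
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)              # (150, 4) -- 150 samples, 4 features
print(iris.feature_names)   # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```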
The four classification models used (initialized in the sketch after this list) are:

- Random Forest: A random forest classifier builds multiple decision trees during training and combines their predictions to classify data. It is an ensemble method, meaning it leverages the collective predictions of multiple models (the decision trees) to make more accurate and reliable predictions than a single decision tree could achieve.
- Support Vector Machine (SVM): An SVM is a supervised machine learning algorithm used for classification and regression. It works by finding the optimal hyperplane that separates data points into different classes, maximizing the margin between them.
- K-Nearest Neighbors (KNN): KNN is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. It is one of the simplest and most popular supervised algorithms, used for both classification and regression.
- Naïve Bayes: A Naïve Bayes classifier calculates the probability of a given instance belonging to a particular class based on the probabilities of its features. It assumes that the presence or absence of each feature is independent of the other features, which simplifies the calculations.
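A minimal sketch of initializing these four classifiers with scikit-learn, mostly with default parameters. The `GaussianNB` choice for Naïve Bayes is an assumption that suits the continuous features, and `random_state=42` mirrors the values reported later in this notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# One instance of each classifier; random_state fixed for reproducibility
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),  # Gaussian variant: all features are continuous
}
```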
We saw in the notebook that all the models gave good results with the default parameters provided by scikit-learn, without any hyperparameter tuning. Since the iris dataset is small (150 records), all four models reached an accuracy of 97% on the test set, which had 30 samples.
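A sketch of the train-and-evaluate loop, assuming the `models` dict from the previous sketch; an 80/20 split is what leaves 30 of the 150 samples for testing, though the exact split settings here are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# An 80/20 split keeps 30 of the 150 samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit each model and report its accuracy on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")
```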
- Hyperparameter tuning is the process of finding the optimal set of hyperparameters for a machine learning model. Hyperparameters are settings chosen before the learning process begins that influence how the model learns.
- To tune the Random Forest, SVM, and KNN models, we used `GridSearchCV` and `RandomizedSearchCV` to select the optimal hyperparameters for this dataset (a sketch of the setup appears after this list).
- For Random Forest, the best hyperparameters were: `RandomForestClassifier(n_estimators=150, random_state=42)`
- For SVM, the optimal hyperparameters were: `SVC(C=5.908361216819946, gamma='auto', kernel='linear', probability=True, random_state=42)`
- For KNN, the best hyperparameters were: `KNeighborsClassifier(metric='euclidean', n_neighbors=9, weights='distance')`

The difference hyperparameter tuning makes would be more visible on a large, noisy dataset. Since the iris dataset is small and not noisy, the default parameters and the optimized parameters yielded the same results.
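For reference, a hedged sketch of how the two kinds of search might be set up; the parameter grids and distributions below are illustrative assumptions, not the exact ones from the notebook:

```python
from scipy.stats import loguniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Exhaustive grid search over a small Random Forest grid
rf_grid = {"n_estimators": [50, 100, 150, 200], "max_depth": [None, 3, 5]}
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42), rf_grid, cv=5, scoring="accuracy"
)
rf_search.fit(X_train, y_train)
print(rf_search.best_estimator_)

# Randomized search for SVM: C sampled from a log-uniform distribution,
# which is how a non-round value like C=5.908... can be found
svm_dist = {
    "C": loguniform(1e-1, 1e2),
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
svm_search = RandomizedSearchCV(
    SVC(probability=True, random_state=42), svm_dist,
    n_iter=20, cv=5, scoring="accuracy", random_state=42,
)
svm_search.fit(X_train, y_train)
print(svm_search.best_estimator_)
```

`GridSearchCV` tries every combination in the grid, which is feasible for small grids; `RandomizedSearchCV` samples a fixed number of candidates from distributions, which scales better when a hyperparameter like `C` is continuous.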