Three classifiers for the LingSpam dataset, using tf-idf features, written in C++ (a minimal tf-idf sketch is shown after the list below):
- k-NN classifier
- Naive Bayes classifier
- Baseline classifier
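As a reference for the feature extraction, here is a minimal tf-idf sketch. The function and variable names are illustrative, not the repository's actual API, and the log-idf weighting shown is only one common formulation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative tf-idf computation: counts[d][t] is the raw count of term t in
// document d; the result holds the tf-idf weight of each term in each document.
std::vector<std::vector<double>> tf_idf(const std::vector<std::vector<int>>& counts) {
    const std::size_t num_docs  = counts.size();
    const std::size_t num_terms = num_docs ? counts[0].size() : 0;

    // Document frequency: number of documents that contain each term.
    std::vector<std::size_t> df(num_terms, 0);
    for (const auto& doc : counts)
        for (std::size_t t = 0; t < num_terms; ++t)
            if (doc[t] > 0) ++df[t];

    std::vector<std::vector<double>> weights(num_docs, std::vector<double>(num_terms, 0.0));
    for (std::size_t d = 0; d < num_docs; ++d) {
        double total = 0.0;                      // total term count of document d
        for (int c : counts[d]) total += c;
        for (std::size_t t = 0; t < num_terms; ++t) {
            if (counts[d][t] == 0) continue;     // weight stays 0 for absent terms
            double tf  = counts[d][t] / total;   // normalised term frequency
            double idf = std::log(static_cast<double>(num_docs) / df[t]);
            weights[d][t] = tf * idf;
        }
    }
    return weights;
}
```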
The k-NN classifier uses either Euclidean distance or cosine similarity as its distance metric.
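For reference, the two metrics look roughly like this on tf-idf vectors (illustrative code, not necessarily the repository's exact implementation):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two tf-idf vectors (smaller = more similar).
double euclidean_distance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Cosine similarity between two tf-idf vectors (larger = more similar).
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;  // guard against zero vectors
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Note that the k nearest neighbours are the k smallest Euclidean distances but the k largest cosine similarities.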
The Baseline classifier is a dummy classifier that either assigns every test sample the most frequent label in the training set or assigns labels uniformly at random.
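Both strategies can be sketched in a few lines; the names below are hypothetical and only meant to illustrate the behaviour:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Majority label of the training set, assuming binary labels 0 = ham, 1 = spam.
int most_frequent_label(const std::vector<int>& train_labels) {
    long spam = std::count(train_labels.begin(), train_labels.end(), 1);
    return 2 * spam > static_cast<long>(train_labels.size()) ? 1 : 0;
}

// Predicts either the majority label or a uniformly random label for every test sample.
std::vector<int> baseline_predict(const std::vector<int>& train_labels,
                                  std::size_t num_test, bool random_strategy) {
    std::vector<int> predictions(num_test);
    const int majority = most_frequent_label(train_labels);
    for (std::size_t i = 0; i < num_test; ++i)
        predictions[i] = random_strategy ? std::rand() % 2 : majority;
    return predictions;
}
```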
The program has been tested on a Linux machine.
To compile, run the script compile.sh:
./compile.sh
- First, construct the dataset. Run:
./bin/construct_dataset.o
- Then, classify the dataset with the three classifiers. Run:
./bin/Main.o
Classifier | Accuracy | Precision | Recall | Test samples | Misclassified | TP | FP | TN | FN |
---|---|---|---|---|---|---|---|---|---|
10-NN Classifier using Euclidean distances metric | 91.35 % | 65.75 % | 100 % | 289 | 25 | 48 | 25 | 216 | 0 |
1-NN Classifier using Euclidean distances metric | 88.93 % | 60 % | 100 % | 289 | 32 | 48 | 32 | 209 | 0 |
10-NN Classifier using Cosine similarity metric | 93.08 % | 71.88 % | 95.83 % | 289 | 20 | 46 | 18 | 223 | 2 |
1-NN Classifier using Cosine similarity metric | 78.2 % | 37.29 % | 45.83 % | 289 | 63 | 22 | 37 | 204 | 26 |
Naive Bayes Classifier | 96.19 % | 93.02 % | 83.33 % | 289 | 11 | 40 | 3 | 238 | 8 |
Baseline Classifier (Most Frequent label strategy) | 83.4 % | N/A | 0 % | 289 | 48 | 0 | 0 | 241 | 48 |
Baseline Classifier (Random labels strategy) | 46.71 % | 13.7 % | 41.67 % | 289 | 154 | 20 | 126 | 115 | 28 |
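Spam is treated as the positive class, so Precision = TP / (TP + FP), Recall = TP / (TP + FN), and Accuracy = (TP + TN) / (test samples); "Misclassified" is FP + FN. The Most Frequent baseline never predicts spam, so its precision is undefined (N/A).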