In this study, three methods were used to classify emails as spam or not spam (ham): k-nearest neighbors (kNN), Naive Bayes, and an artificial neural network. Naive Bayes and the neural network gave good results; kNN failed to produce any result.
The dataset used here is a subset of the Enron Email Dataset provided by Enron Corporation. The subset contains 33687 emails, of which 16545 are ham (not spam) and 17142 are spam. The dataset can be found in my Google Drive. Its main purpose is to compare different approaches to text classification. The code can be run on any dataset; the only dataset-specific behavior is that the first line (the subject line) of every email is dropped. To keep the subject line, remove line 18, handle.next(), from ReadPreprocessData.py.
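As a rough illustration of the subject-line behavior described above, the sketch below reads an email from a file-like object and optionally drops its first line. The function name read_body and the skip_subject flag are hypothetical and are not taken from ReadPreprocessData.py. Note that handle.next() is the Python 2 spelling; in Python 3 the equivalent call is next(handle).

```python
import io

def read_body(handle, skip_subject=True):
    """Return the email text, optionally dropping the first (subject) line.

    Hypothetical helper for illustration; not the repository's code.
    """
    if skip_subject:
        # Python 3 equivalent of handle.next(); the default of None
        # avoids a StopIteration error on an empty file.
        next(handle, None)
    # Keep iterating over the same handle to collect the remaining lines.
    return "".join(handle)

# In-memory stand-in for an email file.
sample = io.StringIO("Subject: cheap meds\nBuy now!\nLimited offer.\n")
body = read_body(sample)

sample_with_subject = io.StringIO("Subject: cheap meds\nBuy now!\n")
full_text = read_body(sample_with_subject, skip_subject=False)
```

Passing skip_subject=False here plays the same role as removing the handle.next() line: the subject stays in the text that is later tokenized.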
Each word in an email is treated as a feature.
The following preprocessing steps were applied, in order:
- Convert to String
- Convert to Lowercase
- Remove numbers and special characters
- Remove stop words
- Convert to sparse vector
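
The preprocessing steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code from this repository: the stop-word list is a tiny hypothetical one, the names preprocess, build_vocabulary, and to_sparse_vector are invented for the example, and the sparse vector is represented as a dictionary of non-zero counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "you"}

def preprocess(raw):
    """Apply the first four steps and return a list of tokens."""
    text = str(raw)                         # 1. convert to string (assumes raw is text-like)
    text = text.lower()                     # 2. convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # 3. remove numbers and special characters
    return [t for t in text.split() if t not in STOP_WORDS]  # 4. remove stop words

def build_vocabulary(token_lists):
    """Assign each distinct word a feature index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_sparse_vector(tokens, vocabulary):
    """5. Map tokens to {feature_index: count}, storing only non-zero entries."""
    counts = Counter(tokens)
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}

emails = ["Win $1000 NOW!!! Click the link", "Meeting at 10 for the budget review"]
token_lists = [preprocess(e) for e in emails]
vocabulary = build_vocabulary(token_lists)
vectors = [to_sparse_vector(t, vocabulary) for t in token_lists]
```

Because each email contains only a small fraction of the vocabulary, storing just the non-zero counts keeps the feature vectors compact.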
Because this is a supervised learning problem, kNN, Naive Bayes, and an artificial neural network were used as classifiers.
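To make one of the three classifiers concrete, here is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing over word tokens. It is an assumption-laden toy, not the implementation used in this project: the class name TinyNaiveBayes and the four training emails are made up for illustration, and a real experiment would more likely rely on an established library.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)       # emails per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data: (tokens, label) pairs.
train = [
    (["win", "cash", "now"], "spam"),
    (["cheap", "meds", "now"], "spam"),
    (["meeting", "budget", "review"], "ham"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]
model = TinyNaiveBayes().fit([t for t, _ in train], [y for _, y in train])
prediction = model.predict(["win", "cheap", "cash"])
```

The smoothing term keeps a word that never appeared in one class from driving that class's probability to zero, which matters when most words are absent from most emails.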
Naive Bayes and the artificial neural network produced good results; kNN failed to produce any. Detailed results are given in the Project-Report.
To conclude, both Naive Bayes and the neural network give good results, while kNN is not well suited to sparse data.
You can get in touch with me on my LinkedIn Profile: Farhan Shoukat
MIT License. Copyright (c) 2018 Farhan Shoukat