A comprehensive comparison of kNN, Naive Bayes, and Neural Network in Text Classification

Text-Classification

Abstract:

In this study, three methods were used to classify emails as spam or not spam: k-nearest neighbor (kNN), Naive Bayes, and an Artificial Neural Network. Naive Bayes and the Neural Network gave good results; kNN failed to produce any results.

Methodology:

1) Data Set Selection:

The dataset used here is a subset of the Enron Email Dataset provided by the Enron Corporation. The subset contains 33687 emails, of which 16545 are ham (not spam) and 17142 are spam. The dataset can be found in my Google Drive. Its main purpose is to compare different approaches to text classification, but the code can be run on any dataset. The only caveat is that the first line (the subject line) of every email is removed; this behavior can be changed by removing line 18, handle.next(), of ReadPreprocessData.py.
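For illustration, a minimal Python 3 sketch of reading an email file with its subject line skipped (the actual logic lives in ReadPreprocessData.py; the file path and function name here are hypothetical, and Python 3 writes next(handle) where Python 2 wrote handle.next()):

```python
def read_email_body(path):
    """Return the email text with its first (subject) line removed."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        next(handle)  # skip the subject line; remove this call to keep it
        return handle.read()
```
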

2) Feature Selection:

Each word in an email is treated as a feature.

3) Data Pre-processing:

The following preprocessing techniques were applied, in order:

  • Convert to String
  • Convert to Lowercase
  • Remove numbers and special characters
  • Remove stop words
  • Convert to sparse vector
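The steps above can be sketched as follows. This is a minimal illustration, not the project's code: the stop-word set is a tiny sample rather than the list actually used, and a Counter of word frequencies stands in for the sparse vector representation.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (the project presumably used a full list).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}

def preprocess(raw):
    text = str(raw).lower()                # steps 1-2: convert to string, lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # step 3: drop numbers/special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # step 4: stop words
    return Counter(tokens)                 # step 5: sparse word-count representation
```
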

4) Machine Learning Algorithm:

Since a supervised learning approach was used, kNN, Naive Bayes, and an Artificial Neural Network were trained as classifiers.
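As an illustration of one of the three classifiers, here is a from-scratch sketch of multinomial Naive Bayes over bag-of-words counts with Laplace smoothing; it is not the project's implementation, and the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over word-count dictionaries."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> aggregated word counts
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word, count in doc.items():
                # Laplace smoothing so unseen words don't zero out the likelihood
                p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                lp += count * math.log(p)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label
```
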

Results:

Naive Bayes and the Artificial Neural Network produced good results; kNN failed to produce any results. Detailed results are given in Project-Report.

Conclusion:

To conclude, both Naive Bayes and the Neural Network give good results, while kNN is unsuitable for high-dimensional sparse data such as bag-of-words vectors.

Contact

You can get in touch with me on my LinkedIn Profile: Farhan Shoukat

License

MIT Copyright (c) 2018 Farhan Shoukat