Text-Classification

Abstract:

In this study, three methods were used to classify emails as spam or not spam (ham): k-nearest neighbors (kNN), Naive Bayes, and an artificial neural network. Naive Bayes and the neural network gave good results; kNN failed to produce any result.

Methodology:

1) Data Set Selection:

The dataset used here is a subset of the Enron Email Dataset provided by Enron Corporation. The subset contains 33687 emails, of which 16545 are ham (not spam) and 17142 are spam. The dataset can be found in my Google Drive. Its main purpose here is to compare different approaches to text classification. The code can be run on any dataset, with one caveat: the first line (the subject line) of every email is discarded. To keep that line, remove line 18 (handle.next()) of ReadPreprocessData.py.
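As an illustration of the subject-line handling described above, here is a minimal sketch of reading one email while optionally skipping its first line. The function name `read_email_body`, the `skip_subject` flag, and the `latin-1` encoding are assumptions made for this example; this is not the code in ReadPreprocessData.py. In Python 3, `handle.next()` is written `next(handle)`.

```python
import os
import tempfile


def read_email_body(path, skip_subject=True):
    """Return the text of one email file, optionally dropping its first line.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    with open(path, encoding="latin-1") as handle:
        if skip_subject:
            # Python 3 spelling of handle.next(); the default value avoids
            # a StopIteration error on an empty file.
            next(handle, None)
        # Join the remaining lines into a single string.
        return "".join(handle)


# Tiny demonstration with a throwaway file (the contents are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Subject: meeting\nSee you at noon.\n")
    demo_path = tmp.name

body_without_subject = read_email_body(demo_path)
body_with_subject = read_email_body(demo_path, skip_subject=False)
os.remove(demo_path)
```

Passing `skip_subject=False` has the same effect as deleting the skip line: the subject stays part of the text that is later tokenized.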

2) Feature Selection:

Each word in an email is treated as a feature.
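A minimal sketch of what "each word is a feature" can look like in practice, assuming words are separated by whitespace and that the feature value is the word's count. The helper name `word_features` is hypothetical and the example sentence is made up.

```python
from collections import Counter


def word_features(text):
    """Map each whitespace-separated word to its number of occurrences."""
    return Counter(text.split())


features = word_features("free offer claim your free prize")
```

Here `features["free"]` is 2 and every other word maps to 1, so an email becomes a table of word counts.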

3) Data Pre-processing:

The following preprocessing steps were applied, in order:

Convert to String
Convert to Lowercase
Remove numbers and special characters
Remove stop words
Convert to sparse vector
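The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's implementation: the stop-word list is a tiny made-up sample, the regular expression keeps only the letters a to z, and the sparse vector is represented as a dictionary from feature index to count. The names `preprocess` and `to_sparse_vector` are hypothetical.

```python
import re

# Small illustrative stop-word list; an assumption, not the list used in the repository.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "you", "your"}


def preprocess(raw):
    """Apply steps 1 to 4 and return the remaining tokens."""
    text = str(raw)                        # 1. convert to string
    text = text.lower()                    # 2. convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # 3. remove numbers and special characters
    # 4. remove stop words
    return [word for word in text.split() if word not in STOP_WORDS]


def to_sparse_vector(tokens, vocabulary):
    """Step 5: encode tokens as {feature index: count}, storing non-zero entries only."""
    vector = {}
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] = vector.get(index, 0) + 1
    return vector


tokens = preprocess("Claim your FREE prize now!!! Call 555-0100 to win the prize.")
vocabulary = {word: i for i, word in enumerate(sorted(set(tokens)))}
sparse = to_sparse_vector(tokens, vocabulary)
```

Only words that occur in the email get an entry, which is what keeps the representation small when the vocabulary has many thousands of words.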

4) Machine Learning Algorithm:

A supervised learning approach was used: kNN, Naive Bayes, and an artificial neural network were each trained to classify the emails.
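To make one of these classifiers concrete, below is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, written with the standard library only. It assumes emails have already been turned into lists of tokens. The class name, the smoothing choice, and the four training examples are assumptions for illustration; this is not the code in NaiveBayes.py, and the toy data says nothing about results on the real dataset.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokens:
                if token in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data, for demonstration only.
train_docs = [
    ["free", "prize", "claim", "now"],
    ["win", "free", "money"],
    ["meeting", "schedule", "monday"],
    ["project", "report", "attached"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction_spam = model.predict(["free", "money", "now"])
prediction_ham = model.predict(["monday", "meeting", "report"])
```

Scores are summed in log space so that multiplying many small probabilities does not underflow on long emails.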

Results:

Naive Bayes and the artificial neural network produced good results. kNN failed to produce any results. Detailed results are given in Project-Report.docx.

Conclusion:

To conclude, both Naive Bayes and the neural network give good results. kNN is not well suited to sparse data of this kind.

Contact

You can get in touch with me on my LinkedIn Profile: Farhan Shoukat
