PuruSinghvi/Spam-Email-Classifier

Spam Email Classifier

A program that can classify emails as spam or not spam using machine learning algorithms.
This project was made during the Compozent internship in Machine Learning and Artificial Intelligence.

Table of Contents
  1. Datasets Used
  2. Algorithms Used
  3. License

Datasets Used

The combined dataset was built using these two datasets:

  1. 2007 TREC Public Spam Corpus
    File name: email_text.csv
    Original link: https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
    Preprocessed download link: https://www.kaggle.com/datasets/bayes2003/emails-for-spam-or-ham-classification-trec-2007

  2. Enron-Spam Dataset
    File name: enron_spam_data.csv
    Original link: https://www2.aueb.gr/users/ion/data/enron-spam/
    Preprocessed download link: https://github.com/MWiechmann/enron_spam_data/

Algorithms Used

TF-IDF Vectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a common weighting scheme that turns text into numeric feature vectors on which a machine learning model can be trained.
A CountVectorizer only counts how many times each word appears in a document, so the most frequent words dominate the representation. Rare words that could help separate the classes end up carrying little weight.
A TF-IDF Vectorizer instead scales each count by how rare the word is across the whole corpus, so words that are distinctive for a document, and therefore more likely to indicate spam or non-spam content, receive more weight.
For this reason, a TF-IDF Vectorizer is often preferred for spam email classification: it captures the relative importance of words rather than their raw frequency.
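
The difference between raw counts and TF-IDF weights can be illustrated with a minimal, standard-library sketch. It uses the textbook formula tf × log(N / df); this is an illustration, not this project's code, and library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization, so their exact values differ. The three sample documents and the helper names `term_counts` and `tfidf` are made up for this example.

```python
import math
from collections import Counter


def term_counts(docs):
    """Raw bag-of-words counts, one Counter per document."""
    return [Counter(doc.lower().split()) for doc in docs]


def tfidf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = term_counts(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return vectors


# Hypothetical toy corpus: "the" and "is" occur in every document.
docs = [
    "the meeting is at noon",
    "the report is attached",
    "the prize is free claim the prize",
]

counts = term_counts(docs)
vectors = tfidf(docs)

# Raw counts rate "the" and "prize" equally in the third document,
# while TF-IDF gives "the" zero weight because it occurs everywhere.
print(counts[2]["the"], counts[2]["prize"])
print(vectors[2]["the"], vectors[2]["prize"])
```

In the third document, "the" and "prize" have the same raw count, but only "prize" keeps a non-zero TF-IDF weight, because "the" occurs in every document.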

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. SVMs find the decision boundary that separates the classes with the largest margin, and with kernel functions they can handle non-linear problems as well as linear ones.
SVMs are widely used in spam filtering to distinguish legitimate emails from spam. They tend to achieve high classification accuracy on high-dimensional, sparse text features such as TF-IDF vectors, and the soft-margin formulation tolerates some noise and outliers in the data.
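
To make the idea concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the L2-regularized hinge loss, using only the standard library. It is not this project's implementation: it omits the bias term, uses raw word counts where a real pipeline would use TF-IDF weights, and the four training messages and the hyperparameters `epochs`, `lr` and `lam` are assumptions chosen for illustration. In practice a library class such as scikit-learn's LinearSVC or SVC would normally be used instead.

```python
from collections import Counter


def featurize(text):
    """Sparse bag-of-words vector: term -> count."""
    return Counter(text.lower().split())


def score(weights, features):
    """Dot product w . x over the terms present in the message."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items())


def train_linear_svm(messages, labels, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (no bias term)."""
    weights = {}
    for _ in range(epochs):
        for text, label in zip(messages, labels):
            features = featurize(text)
            y = 1.0 if label == "spam" else -1.0
            margin = y * score(weights, features)
            # Regularization step: shrink every weight toward zero.
            for term in weights:
                weights[term] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1.0:
                for term, value in features.items():
                    weights[term] = weights.get(term, 0.0) + lr * y * value
    return weights


def predict(weights, text):
    """Classify by the sign of the decision function."""
    return "spam" if score(weights, featurize(text)) > 0 else "ham"


# Hypothetical toy training set, for illustration only.
messages = [
    "win a free prize now",
    "free money click now",
    "meeting schedule for tomorrow",
    "project meeting notes attached",
]
labels = ["spam", "spam", "ham", "ham"]

weights = train_linear_svm(messages, labels)
print(predict(weights, "free prize now"))
print(predict(weights, "project meeting tomorrow"))
```

Words seen only in spam end up with positive weights and words seen only in legitimate mail with negative weights, so the sign of the score decides the class. A message made up entirely of unseen words scores zero and falls back to "ham" in this sketch.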

License

Distributed under the MIT License. See LICENSE for more information.
