This project implements a spam detection system using machine learning techniques to classify email messages as either spam or ham (not spam). The model is trained on a dataset of emails, utilizing natural language processing (NLP) for feature extraction and logistic regression for classification.
With the increasing volume of emails received daily, distinguishing between spam and legitimate messages is crucial for maintaining productivity and security. This project aims to automate the classification process, helping users filter out unwanted spam emails efficiently.
The dataset used for this project is a CSV file containing email messages labeled as either "spam" or "ham." The data is processed to handle missing values, and the categories are converted to numerical values for model training.
- Data Loading: The email data is loaded from a CSV file.
- Data Preprocessing:
- Null values are replaced.
- The 'Category' labels are encoded (spam as
0
and ham as1
).
- Train-Test Split: The data is split into training (80%) and testing (20%) sets.
- Feature Extraction:
- The
TfidfVectorizer
is used to convert the text messages into numerical feature vectors.
- The
- Model Training: A logistic regression model is trained using the feature vectors from the training set.
- Model Evaluation: The model's accuracy is assessed on both training and testing datasets.
- Prediction: The trained model can predict whether a new email message is spam or ham.
- Clone the repository:
git clone https://github.com/yourusername/spam-detection.git cd spam-detection
- Install the required libraries:
pip install numpy pandas scikit-learn
3.Place your dataset mail_data.csv
in the project directory.
##Usage To run the spam detection model, execute the following command in your terminal or command prompt:
python spam_detection.py
You can modify the input_mail
variable in the script to test with different email messages.
The model gives upto 96.70% accuracy over training data and 96.59% accuracy score for testing data.
To integrate this spam detection model into a web project, follow these steps:
Export the Model: After training, you can save the model using joblib or pickle:
import joblib
joblib.dump(model, 'spam_detection_model.pkl')
Create a Flask/Django Web API: Set up a simple web server using Flask or Django to handle incoming email messages.
Load the Model in Your API:
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
model = joblib.load('spam_detection_model.pkl')
feature_extraction = joblib.load('tfidf_vectorizer.pkl') # Save your vectorizer similarly
@app.route('/predict', methods=['POST'])
def predict():
input_mail = request.json['message']
input_data_features = feature_extraction.transform([input_mail])
prediction = model.predict(input_data_features)
return jsonify({'prediction': 'Ham mail' if prediction[0] == 1 else 'Spam mail'})
Run Your Web Server:
flask run
Testing the API: Use tools like Postman to send POST requests with email messages to your API and receive predictions.
Spam email detection is a crucial task in email security, as it helps users avoid unsolicited and potentially harmful emails. In this project, we developed a machine learning model to automatically classify emails as spam or legitimate (ham). By leveraging a labelled dataset and employing preprocessing techniques like text cleaning and numerical representation, we trained and evaluated several machine learning algorithms. Despite the inherent challenges such as dataset quality and the presence of false positives/negatives, our model demonstrated promising performance. Through optimization and fine-tuning, we were able to achieve better results, albeit with some limitations. Overall, this project underscores the significance of machine learning in combating spam emails and highlights the continuous effort required to enhance email security and user experience.