This project implements a neural network-based SMS spam classification system using TensorFlow and Keras. The goal is to accurately classify SMS messages as either "ham" (normal) or "spam" (unwanted advertisements or messages from companies).
The project follows these main steps:
- Data Preprocessing
- Model Creation
- Model Training
- Message Prediction
Data preprocessing:
- The dataset is loaded from TSV files containing labeled SMS messages.
- Text data is cleaned and tokenized using NLTK.
- Messages are converted to sequences of word indices and padded to ensure uniform length.
- Labels are converted to numerical values (0 for ham, 1 for spam).
Key concepts:
- Tokenization
- Sequence padding
- Text normalization
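The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: a regex tokenizer stands in for NLTK so the snippet runs without downloading tokenizer data, the `label<TAB>text` column layout is assumed, and the helper names (`load_tsv`, `tokenize`, `build_vocab`, `encode`, `pad`), the reserved indices, and right-side padding are all illustrative choices.

```python
import csv
import io
import re

# Two made-up rows standing in for a TSV file; the label<TAB>text layout is an assumption.
SAMPLE_TSV = (
    "ham\tAre we still on for lunch today?\n"
    "spam\tWIN a FREE prize now! Call 0800 123 456\n"
)

LABEL_TO_ID = {"ham": 0, "spam": 1}
PAD_ID, OOV_ID = 0, 1  # reserved indices for padding and unknown words (assumed)


def load_tsv(handle):
    """Read (label, text) pairs from a tab-separated file object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def tokenize(text):
    """Normalize to lowercase and split into alphanumeric tokens (regex stand-in for NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(token_lists):
    """Assign each distinct word an index, starting after the reserved ids."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab


def encode(tokens, vocab):
    """Map tokens to word indices, using OOV_ID for words not in the vocabulary."""
    return [vocab.get(tok, OOV_ID) for tok in tokens]


def pad(seq, max_len):
    """Truncate or right-pad a sequence with PAD_ID to exactly max_len items."""
    return seq[:max_len] + [PAD_ID] * max(0, max_len - len(seq))


rows = load_tsv(io.StringIO(SAMPLE_TSV))
tokens = [tokenize(text) for _, text in rows]
vocab = build_vocab(tokens)
X = [pad(encode(t, vocab), max_len=8) for t in tokens]
y = [LABEL_TO_ID[label] for label, _ in rows]
```

Here the seven-word ham message is padded with one trailing zero, while the nine-token spam message is truncated to eight indices, so every row of `X` has the same length.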
A neural network is built using TensorFlow/Keras with the following architecture:
- Embedding layer for learning word representations
- LSTM layers for sequence processing
- Dense layers with ReLU activation
- Dropout for regularization
- Final Dense layer with sigmoid activation for binary classification
Key concepts:
- Word embeddings
- Recurrent Neural Networks (LSTM)
- Dropout regularization
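One way the architecture listed above might look in Keras is sketched below. The vocabulary size, sequence length, layer widths, dropout rate, and the choice of two stacked LSTM layers with a single hidden Dense layer are assumptions made for illustration; the project's actual values are not stated here. The loss and optimizer follow the ones named in the training step.

```python
import numpy as np
import tensorflow as tf

# All sizes below are illustrative assumptions, not values taken from the project.
VOCAB_SIZE = 5000  # number of distinct word indices, including padding
MAX_LEN = 50       # padded sequence length
EMBED_DIM = 32     # size of each learned word vector


def build_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LEN,)),
        # Maps each word index to a trainable dense vector.
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        # The first LSTM returns the full sequence so the second LSTM can consume it.
        tf.keras.layers.LSTM(32, return_sequences=True),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dense(16, activation="relu"),
        # Randomly zeroes half of the activations during training only.
        tf.keras.layers.Dropout(0.5),
        # Single sigmoid unit: the output is the estimated probability of spam.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_model()

# Smoke test on a dummy batch of two all-padding sequences.
dummy_batch = np.zeros((2, MAX_LEN), dtype="int32")
probs = model.predict(dummy_batch, verbose=0)
```

The untrained model already returns one value between 0 and 1 per message, which is the shape a prediction helper would rely on after training.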
Model training:
- Data is split into training and validation sets.
- The model is trained using binary cross-entropy loss and the Adam optimizer.
- Class weights are applied to handle potential class imbalance.
- Training progress is monitored using accuracy and loss metrics.
Key concepts:
- Train-validation split
- Loss functions
- Optimization algorithms
- Class weighting
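The split and class-weighting ideas can be illustrated with a small pure-Python sketch. How the project actually computes its weights is not stated; the version below assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, and the function names, the 20% validation fraction, and the toy labels are hypothetical.

```python
import random
from collections import Counter


def train_val_split(samples, labels, val_fraction=0.2, seed=0):
    """Shuffle indices and hold out a fraction of them for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 8 ham (0) and 2 spam (1), mimicking an imbalanced dataset.
labels = [0] * 8 + [1] * 2
samples = [f"message {i}" for i in range(10)]

(train_x, train_y), (val_x, val_y) = train_val_split(samples, labels)

# Computed on the full toy label list to keep the numbers easy to check.
class_weight = balanced_class_weights(labels)
# class_weight == {0: 0.625, 1: 2.5}
```

With these weights, an error on a spam message counts four times as much as an error on a ham message. A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`.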
A `predict_message` function is implemented to:
- Preprocess input messages
- Use the trained model for prediction
- Return the spam probability and corresponding label
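A function along these lines might look like the sketch below. The signature, the 0.5 threshold, the padded length, and the regex-based `preprocess` helper are assumptions, and `StubModel` is a made-up stand-in for the trained network so the example runs without TensorFlow.

```python
import re

MAX_LEN = 6      # assumed padded length; must match what the model was trained with
THRESHOLD = 0.5  # assumed decision threshold


def preprocess(text, vocab, max_len=MAX_LEN):
    """Lowercase, tokenize, map words to indices (1 = unknown), pad with 0."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


def predict_message(text, model, vocab):
    """Return (spam_probability, label) for a single message."""
    batch = [preprocess(text, vocab)]         # the model expects a batch of sequences
    prob = float(model.predict(batch)[0][0])  # first sample, single sigmoid output
    return prob, "spam" if prob >= THRESHOLD else "ham"


class StubModel:
    """Hypothetical stand-in for a trained model, used only to make this runnable."""

    def predict(self, batch):
        # Pretends that word index 2 ("free") signals spam.
        return [[0.9 if 2 in seq else 0.1] for seq in batch]


vocab = {"free": 2, "prize": 3, "lunch": 4}
result_spam = predict_message("FREE prize inside", StubModel(), vocab)
result_ham = predict_message("Lunch at noon?", StubModel(), vocab)
# result_spam == (0.9, "spam"); result_ham == (0.1, "ham")
```

With a real Keras model, the batch would typically be converted to a NumPy array before calling `model.predict`, and the preprocessing must reuse the vocabulary and padded length from training.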
The model is evaluated on a test set, and performance metrics such as accuracy, precision, recall, and F1-score can be calculated.
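As a reference for how these metrics relate to each other, here is a small sketch that computes them from true and predicted labels, treating spam (1) as the positive class. The function name and the toy label lists are illustrative and are not results from the project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 spam and 5 ham messages, with one missed spam and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.75 while precision, recall, and F1 are all 2/3, which shows why accuracy alone can look flattering on an imbalanced dataset.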
Possible improvements:
- Experiment with different model architectures (e.g., CNN, Transformer)
- Use advanced text preprocessing techniques (lemmatization, stemming)
- Implement data augmentation for text data
- Explore ensemble methods
This project demonstrates how natural language processing and deep learning techniques can be applied to the real-world problem of SMS spam classification.