Dataset: SMS Spam Collection Dataset (https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
Technologies/Frameworks: Python, pandas, NumPy, scikit-learn (sklearn), Streamlit (frontend framework), Matplotlib, NLTK, seaborn
- Data Cleaning
  - renaming columns
  - handling missing values
  - removing duplicates
- EDA (to understand the underlying data)
  - plotting charts (ham vs. spam)
  - word counts
- Text Pre-Processing (with the help of the nltk library)
  - Lowercasing
  - Tokenization
  - Removing special characters
  - Removing stop words and punctuation
  - Stemming
- Model Building (with the help of the sklearn library)
  - train-test split
  - TF-IDF vectorization
  - model training with various ML classifiers
- Evaluation
  - compare the models and choose the best one
- Improvement
  - retrain the model after hyperparameter tuning (here, TfidfVectorizer(max_features=3000))
- Website
  - create and open a project in an editor
  - create and code the app.py file
  - import the .pkl files and functions from the ipynb file
  - integrate them into Streamlit
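The cleaning, pre-processing, and model-building steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. Assumptions: the rows are made-up stand-ins for the Kaggle CSV, the raw column names `v1`/`v2` and the renamed `target`/`text` are assumed, a regex tokenizer and scikit-learn's built-in stop-word list are swapped in for the nltk tokenizer and stop words (so no nltk data download is needed), and `MultinomialNB` is just one possible classifier among the "various ML classifiers".

```python
import re

import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()


def transform_text(text: str) -> str:
    """Lowercase, tokenize, drop special characters and stop words, then stem."""
    # The regex keeps only alphanumeric runs, which also removes punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)


# Made-up rows for illustration; the last row is a deliberate duplicate.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
    "v2": [
        "Are we still meeting for lunch today?",
        "WINNER!! Claim your free prize now",
        "I'll call you later tonight",
        "Free entry: text WIN to claim cash",
        "Can you send me the notes from class?",
        "Urgent! Your account won a cash prize",
        "See you at the gym tomorrow morning",
        "Claim your free ringtone, reply now",
        "I'll call you later tonight",
    ],
})

# Data cleaning: rename columns, drop missing values, remove duplicates.
df = raw.rename(columns={"v1": "target", "v2": "text"}).dropna().drop_duplicates()
df["target"] = (df["target"] == "spam").astype(int)  # ham -> 0, spam -> 1
df["transformed_text"] = df["text"].apply(transform_text)

# Split first, so the vectorizer is fitted on the training messages only.
text_train, text_test, y_train, y_test = train_test_split(
    df["transformed_text"], df["target"],
    test_size=0.25, random_state=2, stratify=df["target"],
)

# TF-IDF features, capped at 3000 as in the Improvement step.
tfidf = TfidfVectorizer(max_features=3000)
X_train = tfidf.fit_transform(text_train)
X_test = tfidf.transform(text_test)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
```

On a dataset this small the scores mean nothing; on the real data, the same two metrics could be collected for each candidate classifier to support the Evaluation step.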
Local URL: http://localhost:8501
Network URL: http://192.168.1.113:8501
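The local and network URLs above are what `streamlit run app.py` prints when the app starts. The Website steps could look roughly like the sketch below, which pickles a fitted vectorizer and model (notebook side), loads them again (app.py side), and wraps prediction in a small Streamlit page. The file names `vectorizer.pkl` and `model.pkl`, the helper names, the page labels, and the tiny training set are all hypothetical; in the real app, `preprocess` would be the notebook's text-transform function rather than plain lowercasing.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical file names; use whatever names the notebook pickled.
VECTORIZER_FILE = "vectorizer.pkl"
MODEL_FILE = "model.pkl"


def save_artifacts(vectorizer, model, folder: Path) -> None:
    """Notebook side: pickle the fitted vectorizer and classifier."""
    for name, obj in ((VECTORIZER_FILE, vectorizer), (MODEL_FILE, model)):
        with open(folder / name, "wb") as f:
            pickle.dump(obj, f)


def load_artifacts(folder: Path):
    """app.py side: load the pickled vectorizer and classifier."""
    loaded = []
    for name in (VECTORIZER_FILE, MODEL_FILE):
        with open(folder / name, "rb") as f:
            loaded.append(pickle.load(f))
    return tuple(loaded)


def classify(message: str, vectorizer, model, preprocess=str.lower) -> str:
    """Pre-process, vectorize, and label one message (1 is assumed to mean spam)."""
    features = vectorizer.transform([preprocess(message)])
    return "Spam" if model.predict(features)[0] == 1 else "Not Spam"


def render_app(vectorizer, model) -> None:
    """Minimal Streamlit page; call this at module level in app.py."""
    import streamlit as st  # imported lazily so the helpers work without Streamlit

    st.title("SMS Spam Classifier")
    message = st.text_area("Enter the message")
    if st.button("Predict") and message.strip():
        st.header(classify(message, vectorizer, model))


# Round trip on made-up data: fit, pickle, reload, classify.
texts = [
    "free prize claim now",
    "win cash free entry",
    "see you at lunch",
    "call me later tonight",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
vec = TfidfVectorizer(max_features=3000)
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(vec, clf, Path(tmp))
    loaded_vec, loaded_clf = load_artifacts(Path(tmp))

print(classify("Claim your free cash prize", loaded_vec, loaded_clf))
```

In app.py, the page would then be a single call such as `render_app(*load_artifacts(Path(".")))`, started with `streamlit run app.py`.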
Example Outputs: