This project demonstrates text classification using the Twenty Newsgroups dataset with both a pre-built Multinomial Naive Bayes model from Scikit-learn and a custom Naive Bayes implementation from scratch. The aim is to compare the performance of the custom implementation with the Scikit-learn model.
The project is divided into the following steps:
- Load and Preprocess Data:
  - Download the Twenty Newsgroups dataset using `fetch_20newsgroups` from Scikit-learn.
  - Preprocess the text data into numerical features using `TfidfVectorizer`.
- Multinomial Naive Bayes with Scikit-learn:
  - Train a Multinomial Naive Bayes model using Scikit-learn's `MultinomialNB`.
  - Evaluate the model on the test data using accuracy, precision, recall, and F1-score.
- Custom Naive Bayes Implementation:
  - Implement a Naive Bayes classifier from scratch.
  - Train the custom Naive Bayes model on the training data.
  - Evaluate the custom model on the test data using the same metrics as the Scikit-learn model.
- Comparison:
  - Compare the performance of the Scikit-learn model with the custom implementation.
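The loading, vectorizing, training, and evaluation steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `load_newsgroups` and `train_and_evaluate` are hypothetical, and the default `TfidfVectorizer` and `MultinomialNB` settings and the macro averaging of precision, recall, and F1 are assumptions.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB


def load_newsgroups():
    """Download (and cache) the predefined train and test splits."""
    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")
    return train.data, train.target, test.data, test.target


def train_and_evaluate(train_texts, train_labels, test_texts, test_labels):
    """Fit TF-IDF + MultinomialNB on the training data and score the test data."""
    # Learn the vocabulary and IDF weights from the training texts only,
    # then reuse them for the test texts.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    model = MultinomialNB()  # default smoothing (alpha=1.0), an assumption
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)

    # Macro averaging weights every class equally (an assumed choice).
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predictions, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(test_labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Calling `train_and_evaluate(*load_newsgroups())` would run the sketch on the full dataset; the first call downloads the data, so it needs network access.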
- Clone the repository:

      git clone https://github.com/dhruvpal102005/naive-bayes-text-classification.git
      cd naive-bayes-text-classification

- Install the required dependencies:

      pip install -r requirements.txt

The results section in the notebook compares the accuracy, precision, recall, and F1-score of the Scikit-learn model and the custom Naive Bayes implementation in detail. The comparison shows where the two approaches differ and how each performs.
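To give a sense of what the from-scratch side of that comparison involves, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is not the repository's implementation: the class name `SimpleMultinomialNB` is hypothetical, it works on raw whitespace-token counts rather than TF-IDF features, and the Laplace smoothing value `alpha=1.0` is an assumption. The toy texts at the bottom are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


class SimpleMultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value)

    def fit(self, texts, labels):
        self.classes_ = sorted(set(labels))
        # Per-class token counts, e.g. token_counts_["space"]["orbit"] == 2.
        self.token_counts_ = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts_[label].update(text.lower().split())
        self.vocab_ = {
            tok for c in self.classes_ for tok in self.token_counts_[c]
        }
        self.total_tokens_ = {
            c: sum(self.token_counts_[c].values()) for c in self.classes_
        }
        # Log prior: fraction of training documents in each class.
        doc_counts = Counter(labels)
        self.log_prior_ = {
            c: math.log(doc_counts[c] / len(labels)) for c in self.classes_
        }
        return self

    def _log_likelihood(self, token, c):
        # Smoothed estimate of P(token | class).
        numerator = self.token_counts_[c][token] + self.alpha
        denominator = self.total_tokens_[c] + self.alpha * len(self.vocab_)
        return math.log(numerator / denominator)

    def predict(self, texts):
        predictions = []
        for text in texts:
            # Tokens never seen in training are ignored.
            tokens = [t for t in text.lower().split() if t in self.vocab_]
            scores = {
                c: self.log_prior_[c]
                + sum(self._log_likelihood(t, c) for t in tokens)
                for c in self.classes_
            }
            predictions.append(max(scores, key=scores.get))
        return predictions


# Toy data, invented for illustration only.
toy_texts = [
    "the rocket launched into orbit",
    "nasa space shuttle orbit",
    "the goalie saved the puck",
    "hockey team won the game",
]
toy_labels = ["space", "space", "hockey", "hockey"]
toy_model = SimpleMultinomialNB().fit(toy_texts, toy_labels)
print(toy_model.predict(["space orbit rocket", "hockey puck game"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long newsgroup posts. Predictions from a classifier like this can be scored with the same accuracy, precision, recall, and F1 metrics used for the Scikit-learn model.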