Document Classifier

Overview

Document Classification with BERTopic and Streamlit is a project designed to automatically label documents based on the textual content present near key areas of interest. This repository demonstrates the development of a document classification system leveraging self-supervised learning with BERTopic and an interactive UI built using Streamlit.

Project Highlights

Self-Supervised Learning with BERTopic: Utilize BERTopic, a topic modeling algorithm, to group and classify documents based on semantic similarity.
Streamlit for Interactive UI: Provide a user-friendly interface to upload documents, view classifications, and interact with the model in real-time.
Flexibility and Adaptability: Easily extend the system to handle new, unlabeled datasets without requiring extensive retraining.
Efficient Document Labeling: Automatically classify documents into relevant categories based on context, reducing manual effort.

How it Works

Text Extraction

Text content is extracted from documents, with a focus on regions near key information areas.
Preprocessed text serves as input for further analysis.

Topic Modeling with BERTopic

BERTopic is trained on document text to generate embeddings and cluster documents based on semantic similarity.
Fine-tuning ensures that the model captures domain-specific nuances for classification.

Classification and Labeling

Trained BERTopic model assigns topic labels to documents.
Labels can be mapped to predefined categories for better interpretability.

Interactive UI

Streamlit application allows users to:
- Upload documents in PDF format.
- View extracted text and predicted labels.
- Explore clustering insights through interactive visualizations.

Prerequisites

Python: 3.x (Tested with Python 3.10.12 on Ubuntu 22.04)
Libraries: bertopic, streamlit, pandas, scikit-learn, nltk, PyPDF2
Tools:
- Streamlit for the UI
- Dependencies listed in requirements.txt

Usage

Clone the Repository:

git clone https://github.com/PavanSugreev04/Document-Classification.git
cd document-classification-bertopic

Install Dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Prepare Data:
- Place your training documents in the data/ directory.
- Preprocess text using the scripts provided.
Train BERTopic Model:
- Use train_model.py to train the BERTopic model on your dataset.
Launch Streamlit App:
```
streamlit run app.py
```
Interact with the App:
- Upload PDF documents.
- View extracted text, clusters, and topic labels.

Challenges & Remedies

Computational Constraints

Challenge: Training BERTopic on large datasets requires significant resources.
Remedy: Utilize cloud-based GPUs or pre-train on smaller datasets and fine-tune as needed.

Preprocessing Overhead

Challenge: Preprocessing large document sets can be time-intensive.
Remedy: Batch process documents and save intermediate results for reuse.

Data Quality

Challenge: Inconsistent or sparse text data affects model performance.
Remedy: Apply data augmentation and preprocessing to enhance data quality.

Hyperparameter Tuning

Challenge: Optimizing BERTopic parameters can be complex.
Remedy: Use grid search or automated tools to explore parameter configurations.

Data Source

Kaggal News DataSet

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Document Classifier

Overview

Project Highlights

How it Works

Text Extraction

Topic Modeling with BERTopic

Classification and Labeling

Interactive UI

Prerequisites

Usage

Challenges & Remedies

Computational Constraints

Preprocessing Overhead

Data Quality

Hyperparameter Tuning

Data Source

Files

README.md

Latest commit

History

README.md

File metadata and controls

Document Classifier

Overview

Project Highlights

How it Works

Text Extraction

Topic Modeling with BERTopic

Classification and Labeling

Interactive UI

Prerequisites

Usage

Challenges & Remedies

Computational Constraints

Preprocessing Overhead

Data Quality

Hyperparameter Tuning

Data Source