sanju234-san/Docs_classification

🎯 DocClassify

Advanced AI-Powered Document Classification Platform

React Vite Python FastAPI Tailwind

Scikit-learn NLTK Pandas Framer Motion

Features • Demo • Installation • Usage • API • Contributing


🚀 An intelligent document classification platform that uses machine learning to categorize documents accurately, served through a modern web interface



🌟 Overview

DocClassify is a document classification platform that uses a Support Vector Machine (SVM) with TF-IDF vectorization to categorize documents, with about 85% accuracy in the reported metrics. Designed for businesses, researchers, and organizations, it pairs a scikit-learn model served through FastAPI with a modern React user interface.

Why Choose DocClassify?

| Feature | Benefit |
|---|---|
| 🎯 High Accuracy | SVM model with TF-IDF achieves reliable document categorization |
| ⚡ Real-time Processing | Instant classification with confidence scoring |
| 📄 Multi-format Support | Handles PDF, DOCX, and TXT files seamlessly |
| 🎨 Modern UI/UX | Clean design with smooth animations and responsive layout |
| 🔧 FastAPI Backend | Scalable and fast API for document processing |
| 📊 Confidence Metrics | Detailed classification results with probability scores |

✨ Features

🧠 Intelligent Core

SVM Classification

  • Support Vector Machine algorithm
  • TF-IDF vectorization for text features
  • Label encoding for categories
  • Confidence scoring system
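
To make the TF-IDF step concrete, here is a minimal standard-library sketch of how term weights can be computed. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the function name `tfidf` and the sample documents are illustrative and not taken from the project code.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so no term gets weight 0.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so long and short documents are comparable.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted


# Hypothetical sample documents, already tokenized.
docs = [
    "invoice payment due".split(),
    "invoice total amount".split(),
    "meeting agenda notes".split(),
]
vectors = tfidf(docs)
```

In the first document, "payment" ends up with a higher weight than "invoice" because "invoice" also appears in the second document, which is the property that lets the SVM focus on distinguishing terms.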

Smart Analytics

  • Real-time classification results
  • Confidence level assessment
  • Processing time metrics
  • File format detection

🎨 Visual Experience

Design Excellence

  • Clean Interface: Modern design with intuitive navigation
  • Dark Theme: Professional aesthetic with high contrast
  • Responsive Layout: Mobile to desktop support
  • Smooth Animations: Framer Motion transitions

Interactive Elements

  • File Upload: Drag & drop interface
  • Progress Indicators: Real-time processing feedback
  • Results Display: Clear classification output
  • Error Handling: User-friendly error messages

📄 Document Processing

  • Multi-format Support: PDF, DOCX, TXT file processing
  • Text Extraction: Advanced parsing for different formats
  • Preprocessing Pipeline: Cleaning, tokenization, lemmatization
  • Single-file Processing: One document is classified per request (batch processing is on the roadmap)
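
The preprocessing pipeline listed above (cleaning, tokenization, lemmatization) could look roughly like the sketch below. It is standard-library only: the stop-word list is a tiny illustrative sample, and `naive_lemma` is a crude suffix-stripping stand-in, not real lemmatization. The project itself lists NLTK for this step, and none of these function names come from the project code.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def clean(text: str) -> str:
    """Lowercase the text and replace every non-letter run with a space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def naive_lemma(token: str) -> str:
    """Crude plural stripping; a placeholder for a proper lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Clean, tokenize on whitespace, drop stop words, then normalize tokens."""
    tokens = [tok for tok in clean(text).split() if tok not in STOP_WORDS]
    return [naive_lemma(tok) for tok in tokens]


print(preprocess("The invoices are due to the parties!"))
# ['invoice', 'due', 'party']
```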

🎥 Demo

[Screenshot: Classification Interface]

[Screenshot: Results Dashboard]

[Screenshot: File Upload Interface]

🛠 Tech Stack

Frontend Technologies

| Technology | Purpose | Version |
|---|---|---|
| React | UI Framework | 19.x |
| Vite | Build Tool | Latest |
| Tailwind | Styling | Latest |
| Framer | Animations | Latest |
| Axios | HTTP Client | Latest |
| React Router | Routing | Latest |

Backend Technologies

| Technology | Purpose | Version |
|---|---|---|
| Python | Language | 3.9+ |
| FastAPI | API Framework | Latest |
| Scikit-learn | ML Library | Latest |
| NLTK | NLP Processing | Latest |
| Pandas | Data Processing | Latest |
| PyPDF2 | PDF Processing | Latest |

📦 Installation

Prerequisites

Before you begin, ensure you have the following installed:

# Check Node.js version (v18 or higher required)
node --version

# Check Python version (3.9 or higher required)
python --version

# Check Git
git --version

Required Software: Node.js (v18 or higher), Python (3.9 or higher), and Git.

Quick Start Guide

Step 1: Clone the Repository

git clone https://github.com/yourusername/docclassify.git
cd docclassify

Step 2: Frontend Setup

# Navigate to client directory
cd client

# Install dependencies
npm install

# Start development server
npm run dev

✅ Frontend will be running at http://localhost:5173

Step 3: Backend Setup

Open a new terminal window:

# Navigate to server directory
cd server

# Create virtual environment (recommended)
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start FastAPI server
python app.py

✅ Backend API will be running at http://localhost:8000

Step 4: Verify Installation

Visit http://localhost:5173 in your browser. You should see the DocClassify landing page.

Alternative: Concurrent Execution

Run both frontend and backend simultaneously with a single command:

# From root directory
npm install
npm run dev

This requires the concurrently package to be installed and configured in the root package.json.


🚀 Usage

Making Your First Classification

1. Navigate to Classification Page

Open your browser and go to http://localhost:5173/

2. Upload a Document

Upload a document file:

Supported Formats:
├── PDF files (.pdf)
├── Word documents (.docx)
├── Text files (.txt)
└── Maximum file size: 10MB
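
A pre-upload check against these limits might be sketched as follows. The constant and function names are hypothetical, and "10MB" is read here as 10 × 1024 × 1024 bytes, which is an assumption about how the limit is enforced.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # "10MB" interpreted as 10 MiB (assumption)


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload falls outside the documented limits."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or 'none'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 10MB limit")


validate_upload("report.PDF", 2_000_000)  # passes: extension check is case-insensitive
```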

3. Process & Analyze

Click "Classify Document" to receive:

  • ✅ Document Category (Predicted class)
  • 📊 Confidence Score (0-100%)
  • 🎯 Classification Details
  • ⏱️ Processing Time

📡 API Documentation

Base URL

http://localhost:8000

Endpoints

1. Health Check

GET /

Response:

{
  "status": "healthy",
  "service": "Document Classification API"
}

2. Classify Document

POST /classify
Content-Type: multipart/form-data

Request Body:

file: [UploadFile] - The document file to classify

Response:

{
  "filename": "document.pdf",
  "prediction": "Category A",
  "confidence": 0.87,
  "message": "Classification successful"
}
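
For scripted use, the endpoint can be called without extra dependencies. The sketch below builds the multipart/form-data body by hand with the standard library; the helper names `build_multipart` and `classify` are illustrative, and it assumes the server expects the upload under the form field `file` as shown in the request body above.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(filename: str, content: bytes, field: str = "file") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def classify(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a document to /classify and return the decoded JSON response."""
    file_path = Path(path)
    body, content_type = build_multipart(file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running, `classify("document.pdf")` should return a dictionary with the `prediction` and `confidence` keys shown in the response above.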

Error Responses

{
  "detail": "Error message here",
  "status_code": 400
}

Common Status Codes:

  • 200 - Success
  • 400 - Bad Request (Invalid file or processing error)
  • 500 - Internal Server Error
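
A client can turn these responses into readable messages along the following lines. This is a sketch that assumes error bodies carry a `detail` field as in the example above; the function name `explain_error` is hypothetical.

```python
import json


def explain_error(status_code: int, raw_body: str) -> str:
    """Turn an error response into a one-line message for the user."""
    try:
        detail = json.loads(raw_body).get("detail", "No detail provided")
    except (json.JSONDecodeError, AttributeError):
        # Body was not a JSON object; fall back to the raw text.
        detail = raw_body.strip() or "No detail provided"
    if status_code == 400:
        return f"Bad request: {detail}"
    if status_code >= 500:
        return f"Server error: {detail}"
    return f"Unexpected status {status_code}: {detail}"


print(explain_error(400, '{"detail": "Unsupported file type", "status_code": 400}'))
# Bad request: Unsupported file type
```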

📂 Project Structure

docclassify/
│
├── 📁 client/                          # React Frontend
│   ├── 📁 public/                      # Static assets
│   │   ├── favicon.ico
│   │   └── logo.png
│   │
│   ├── 📁 src/
│   │   ├── 📁 components/              # Reusable components
│   │   │   ├── Footer.jsx              # Footer component
│   │   │   ├── Navbar.jsx              # Navigation bar
│   │   │   └── [Other components]
│   │   │
│   │   ├── 📁 pages/                   # Page components
│   │   │   ├── Home.jsx                # Home page
│   │   │   ├── Layout.jsx              # Layout wrapper
│   │   │   └── NotFound.jsx            # 404 page
│   │   │
│   │   ├── 📁 hooks/                   # Custom React hooks
│   │   │   └── [Custom hooks]
│   │   │
│   │   ├── 📁 utils/                   # Utility functions
│   │   │   └── [Utility files]
│   │   │
│   │   ├── App.jsx                     # Main app component
│   │   ├── main.jsx                    # Entry point
│   │   └── [Other files]
│   │
│   ├── package.json                    # Frontend dependencies
│   ├── vite.config.js                  # Vite configuration
│   └── [Other config files]
│
├── 📁 server/                          # FastAPI Backend
│   ├── 📁 models/                      # Trained ML models
│   │   ├── svm_model.pkl               # SVM model
│   │   ├── tfidf_vectorizer.pkl        # TF-IDF vectorizer
│   │   └── label_encoder.pkl           # Label encoder
│   │
│   ├── 📁 training/                    # ML training scripts
│   │   ├── eda.py                      # Exploratory data analysis
│   │   ├── preprocessing.py            # Data preprocessing
│   │   ├── model.py                    # Model training
│   │   └── test.py                     # Model testing
│   │
│   ├── app.py                          # Main FastAPI app
│   ├── requirements.txt                # Python dependencies
│   └── [Other files]
│
├── 📁 docs/                            # Documentation
│   ├── API.md                          # API documentation
│   └── [Other docs]
│
├── .gitignore                          # Git ignore rules
├── LICENSE                             # MIT License
├── README.md                           # This file
└── package.json                        # Root package (scripts)

🧪 Model Training

Training Your SVM Model

Prepare Your Dataset

Ensure your CSV file has the following columns:

text, category
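
Before training, it can help to confirm that a dataset actually has both columns. The check below is a standard-library sketch; `check_dataset` is a hypothetical helper, and the sample rows are invented for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "category"}


def check_dataset(handle) -> int:
    """Validate the CSV header and return the number of usable rows."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    header = {name.strip() for name in (reader.fieldnames or [])}
    missing = REQUIRED_COLUMNS - header
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Count only rows where both fields are non-empty.
    return sum(
        1
        for row in reader
        if (row.get("text") or "").strip() and (row.get("category") or "").strip()
    )


# Hypothetical in-memory sample; the last row has an empty text field.
sample = io.StringIO(
    "text,category\n"
    "Invoice for March services,invoice\n"
    "Agenda for Monday meeting,memo\n"
    ",memo\n"
)
usable_rows = check_dataset(sample)  # 2
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.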

Run Training Scripts

cd server/training

# Perform EDA
python eda.py --dataset ../data/your_dataset.csv

# Preprocess data
python preprocessing.py --input ../data/your_dataset.csv --output ../data/processed_data.csv

# Train model
python model.py --dataset ../data/processed_data.csv --output ../models/

# Test model
python test.py --model ../models/svm_model.pkl --data ../data/test_data.csv

Available Parameters

# EDA
--dataset       # Path to training data CSV

# Preprocessing
--input         # Input CSV file
--output        # Output processed CSV file

# Model Training
--dataset       # Path to processed data CSV
--output        # Directory to save trained models
--test-size     # Test split ratio (default: 0.2)

# Testing
--model         # Path to trained model
--data          # Path to test data
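
As an illustration of how the model training flags above could be declared, here is an `argparse` sketch. It mirrors the documented options and the 0.2 default, but it is an assumption about how `model.py` parses its arguments, not a copy of the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the training options listed in the README."""
    parser = argparse.ArgumentParser(description="Train the SVM document classifier")
    parser.add_argument("--dataset", required=True, help="Path to processed data CSV")
    parser.add_argument("--output", required=True, help="Directory to save trained models")
    parser.add_argument("--test-size", type=float, default=0.2, help="Test split ratio")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--dataset", "../data/processed_data.csv", "--output", "../models/"]
)
print(args.test_size)  # 0.2
```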

Model Performance Metrics

Current SVM Performance

| Metric | Value |
|---|---|
| Accuracy | 85.2% |
| Precision | 83.1% |
| Recall | 84.5% |
| F1-Score | 83.8% |
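
For readers who want to reproduce metrics like these on their own test split, the sketch below computes them from label lists with the standard library. The function names are illustrative, and it treats one class as positive; how the project averages across classes is not stated in this README. As a sanity check, the harmonic mean of the reported precision and recall does round to the reported F1.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(y_true: list[str], y_pred: list[str], positive: str) -> dict[str, float]:
    """Accuracy, precision, recall and F1 with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Consistency check against the reported precision (83.1%) and recall (84.5%).
print(round(f1_from(0.831, 0.845), 3))  # 0.838
```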

⚙️ Configuration

Frontend Configuration

API Endpoint (client/src/utils/api.js)

const API_BASE_URL = import.meta.env.VITE_API_URL || 'http://localhost:8000';

Backend Configuration

Server Config (server/app.py)

# API Configuration
HOST = "0.0.0.0"
PORT = 8000

# Model Configuration
MODEL_PATH = "models/svm_model.pkl"
VECTORIZER_PATH = "models/tfidf_vectorizer.pkl"
ENCODER_PATH = "models/label_encoder.pkl"

# CORS Settings
ALLOWED_ORIGINS = ["http://localhost:5173"]

Environment Variables

Create .env files:

Frontend (.env):

VITE_API_URL=http://localhost:8000

Backend (.env):

DEBUG=True
MODEL_PATH=./models
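
One way the backend could consume these variables is sketched below. It assumes the values are already present in the process environment (for example, loaded from the .env file by a tool such as python-dotenv) and that `MODEL_PATH` names the directory holding the three artifacts listed in the project structure. `load_settings` is a hypothetical helper, not part of the project code.

```python
import os
from pathlib import Path


def load_settings(env=os.environ) -> dict:
    """Read DEBUG and MODEL_PATH, falling back to the values shown above."""
    debug = env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"}
    model_dir = Path(env.get("MODEL_PATH", "./models"))
    return {
        "debug": debug,
        "model": model_dir / "svm_model.pkl",
        "vectorizer": model_dir / "tfidf_vectorizer.pkl",
        "encoder": model_dir / "label_encoder.pkl",
    }


# Pass a plain dict here so the example does not depend on the real environment.
settings = load_settings({"DEBUG": "True", "MODEL_PATH": "./models"})
```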

🎨 Customization

Changing Color Scheme

Update Theme Variables

Edit client/src/styles/theme.css:

:root {
  /* Primary Colors */
  --primary: #6366f1;
  --secondary: #8b5cf6;
  --accent: #ec4899;

  /* Background */
  --background: #0f0f0f;
  --surface: #1a1a1a;

  /* Text */
  --text-primary: #ffffff;
  --text-secondary: #a0a0b0;
}

Adjusting Model Parameters

Edit server/training/model.py:

# SVM Configuration
C = 1.0                    # Regularization parameter
kernel = 'linear'          # Kernel type
gamma = 'scale'            # Kernel coefficient (ignored when kernel = 'linear')

# TF-IDF Configuration
max_features = 5000        # Maximum features
ngram_range = (1, 2)       # N-gram range
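
To show how these parameters fit together, here is a minimal scikit-learn sketch that trains on a toy dataset and returns a label with a confidence score. The sample texts, the `predict_category` helper, and the use of `probability=True` for confidence are assumptions for illustration; the actual `model.py` may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical toy data; real training reads the text/category CSV.
texts = [
    "invoice payment due next week", "invoice total amount payable",
    "payment received for invoice", "overdue invoice reminder notice",
    "invoice number and billing address", "billing statement and payment terms",
    "meeting agenda for monday", "notes from the team meeting",
    "agenda items and action points", "schedule for the project meeting",
    "minutes of the weekly meeting", "team meeting reminder and agenda",
]
labels = ["invoice"] * 6 + ["memo"] * 6

# Encode category names as integers for the classifier.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# probability=True enables predict_proba, used here as the confidence score.
model = SVC(C=1.0, kernel="linear", gamma="scale", probability=True, random_state=0)
model.fit(X, y)


def predict_category(text: str) -> tuple[str, float]:
    """Return the most probable category name and its probability."""
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()
    label = encoder.inverse_transform([model.classes_[best]])[0]
    return label, float(probs[best])


label, confidence = predict_category("please pay the attached invoice")
```

Probability estimates from such a small sample are not reliable; the example only shows the shape of the pipeline.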

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Getting Started

  1. Fork the Repository

    # Click the 'Fork' button on GitHub
  2. Clone Your Fork

    git clone https://github.com/YOUR_USERNAME/docclassify.git
    cd docclassify
  3. Create a Branch

    git checkout -b feature/AmazingFeature
  4. Make Your Changes

    • Write clean, documented code
    • Follow existing code style
    • Add tests if applicable
  5. Commit Your Changes

    git add .
    git commit -m 'Add some AmazingFeature'
  6. Push to Your Fork

    git push origin feature/AmazingFeature
  7. Open a Pull Request

    • Go to the original repository
    • Click 'New Pull Request'
    • Describe your changes

Development Guidelines

  • ✅ Follow the existing code style and conventions
  • ✅ Write meaningful commit messages
  • ✅ Add comments for complex logic
  • ✅ Update documentation as needed
  • ✅ Test your changes thoroughly
  • ✅ Ensure all tests pass before submitting PR

Code Style

JavaScript/React:

  • Use functional components with hooks
  • Follow Airbnb JavaScript Style Guide
  • Use meaningful variable names
  • Add JSDoc comments for functions

Python:

  • Follow PEP 8 style guide
  • Use type hints
  • Add docstrings to functions
  • Keep functions small and focused

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Nimit Gupta & Sanjeevni Dhir

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

👨‍💻 Authors

Nimit Gupta
Lead Developer
Sanjeevni Dhir
Co-Developer

🙏 Acknowledgments

We would like to thank the following projects and communities:

  • Scikit-learn - Machine learning algorithms and utilities
  • NLTK - Natural language processing toolkit
  • React - UI library for building interactive interfaces
  • FastAPI - Modern, fast web framework for Python
  • Tailwind CSS - Utility-first CSS framework
  • Framer Motion - Production-ready animation library
  • Vite - Next generation frontend tooling

Special Thanks

  • The open-source community for incredible tools and libraries
  • Contributors who help improve this project
  • Users who provide valuable feedback

📊 Roadmap

Upcoming Features

  • SVM-based document classification (completed)
  • Multi-format file support (PDF, DOCX, TXT) (completed)
  • Docker containerization (Q1 2025)
  • CI/CD pipeline setup (Q1 2025)
  • User authentication system (Q2 2025)
  • Batch file processing (Q2 2025)
  • Advanced analytics dashboard (Q3 2025)
  • Mobile app (React Native) (Q3 2025)
  • Model comparison tools (Q4 2025)
  • API rate limiting (Q4 2025)
  • Multi-language support (Q4 2025)

Long-term Vision

  • Enterprise document management platform
  • Integration with cloud storage services
  • Advanced NLP features (sentiment analysis, keyword extraction)
  • Real-time collaboration features
  • Custom model training interface
  • API marketplace for document processing

📞 Support

Getting Help

If you need help with DocClassify, here are your options:

FAQ

Q: What file formats are supported?
A: DocClassify currently supports PDF, DOCX, and TXT files.

Q: What's the maximum file size?
A: The current limit is 10MB per file, but this can be configured.

Q: How accurate is the classification?
A: Our SVM model achieves around 85% accuracy, but this depends on your training data.

Q: Can I train my own model?
A: Yes, use the training scripts in the server/training/ directory.

Q: Is my data secure?
A: Files are processed locally and not stored permanently. For production use, implement proper security measures.


🌐 Community

Join our growing community of developers and contributors:

  • ⭐ Star this repo if you find it helpful
  • 🍴 Fork and contribute to make it better
  • 📣 Share with others who might benefit
  • 💬 Join discussions to share ideas and feedback


Made with ❤️ by Nimit Gupta & Sanjeevni Dhir

DocClassify - Transforming Document Management with AI
