Features • Demo • Installation • Usage • API • Contributing
- Overview
- Features
- Demo
- Tech Stack
- Installation
- Usage
- API Documentation
- Project Structure
- Model Training
- Configuration
- Contributing
- License
- Authors
- Acknowledgments
DocClassify is a document classification platform that uses a Support Vector Machine (SVM) with TF-IDF vectorization to categorize PDF, DOCX, and TXT documents, with a reported accuracy of around 85%. Designed for businesses, researchers, and organizations, it pairs the classifier with a modern, responsive user interface.
| Feature | Benefit |
|---|---|
| High Accuracy | SVM model with TF-IDF achieves reliable document categorization |
| Real-time Processing | Instant classification with confidence scoring |
| Multi-format Support | Handles PDF, DOCX, and TXT files seamlessly |
| Modern UI/UX | Clean design with smooth animations and responsive layout |
| FastAPI Backend | Scalable and fast API for document processing |
| Confidence Metrics | Detailed classification results with probability scores |
- Multi-format Support: PDF, DOCX, TXT file processing
- Text Extraction: Advanced parsing for different formats
- Preprocessing Pipeline: Cleaning, tokenization, lemmatization
- Single-file Processing: One document is classified per request (batch processing is on the roadmap)
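
The preprocessing steps listed above (cleaning, tokenization, lemmatization) can be sketched in plain Python. This is a minimal illustration, not the project's actual `preprocessing.py`: the `preprocess` function and its `lemmatize` hook are hypothetical names, and the cleaning rule (lowercase, alphabetic tokens only) is an assumption.

```python
import re
from typing import Callable, List, Optional

# Assumed cleaning rule: keep runs of ASCII letters, drop digits and punctuation.
_TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text: str, lemmatize: Optional[Callable[[str], str]] = None) -> List[str]:
    """Clean, tokenize, and optionally lemmatize raw document text."""
    cleaned = text.lower()                 # cleaning: normalize case
    tokens = _TOKEN_RE.findall(cleaned)    # tokenization: alphabetic runs only
    if lemmatize is not None:              # lemmatization: caller-supplied function
        tokens = [lemmatize(tok) for tok in tokens]
    return tokens


if __name__ == "__main__":
    print(preprocess("The Cats, running in 2025!"))  # ['the', 'cats', 'running', 'in']
```

A real lemmatizer can be plugged in through the hook, for example NLTK's `WordNetLemmatizer().lemmatize`, which requires the WordNet data to be downloaded first.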
Frontend:

| Technology | Purpose | Version |
|---|---|---|
| React | UI Framework | 19.x |
| Vite | Build Tool | Latest |
| Tailwind CSS | Styling | Latest |
| Framer Motion | Animations | Latest |
| | HTTP Client | Latest |
| | Routing | Latest |

Backend:

| Technology | Purpose | Version |
|---|---|---|
| Python | Language | 3.9+ |
| FastAPI | API Framework | Latest |
| Scikit-learn | ML Library | Latest |
| NLTK | NLP Processing | Latest |
| | Data Processing | Latest |
| | PDF Processing | Latest |

Before you begin, ensure you have the following installed:
# Check Node.js version (v18 or higher required)
node --version
# Check Python version (3.9 or higher required)
python --version
# Check Git
git --version

Required Software: Node.js v18 or higher, Python 3.9 or higher, and Git.
git clone https://github.com/yourusername/docclassify.git
cd docclassify

# Navigate to client directory
cd client
# Install dependencies
npm install
# Start development server
npm run dev

Frontend will be running at http://localhost:5173
Open a new terminal window:
# Navigate to server directory
cd server
# Create virtual environment (recommended)
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Start FastAPI server
python app.py

Backend API will be running at http://localhost:8000
Visit http://localhost:5173 in your browser. You should see the DocClassify landing page.
Run both frontend and backend simultaneously with a single command:
# From root directory
npm install
npm run dev

This requires the root package.json to have concurrently configured.
Open your browser and go to http://localhost:5173/
Upload a document file:
Supported Formats:
- PDF files (.pdf)
- Word documents (.docx)
- Text files (.txt)
- Maximum file size: 10MB
Click "Classify Document" to receive:
- Document Category (predicted class)
- Confidence Score (0-100%)
- Classification Details
- Processing Time
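
The format and size limits above can be checked before a file is sent for classification. The sketch below is one way to do that; `validate_upload` is a hypothetical helper, and it assumes that "10MB" means 10 × 1024 × 1024 bytes and that formats are identified by file extension only.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}  # supported formats listed above
MAX_FILE_SIZE = 10 * 1024 * 1024                # 10MB limit, assumed to be 10 MiB


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file falls outside the documented limits."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 10MB limit")
```

The extension comparison is case-insensitive, so `report.PDF` passes while `notes.md` is rejected.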
Base URL: http://localhost:8000

GET /
Response:
{
  "status": "healthy",
  "service": "Document Classification API"
}

POST /classify
Content-Type: multipart/form-data
Request Body:
file: [UploadFile] - The document file to classify
Response:
{
  "filename": "document.pdf",
  "prediction": "Category A",
  "confidence": 0.87,
  "message": "Classification successful"
}

Error Response:
{
  "detail": "Error message here",
  "status_code": 400
}

Common Status Codes:
- 200 - Success
- 400 - Bad Request (invalid file or processing error)
- 500 - Internal Server Error
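
As an illustration of calling the endpoints above from a script, here is a minimal client sketch that uses only the Python standard library. The helper names (`build_multipart`, `classify`, `summarize`) are hypothetical, and the part's `application/octet-stream` content type is an assumption; the form field name `file` and the response keys come from the API description above.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple

BASE_URL = "http://localhost:8000"  # base URL given above


def build_multipart(filename: str, content: bytes) -> Tuple[bytes, str]:
    """Encode one file under the form field 'file' as multipart/form-data."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def classify(path: str) -> dict:
    """POST a document to /classify and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{BASE_URL}/classify",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(result: dict) -> str:
    """Format a /classify response, showing confidence as a percentage."""
    return f"{result['filename']}: {result['prediction']} ({result['confidence']:.0%})"
```

With the backend running, `print(summarize(classify("document.pdf")))` would print a line such as `document.pdf: Category A (87%)` for the sample response above. For 400 and 500 responses, `urllib.request.urlopen` raises `urllib.error.HTTPError`, so callers should catch it.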
docclassify/
│
├── client/                       # React Frontend
│   ├── public/                   # Static assets
│   │   ├── favicon.ico
│   │   └── logo.png
│   │
│   ├── src/
│   │   ├── components/           # Reusable components
│   │   │   ├── Footer.jsx        # Footer component
│   │   │   ├── Navbar.jsx        # Navigation bar
│   │   │   └── [Other components]
│   │   │
│   │   ├── pages/                # Page components
│   │   │   ├── Home.jsx          # Home page
│   │   │   ├── Layout.jsx        # Layout wrapper
│   │   │   └── NotFound.jsx      # 404 page
│   │   │
│   │   ├── hooks/                # Custom React hooks
│   │   │   └── [Custom hooks]
│   │   │
│   │   ├── utils/                # Utility functions
│   │   │   └── [Utility files]
│   │   │
│   │   ├── App.jsx               # Main app component
│   │   ├── main.jsx              # Entry point
│   │   └── [Other files]
│   │
│   ├── package.json              # Frontend dependencies
│   ├── vite.config.js            # Vite configuration
│   └── [Other config files]
│
├── server/                       # FastAPI Backend
│   ├── models/                   # Trained ML models
│   │   ├── svm_model.pkl         # SVM model
│   │   ├── tfidf_vectorizer.pkl  # TF-IDF vectorizer
│   │   └── label_encoder.pkl     # Label encoder
│   │
│   ├── training/                 # ML training scripts
│   │   ├── eda.py                # Exploratory data analysis
│   │   ├── preprocessing.py      # Data preprocessing
│   │   ├── model.py              # Model training
│   │   └── test.py               # Model testing
│   │
│   ├── app.py                    # Main FastAPI app
│   ├── requirements.txt          # Python dependencies
│   └── [Other files]
│
├── docs/                         # Documentation
│   ├── API.md                    # API documentation
│   └── [Other docs]
│
├── .gitignore                    # Git ignore rules
├── LICENSE                       # MIT License
├── README.md                     # This file
└── package.json                  # Root package (scripts)

Ensure your CSV file has the following columns:
text, category
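
Before running the training scripts, it can help to confirm that a dataset has the two required columns. The sketch below is a hypothetical check (the `check_columns` name and the sample rows are made up for illustration); it tolerates a space after the comma in the header, as in `text, category`.

```python
import csv
import io
from typing import TextIO

REQUIRED_COLUMNS = {"text", "category"}  # columns named above


def check_columns(handle: TextIO) -> int:
    """Return the number of data rows, or raise ValueError if a column is missing."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    header = {name.strip() for name in (reader.fieldnames or [])}
    missing = REQUIRED_COLUMNS - header
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return sum(1 for _ in reader)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = io.StringIO(
        "text, category\n"
        "Quarterly revenue grew,finance\n"
        "Patient was discharged,medical\n"
    )
    print(check_columns(sample))  # 2
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `check_columns`.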
cd server/training
# Perform EDA
python eda.py --dataset ../data/your_dataset.csv
# Preprocess data
python preprocessing.py --input ../data/your_dataset.csv --output ../data/processed_data.csv
# Train model
python model.py --dataset ../data/processed_data.csv --output ../models/
# Test model
python test.py --model ../models/svm_model.pkl --data ../data/test_data.csv

# EDA
--dataset # Path to training data CSV
# Preprocessing
--input # Input CSV file
--output # Output processed CSV file
# Model Training
--dataset # Path to processed data CSV
--output # Directory to save trained models
--test-size # Test split ratio (default: 0.2)
# Testing
--model # Path to trained model
--data # Path to test data

| Metric | Value |
|---|---|
| Accuracy | 85.2% |
| Precision | 83.1% |
| Recall | 84.5% |
| F1-Score | 83.8% |
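
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; `harmonic_f1` is a hypothetical helper. How the scores are averaged across classes is not stated here, and with macro averaging the reported F1 would not have to match this value exactly.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f"{harmonic_f1(0.831, 0.845):.1%}")  # 83.8%
```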
const API_BASE_URL = import.meta.env.VITE_API_URL || 'http://localhost:8000';

# API Configuration
HOST = "0.0.0.0"
PORT = 8000
# Model Configuration
MODEL_PATH = "models/svm_model.pkl"
VECTORIZER_PATH = "models/tfidf_vectorizer.pkl"
ENCODER_PATH = "models/label_encoder.pkl"
# CORS Settings
ALLOWED_ORIGINS = ["http://localhost:5173"]

Create .env files:
Frontend (.env):
VITE_API_URL=http://localhost:8000

Backend (.env):
DEBUG=True
MODEL_PATH=./models

Edit client/src/styles/theme.css:
:root {
  /* Primary Colors */
  --primary: #6366f1;
  --secondary: #8b5cf6;
  --accent: #ec4899;
  /* Background */
  --background: #0f0f0f;
  --surface: #1a1a1a;
  /* Text */
  --text-primary: #ffffff;
  --text-secondary: #a0a0b0;
}

Edit server/training/model.py:
# SVM Configuration
C = 1.0 # Regularization parameter
kernel = 'linear' # Kernel type
gamma = 'scale' # Kernel coefficient
# TF-IDF Configuration
max_features = 5000 # Maximum features
ngram_range = (1, 2) # N-gram range

We welcome contributions from the community! Here's how you can help:
- Fork the Repository
  Click the 'Fork' button on GitHub.
- Clone Your Fork
  git clone https://github.com/YOUR_USERNAME/docclassify.git
  cd docclassify
- Create a Branch
  git checkout -b feature/AmazingFeature
- Make Your Changes
  - Write clean, documented code
  - Follow existing code style
  - Add tests if applicable
- Commit Your Changes
  git add .
  git commit -m 'Add some AmazingFeature'
- Push to Your Fork
  git push origin feature/AmazingFeature
- Open a Pull Request
  - Go to the original repository
  - Click 'New Pull Request'
  - Describe your changes

Contribution guidelines:
- Follow the existing code style and conventions
- Write meaningful commit messages
- Add comments for complex logic
- Update documentation as needed
- Test your changes thoroughly
- Ensure all tests pass before submitting a PR
JavaScript/React:
- Use functional components with hooks
- Follow Airbnb JavaScript Style Guide
- Use meaningful variable names
- Add JSDoc comments for functions
Python:
- Follow PEP 8 style guide
- Use type hints
- Add docstrings to functions
- Keep functions small and focused
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Nimit Gupta & Sanjeevni Dhir
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- Nimit Gupta, Lead Developer
- Sanjeevni Dhir, Co-Developer
We would like to thank the following projects and communities:
- Scikit-learn - Machine learning algorithms and utilities
- NLTK - Natural language processing toolkit
- React - UI library for building interactive interfaces
- FastAPI - Modern, fast web framework for Python
- Tailwind CSS - Utility-first CSS framework
- Framer Motion - Production-ready animation library
- Vite - Next generation frontend tooling
- The open-source community for incredible tools and libraries
- Contributors who help improve this project
- Users who provide valuable feedback

Roadmap:
Completed:
- SVM-based document classification
- Multi-format file support (PDF, DOCX, TXT)
Planned:
- Docker containerization (Q1 2025)
- CI/CD pipeline setup (Q1 2025)
- User authentication system (Q2 2025)
- Batch file processing (Q2 2025)
- Advanced analytics dashboard (Q3 2025)
- Mobile app (React Native) (Q3 2025)
- Model comparison tools (Q4 2025)
- API rate limiting (Q4 2025)
- Multi-language support (Q4 2025)
Future vision:
- Enterprise document management platform
- Integration with cloud storage services
- Advanced NLP features (sentiment analysis, keyword extraction)
- Real-time collaboration features
- Custom model training interface
- API marketplace for document processing

If you need help with DocClassify, here are your options:
- Email: Contact the development team directly
  - Nimit: guptanimit062@gmail.com
  - Sanjeevni: sanjeevnidhir05@gmail.com
- Bug Reports: Open an issue on GitHub Issues
- Feature Requests: Submit your ideas on GitHub Discussions
- Documentation: Check out our comprehensive guides in the /docs folder
Q: What file formats are supported?
A: DocClassify currently supports PDF, DOCX, and TXT files.
Q: What's the maximum file size?
A: The current limit is 10MB per file, but this can be configured.
Q: How accurate is the classification?
A: Our SVM model achieves around 85% accuracy, but this depends on your training data.
Q: Can I train my own model?
A: Yes, use the training scripts in the server/training/ directory.
Q: Is my data secure?
A: Files are processed locally and not stored permanently. For production use, implement proper security measures.
Join our growing community of developers and contributors:
- Star this repo if you find it helpful
- Fork and contribute to make it better
- Share with others who might benefit
- Join discussions to share ideas and feedback
Made with ❤️ by Nimit Gupta & Sanjeevni Dhir
DocClassify - Transforming Document Management with AI

