A Python-based search engine with lexicon generation, forward/inverted indexing, and a Flask web interface. Built as an end-semester project for Data Structures & Algorithms (Fall 2021) at NUST.
Inspired by "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, Stanford).
- Document Upload — Upload text documents through the web UI
- Lexicon Generation — Tokenization with stopword removal using NLTK
- Forward Index — Bucket-based indexing on document IDs with duplicate elimination
- Inverted Index — Multi-threaded construction by splitting datasets across threads, building temporary indices, and merging
- Search — Query single words or phrases against the inverted index for fast retrieval
- Web Interface — Flask-powered frontend with HTML/CSS/JavaScript
- Python — core engine and indexing logic
- Flask — web server and API
- NLTK — tokenization and stopword removal
- HTML / CSS / JavaScript — frontend
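The lexicon and forward index features listed above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (`tokenize`, `build_lexicon`, `build_forward_index`), the `bucket_size` parameter, and the reading of "bucket-based" as `doc_id // bucket_size` are all assumptions made for illustration. To stay self-contained, it swaps NLTK's tokenizer and stopword corpus for a regular expression and a tiny hardcoded stopword set.

```python
import re

# Stand-in stopword list (assumption); the project uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "on", "in", "to"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_lexicon(docs):
    """Assign each distinct token a word ID in order of first appearance."""
    lexicon = {}
    for doc_id in sorted(docs):
        for token in tokenize(docs[doc_id]):
            lexicon.setdefault(token, len(lexicon))
    return lexicon


def build_forward_index(docs, lexicon, bucket_size=2):
    """Map bucket number -> {doc_id: sorted unique word IDs}.

    Grouping by doc_id // bucket_size is a hypothetical bucketing scheme.
    Building a set per document is what eliminates duplicate terms.
    """
    buckets = {}
    for doc_id, text in docs.items():
        word_ids = sorted({lexicon[t] for t in tokenize(text)})
        buckets.setdefault(doc_id // bucket_size, {})[doc_id] = word_ids
    return buckets


docs = {
    0: "The cat sat on the mat",
    1: "The dog chased the cat",
    2: "A dog and a cat",
}
lexicon = build_lexicon(docs)
forward_index = build_forward_index(docs, lexicon)
print(lexicon)        # {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3, 'chased': 4}
print(forward_index)  # {0: {0: [0, 1, 2], 1: [0, 3, 4]}, 1: {2: [0, 3]}}
```

Storing word IDs instead of raw strings keeps the forward index compact, and the lexicon dictionary gives constant-time lookup from token to ID.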
# Clone the repo
git clone https://github.com/humzakt/DSA_Search_Engine.git
cd DSA_Search_Engine
# Install dependencies
pip install -r requirements.txt
# Download NLTK data
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
# Run the server
python RUNSERVER.py
Then open http://127.0.0.1:5000/ in your browser.
Before searching, upload documents and generate the lexicon, forward index, and inverted index using the controls on the search page.
- Upload — Documents are uploaded via the web interface
- Lexicon — Tokens are extracted and stored in a dictionary
- Forward Index — Built with threading; maps document IDs to term lists
- Inverted Index — Built with threading; the dataset is split, temporary indices are created in parallel, then merged
- Search — Query terms are looked up in the inverted index to retrieve matching documents
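The last two steps above (split the dataset, build temporary indices in parallel, merge, then look up query terms) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `GenerateInvertedIndex.py` or `search/`: the function names and the `num_threads` parameter are hypothetical, the forward index holds term strings rather than word IDs for readability, and a multi-word query is treated as "all terms must appear" rather than as an exact phrase match.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_partial_index(chunk):
    """Build a temporary inverted index for one slice of the forward index."""
    partial = defaultdict(set)
    for doc_id, terms in chunk:
        for term in terms:
            partial[term].add(doc_id)
    return partial


def build_inverted_index(forward_index, num_threads=2):
    """Split the forward index, index each chunk in a thread, then merge."""
    items = sorted(forward_index.items())
    # Round-robin split so each thread gets a similar number of documents.
    chunks = [items[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(build_partial_index, chunks))
    # Merge the temporary indices into one term -> doc IDs mapping.
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return {term: sorted(ids) for term, ids in merged.items()}


def search(inverted_index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(inverted_index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings))


forward_index = {
    0: ["cat", "sat", "mat"],
    1: ["dog", "chased", "cat"],
    2: ["dog", "cat"],
}
inverted_index = build_inverted_index(forward_index)
print(search(inverted_index, "cat"))      # [0, 1, 2]
print(search(inverted_index, "dog cat"))  # [1, 2]
print(search(inverted_index, "bird"))     # []
```

Each thread writes only to its own temporary index, so no locking is needed until the single-threaded merge. Exact phrase matching would additionally require storing term positions in the postings, which this sketch omits.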
├── RUNSERVER.py # Flask entry point
├── ProjectConfiguration.py # Config settings
├── Lexicon.py # Lexicon data structure
├── GenerateLexicon.py # Lexicon builder
├── ForwardIndex.py # Forward index data structure
├── GenerateForwardIndex.py # Forward index builder (threaded)
├── InvertedIndex.py # Inverted index data structure
├── GenerateInvertedIndex.py # Inverted index builder (threaded)
├── processFile.py # Document preprocessing
├── search/ # Search logic
├── flask_server/ # Flask routes and templates
├── Dataset/ # Sample documents
└── Output/ # Generated index files
- Humza Khawar — 343114
- M. Huzaifa — 332839
Submitted to Dr. Faisal Shafait, NUST.