Python-based search engine with lexicon, forward/inverted index, and Flask web UI — inspired by Google's original paper

humzakt/DSA_Search_Engine

Search Engine

A Python-based search engine with lexicon generation, forward/inverted indexing, and a Flask web interface. Built as an end-semester project for Data Structures & Algorithms (Fall 2021) at NUST.

Inspired by The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin & Page, Stanford).

Features

  • Document Upload — Upload text documents through the web UI
  • Lexicon Generation — Tokenization with stopword removal using NLTK
  • Forward Index — Bucket-based indexing on document IDs with duplicate elimination
  • Inverted Index — Multi-threaded construction by splitting datasets across threads, building temporary indices, and merging
  • Search — Query single words or phrases against the inverted index for fast retrieval
  • Web Interface — Flask-powered frontend with HTML/CSS/JavaScript
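The lexicon and forward-index features can be illustrated with a minimal sketch. It is not the repository's code: the `STOPWORDS` set is a tiny hypothetical stand-in for NLTK's English stopword list (so the sketch runs without NLTK data), the regex tokenizer replaces NLTK's tokenizer, and `bucket_size` and the `doc_id // bucket_size` bucketing rule are assumptions about what "bucket-based indexing on document IDs" might look like.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the
# sketch runs without downloading any NLTK data.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_lexicon(docs):
    """Map each distinct token to an integer word ID, in first-seen order."""
    lexicon = {}
    for text in docs.values():
        for token in tokenize(text):
            lexicon.setdefault(token, len(lexicon))
    return lexicon


def build_forward_index(docs, lexicon, bucket_size=2):
    """Group documents into buckets by doc ID; each doc maps to its unique word IDs."""
    buckets = defaultdict(dict)
    for doc_id, text in docs.items():
        # dict.fromkeys removes duplicate word IDs while keeping first-seen order.
        word_ids = list(dict.fromkeys(lexicon[t] for t in tokenize(text)))
        buckets[doc_id // bucket_size][doc_id] = word_ids
    return dict(buckets)


docs = {0: "The cat and the dog", 1: "A dog is a dog", 2: "Cat of the house"}
lexicon = build_lexicon(docs)
forward_index = build_forward_index(docs, lexicon)
```

Here document 1 contains "dog" twice but its forward-index entry lists the word ID once, which is one plausible reading of "duplicate elimination".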

Tech Stack

  • Python — core engine and indexing logic
  • Flask — web server and API
  • NLTK — tokenization and stopword removal
  • HTML / CSS / JavaScript — frontend

Getting Started

# Clone the repo
git clone https://github.com/humzakt/DSA_Search_Engine.git
cd DSA_Search_Engine

# Install dependencies
pip install -r requirements.txt

# Download NLTK data
# (newer NLTK releases may also need the 'punkt_tab' resource for tokenization)
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

# Run the server
python RUNSERVER.py

Then open http://127.0.0.1:5000/ in your browser.

Before searching: upload your documents, then generate the lexicon, forward index, and inverted index (in that order) using the controls on the search page.

How It Works

  1. Upload — Documents are uploaded via the web interface
  2. Lexicon — Tokens are extracted and stored in a dictionary
  3. Forward Index — Built with threading; maps document IDs to term lists
  4. Inverted Index — Built with threading; the dataset is split, temporary indices are created in parallel, then merged
  5. Search — Query terms are looked up in the inverted index to retrieve matching documents
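Steps 4 and 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the round-robin split, the use of `ThreadPoolExecutor`, and the AND semantics for multi-word queries are all hypothetical choices. True phrase matching would also need word positions, which this sketch does not store.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_partial_index(chunk):
    """Build a temporary inverted index (term -> set of doc IDs) for one chunk."""
    partial = defaultdict(set)
    for doc_id, terms in chunk:
        for term in terms:
            partial[term].add(doc_id)
    return partial


def build_inverted_index(forward_index, workers=2):
    """Split the forward index across threads, then merge the partial indices."""
    items = sorted(forward_index.items())
    # Round-robin split: chunk i gets every `workers`-th document.
    chunks = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(build_partial_index, chunks))
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return {term: sorted(ids) for term, ids in merged.items()}


def search(inverted_index, query):
    """Return IDs of documents containing every query term (AND is an assumption)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(inverted_index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings))


# Toy forward index keyed by doc ID; terms are plain strings for readability.
forward_index = {0: ["cat", "dog"], 1: ["dog"], 2: ["cat", "house"]}
inverted_index = build_inverted_index(forward_index)
results = search(inverted_index, "cat dog")
```

Because each thread writes only to its own temporary index, no locking is needed until the single-threaded merge. Note that in standard CPython the global interpreter lock limits the speedup threads can give for CPU-bound indexing, so the benefit comes mainly when file reading dominates.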

Project Structure

├── RUNSERVER.py                # Flask entry point
├── ProjectConfiguration.py     # Config settings
├── Lexicon.py                  # Lexicon data structure
├── GenerateLexicon.py          # Lexicon builder
├── ForwardIndex.py             # Forward index data structure
├── GenerateForwardIndex.py     # Forward index builder (threaded)
├── InvertedIndex.py            # Inverted index data structure
├── GenerateInvertedIndex.py    # Inverted index builder (threaded)
├── processFile.py              # Document preprocessing
├── search/                     # Search logic
├── flask_server/               # Flask routes and templates
├── Dataset/                    # Sample documents
└── Output/                     # Generated index files

Authors

  • Humza Khawar — 343114
  • M. Huzaifa — 332839

Submitted to Dr. Faisal Shafait, NUST.
