This project is a mini search engine designed to crawl a selection of programming documentation sites, index the content using Tantivy, and provide a fast, relevant search interface via Flask.
It showcases:
- Crawler that fetches pages from specified domains, stores them in SQLite.
- Indexer that builds a Tantivy index from crawled data.
- Flask App providing routes to search and view stats, plus an optional UI for querying results.
- Crawl documentation from a user-defined list of domains.
- Store crawled data (title, snippet, URL, domain) in a local SQLite database.
- Index using Tantivy for low-latency, full-text search.
- Flask routes for:
  - `POST /api/crawl` to trigger the crawler.
  - `POST /api/index` to rebuild the Tantivy index from the DB.
  - `GET /search?q=<query>` to retrieve top results (JSON).
  - `GET /stats` to view total pages and pages-per-domain stats (JSON).
- **Simple Frontend** to search and display results with a snippet, title, and link.
- Domain-based control and robots.txt adherence (if enabled) to respect site crawling policies.
- Python – chosen for requests (crawler) and BeautifulSoup (HTML parsing), plus it has straightforward bindings for Tantivy.
- Flask – lightweight web framework to serve search endpoints and optional HTML pages.
- SQLite – quick, file-based database for storing crawled pages (`url`, `title`, `snippet`, and optionally `domain`).
- Tantivy – a Rust-based search engine library with Python bindings, chosen for fast indexing and low-latency queries.
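The SQLite storage layer can be sketched with the standard library's `sqlite3` module. The table name (`pages`) and the sample row below are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema mirroring the fields listed above:
# url, title, snippet, and optionally domain.
conn = sqlite3.connect(":memory:")  # the real app uses crawler/crawled_data.db
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS pages (
        url     TEXT PRIMARY KEY,
        title   TEXT,
        snippet TEXT,
        domain  TEXT
    )
    """
)
conn.execute(
    "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
    ("https://angular.io/docs", "Angular Docs", "Introduction to Angular...", "angular.io"),
)
conn.commit()

row = conn.execute("SELECT title, domain FROM pages").fetchone()
print(row)  # ('Angular Docs', 'angular.io')
```

Using `url` as the primary key with `INSERT OR REPLACE` makes re-crawls idempotent: a page fetched twice simply overwrites its previous row.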
### Crawling
- Challenges:
  - Respecting domain/path restrictions and limiting pages (e.g., 10,000 max per domain).
  - Avoiding boilerplate content in docs that could reduce relevancy.
- Solutions:
  - Implemented a simple Python crawler with `requests` and `BeautifulSoup` to parse only allowed domains and store meaningful snippets.
  - Used domain-based filtering and (optionally) `robots.txt` checks to remain compliant.
  - Extracted the `<title>` and key body text while skipping script/style tags.
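The optional `robots.txt` check mentioned above can be done entirely with the standard library's `urllib.robotparser`. The policy and URLs below are made-up examples; the real crawler would fetch each domain's `robots.txt` instead of parsing a hard-coded one.

```python
from urllib.robotparser import RobotFileParser

# In the real crawler the rules would come from fetching
# https://<domain>/robots.txt; here we parse a made-up policy directly.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url, agent="*"):
    """Return True if the crawl policy permits fetching this URL."""
    return rp.can_fetch(agent, url)

print(allowed("https://example.com/docs/router"))    # True
print(allowed("https://example.com/private/notes"))  # False
```

Calling `allowed()` before each fetch keeps the crawler compliant with site policies at negligible cost.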
### Indexing (Choice Between Tantivy and Vespa)
- Decision: Tantivy
  - Why Not Vespa? Vespa is powerful for large-scale, distributed use cases, but it is more complex to set up for a "mini" search engine and felt like overkill for this project.
  - Why Tantivy? Tantivy is lightweight, fast (written in Rust), and easy to integrate via Python bindings. It can handle our sub-50ms latency requirement out of the box with proper tuning.
- Challenges:
  - Balancing stored vs. indexed fields for performance (only store what we display, index what we need to search).
  - Ensuring consistent reindexing when new pages are crawled.
- Solutions:
  - Implemented a lean schema (`title`, `snippet`, `url`) with the right tokenizer to improve recall, and thereby accuracy.
  - Provided an endpoint (`/api/index`) to recreate or update the index whenever needed.
### Ranking & Relevancy
- Challenges:
  - Risk of repeated boilerplate content across pages or irrelevant results.
- Solutions:
  - Tokenized the snippet field using `en_stem`, a slower tokenizer that is recommended for improving recall.
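One simple way to spot the repeated boilerplate mentioned above is to count how many pages share each snippet line and drop lines that appear on almost every page. This is an illustrative stdlib sketch with made-up pages, not the project's actual relevancy logic.

```python
from collections import Counter

# Hypothetical crawled snippets; the nav line repeats on every page.
pages = {
    "https://docs.example/a": ["Home | API | Guides", "How routing works"],
    "https://docs.example/b": ["Home | API | Guides", "Dependency injection"],
    "https://docs.example/c": ["Home | API | Guides", "Template syntax"],
}

# Count each distinct line once per page.
line_counts = Counter(line for lines in pages.values() for line in set(lines))
threshold = 0.8 * len(pages)  # a line on >80% of pages is likely boilerplate

def clean(lines):
    """Drop lines that occur on nearly every crawled page."""
    return [l for l in lines if line_counts[l] <= threshold]

print(clean(pages["https://docs.example/a"]))  # ['How routing works']
```

Stripping such lines before indexing keeps navigation chrome from dominating BM25 scores.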
### Proxy Use
- Usage:
  - We kept it simple by making direct requests, limited to allowed domains.
- How We Would Employ It:
  - For high-scale crawling or avoiding IP-based rate limits, we would integrate a rotating proxy strategy. Each request could route through a proxy pool to distribute load and reduce the chance of being blocked.
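The rotating-proxy strategy described would look roughly like this. The proxy addresses are placeholders, and the `requests` call is shown only in a comment, since this sketch demonstrates just the rotation itself.

```python
from itertools import cycle

# Placeholder proxy pool; a real deployment would load these from config.
PROXY_POOL = cycle([
    "http://proxy-1.internal:8080",
    "http://proxy-2.internal:8080",
    "http://proxy-3.internal:8080",
])

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Each fetch would then route through a different proxy, e.g.:
# requests.get(url, proxies=next_proxies(), timeout=10)
first = next_proxies()
second = next_proxies()
print(first["http"], second["http"])
```

Round-robin is the simplest policy; a production pool would also drop proxies that start failing or getting blocked.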
### Speed
- Challenge:
  - Keeping search speed under 50ms.
- Observations & Solutions:
  - This became increasingly challenging as the number of indexed pages grew. Locally, under 80ms was the norm; after deploying to production, depending on the underlying machine and personal network latency, under 500ms seems to be the norm.
  - For a production-ready site, the application would be deployed to a more robust environment, with multiple instances running and a cache to improve speed.
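The caching idea above can be sketched with `functools.lru_cache` in front of a (hypothetical) search call; repeated queries are then served from memory and skip the index entirely.

```python
from functools import lru_cache

calls = 0  # counts how often we actually hit the index

@lru_cache(maxsize=1024)
def cached_search(query: str):
    """Hypothetical wrapper around the Tantivy search; results for a
    repeated query are served from memory instead of the index."""
    global calls
    calls += 1
    return (f"results for {query!r}",)  # placeholder for real hits

cached_search("flask routing")
cached_search("flask routing")  # served from cache; no second index hit
print(calls)  # 1
```

A real deployment would likely use a shared cache (e.g., Redis) instead, so that all instances benefit, with invalidation on reindex.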
### Miscellaneous
Other engineering decisions (not limited to this list) that could improve the app:
- Implementing a cron job to crawl and index records periodically, keeping results fresh and relevant.
- `crawler/crawled_data.db` is generated automatically during crawling.
- `indexer/search_index/` is generated by Tantivy after indexing.
There are two ways to get the app up and running locally: in a container or directly on your machine; take your pick. Both need a `.env` file with the following variables:
```env
# Comma-separated list of domains to crawl
DOMAINS=angular.io,api.drupal.org,api.haxe.org
# Maximum number of pages per domain
MAX_PAGES_PER_DOMAIN=1000
# File path for the auto-generated index
INDEX_PATH=indexer/search_index/
# Environment specification: local, dev, production
ENVIRONMENT=local
```
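A small sketch of how the app might read these variables with the standard library; the parsing and fallback defaults here are assumptions, mirroring the sample values above.

```python
import os

# In the real app these come from the .env file; here we fall back to the
# sample values above when the variables are not set.
raw_domains = os.environ.get("DOMAINS", "angular.io,api.drupal.org,api.haxe.org")
domains = [d.strip() for d in raw_domains.split(",") if d.strip()]
max_pages = int(os.environ.get("MAX_PAGES_PER_DOMAIN", "1000"))

print(domains)
print(max_pages)
```

Stripping whitespace and skipping empty entries makes the parser forgiving of values like `DOMAINS=a.io, b.org,`.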
1. Clone the Repository
   ```bash
   git clone https://github.com/__username__/mini_search_engine.git
   cd mini_search_engine
   ```
2. Create & Activate a Virtual Environment (optional, but recommended)
   ```bash
   python -m venv env
   source env/bin/activate   # Mac/Linux
   env\Scripts\activate      # Windows
   ```
3. Install Dependencies
   ```bash
   pip install -r requirements.txt
   ```
4. Create and update `.env`
5. Run the App
   ```bash
   flask run
   ```
6. Reload the Index – call the `/api/index` endpoint to trigger a rebuild of the index.
1. Clone the Repository
   ```bash
   git clone https://github.com/__username__/mini_search_engine.git
   cd mini_search_engine
   ```
2. Ensure Dependencies
   - You have a working Python environment (3.13+ recommended).
3. Set Environment Variables (create and fill in `.env`)
4. Build and Run the Docker Image
   ```bash
   docker build -t mini-search-engine .
   docker run --env-file .env -p 5000:5000 mini-search-engine
   ```
   Your app will be accessible at the displayed URL.
5. Reload the Index – call the `/api/index` endpoint to trigger a rebuild of the index.
Deployment to Heroku is simple and straightforward; you will need a Heroku account, plus Docker installed if you want to test-run the container locally. The Dockerfile and other config files take care of all dependencies and getting your app running.

1. Create a repo on GitHub.
2. Create an app on Heroku.
3. Change the stack to container through the dashboard, or via the CLI with:
   ```bash
   heroku stack:set container -a mini-search-engine
   ```
4. Connect the app to the GitHub repo and turn on auto deploys.
5. Fill in the necessary env vars.

P.S. If you run into any issues, you might need to add env variables on your Heroku dashboard: go to the dashboard of the created app > Settings > Config Vars, and add everything that should be in the `.env`. As always, remember to call `/api/index` to refresh the index.
You can also easily deploy the mini search engine on Fly.io; once a repo is connected, it will automatically pick up the already generated `fly.toml` file.
If you navigate to http://127.0.0.1:5000/ or http://localhost:5000/, you will see a page that lets you type in a query and view the results.
- Crawling – `POST /api/crawl` triggers `crawl_all_domains()`. For each domain in `DOMAINS` (from `.env`), the crawler fetches up to `MAX_PAGES_PER_DOMAIN` pages, storing them in `crawler/crawled_data.db`.
- Indexing – `POST /api/index` calls `index_data()` to read from `crawler/crawled_data.db` and create or update the Tantivy index in `indexer/search_index/`.
- Searching – `GET /api/search?q=Your+Query` returns a JSON list of the top relevant results (title, url, snippet). Under the hood, it uses BM25 with optional field boosting.
- Stats – `GET /api/stats` returns JSON stats like total pages and pages per domain. `GET /stats_page` (optional) renders an HTML page that calls `/stats` and displays a table of domain counts.
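The numbers behind `GET /api/stats` can be computed from the SQLite store with a single `GROUP BY`. The table name (`pages`) and the sample rows below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for crawler/crawled_data.db
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, domain TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("https://angular.io/docs", "angular.io"),
    ("https://angular.io/guide/router", "angular.io"),
    ("https://api.haxe.org/", "api.haxe.org"),
])

total = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
per_domain = dict(conn.execute(
    "SELECT domain, COUNT(*) FROM pages GROUP BY domain"
))

# The shape a /api/stats handler might serialize to JSON.
stats = {"total_pages": total, "pages_per_domain": per_domain}
print(stats)
```

The Flask route would simply `jsonify(stats)`; keeping the aggregation in SQL avoids loading every row into Python.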
Happy Searching! If you have questions or ideas, feel free to open an issue or reach out.