friilancer/mini-search-engine
Mini Search Engine

This project is a mini search engine designed to crawl a selection of programming documentation sites, index the content using Tantivy, and provide a fast, relevant search interface via Flask.

It showcases:

  • Crawler that fetches pages from specified domains and stores them in SQLite.
  • Indexer that builds a Tantivy index from crawled data.
  • Flask App providing routes to search and view stats, plus an optional UI for querying results.

Table of Contents

  1. Features
  2. Tech Stack & Design Decisions
  3. Project Structure
  4. Installation & Setup
  5. Usage
  6. Contributing

Features

  • Crawl documentation from a user-defined list of domains.
  • Store crawled data (title, snippet, URL, domain) in a local SQLite database.
  • Index using Tantivy for low-latency, full-text search.
  • Flask routes for:
    • POST /api/crawl to trigger the crawler.
    • POST /api/index to rebuild the Tantivy index from the DB.
    • GET /api/search?q=<query> to retrieve top results (JSON).
    • GET /api/stats to view total pages and pages-per-domain stats (JSON).
  • Simple frontend to search and display results with a snippet, title, and link.
  • Domain-based control and robots.txt adherence (if enabled) to respect site crawling policies.
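
The robots.txt adherence mentioned above can be handled with Python's stdlib urllib.robotparser; a minimal sketch (the rule set and user-agent name here are illustrative, not the project's actual values):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, agent: str = "mini-search-engine") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/docs/page"))   # allowed path
print(is_allowed(rules, "https://example.com/private/x"))   # disallowed path
```

In a real crawler you would fetch each domain's /robots.txt once and reuse the parsed rules for every URL on that domain.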

Tech Stack & Design Decisions

  1. Python – chosen for requests (crawler) and BeautifulSoup (HTML parsing), plus it has straightforward bindings for Tantivy.
  2. Flask – lightweight web framework to serve search endpoints and optional HTML pages.
  3. SQLite – quick, file-based database for storing crawled pages (url, title, snippet, and optionally domain).
  4. Tantivy – a Rust-based search engine library with Python bindings, chosen for fast indexing and low-latency queries.
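
The SQLite layer can be as small as a single table; a sketch of the storage and stats queries (table and function names are assumptions, not the project's actual schema):

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the table used to store crawled pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url TEXT PRIMARY KEY,
               title TEXT,
               snippet TEXT,
               domain TEXT
           )"""
    )
    return conn

def save_page(conn, url, title, snippet, domain):
    # INSERT OR REPLACE keeps re-crawled pages deduplicated by URL.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, snippet, domain) VALUES (?, ?, ?, ?)",
        (url, title, snippet, domain),
    )
    conn.commit()

def pages_per_domain(conn):
    """The kind of aggregate that backs a stats endpoint."""
    rows = conn.execute(
        "SELECT domain, COUNT(*) FROM pages GROUP BY domain"
    ).fetchall()
    return dict(rows)
```

Because the URL is the primary key, re-crawling a page updates its row instead of creating duplicates.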

Challenges

  1. Crawling

    • Challenges:
      • Respecting domain/path restrictions and limiting pages (e.g., 10,000 max per domain).
      • Avoiding boilerplate content in docs that could reduce relevancy.
    • Solutions:
      • Implemented a simple Python crawler with requests and BeautifulSoup to parse only allowed domains and store meaningful snippets.
      • Used domain-based filtering and (optionally) robots.txt checks to remain compliant.
      • Extracted the <title> and key body text while skipping script/style tags.
  2. Indexing (Choice Between Tantivy and Vespa)

    • Decision: Tantivy
      • Why Not Vespa? Vespa is powerful for large-scale, distributed use cases, but it’s more complex to set up for a “mini” search engine and felt like overkill for this project.
      • Why Tantivy? Tantivy is lightweight, fast (written in Rust), and easy to integrate via Python bindings. It can handle our sub-50ms latency requirement out of the box with proper tuning.
    • Challenges:
      • Balancing stored vs. indexed fields for performance (only store what we display, index what we need to search).
      • Ensuring consistent reindexing when new pages are crawled.
    • Solutions:
      • Implemented a lean schema (title, snippet, url) with the right tokenizer to improve recall and, in turn, accuracy.
      • Provided an endpoint (/api/index) to recreate or update the index whenever needed.
  3. Ranking & Relevancy

    • Challenges:
      • Risk of repeated boilerplate content across pages or irrelevant results.
    • Solutions:
      • Tokenized the snippet field using en_stem; stemming is slightly slower than the default tokenizer but improves recall.
  4. Proxy Use

    • Usage:
      • We kept it simple by making direct requests, limiting to allowed domains.
    • How We Would Employ It:
      • For high-scale crawling or avoiding IP-based rate limits, we’d integrate a rotating proxy strategy. Each request could route through a proxy pool to distribute load and reduce the chance of being blocked.
  5. Speed

    • Challenge:
      • Keeping search speed under 50ms.
    • Observations & Solutions:
      • This became increasingly challenging as the number of indexed pages grew. Locally, achieving under 80ms was the norm; after deploying to production, depending on the underlying machine and latency from the personal network, under 500ms was typical.
      • A production-ready deployment would run in a more robust environment, with multiple instances and a cache in front of search to improve speed.
  6. Miscellaneous: other engineering decisions (not limited to this list) that could improve the app include:

    • Implementing a cron job to crawl and index the records periodically to keep results fresh and relevant
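
Some of the ideas above can be sketched directly. For instance, the crawl-side extraction from Challenge 1 (grab the <title> and body text while skipping script/style tags) is shown here using only the stdlib html.parser; the project itself uses BeautifulSoup, so this is an illustration of the logic, not the project's code:

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collect the <title> and visible body text, skipping <script>/<style>."""
    SKIP = {"script", "style"}
    VOID = {"br", "hr", "img", "meta", "link", "input", "area", "base",
            "col", "embed", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements never get a closing tag
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if any(t in self.SKIP for t in self._stack):
            return  # ignore text inside script/style
        if "title" in self._stack:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())

def extract(html: str):
    """Return (title, body_text) for one crawled page."""
    p = PageExtractor()
    p.feed(html)
    return p.title.strip(), " ".join(p.chunks)
```

The joined body text is what would be trimmed down into the stored snippet.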

Project Structure

  • crawler/crawled_data.db is generated automatically during crawling.
  • indexer/search_index/ is generated by Tantivy after indexing.

Installation & Setup

1. Running the app locally

There are two ways to get the app up and running locally: in a Docker container or directly on your machine. Choose your pick.

Here is the list of required environment variables (set in .env):

### Comma-separated list of domains to crawl
DOMAINS=angular.io,api.drupal.org,api.haxe.org

### Maximum number of pages per domain
MAX_PAGES_PER_DOMAIN=1000

### File path for the auto-generated index
INDEX_PATH=indexer/search_index/

### Environment specification: local, dev, production
ENVIRONMENT=local
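
On the Python side these variables can be read from os.environ; a minimal sketch (the function name and defaults are illustrative, not necessarily the project's):

```python
import os

def load_config(env=os.environ):
    """Parse the .env-provided settings into usable Python values."""
    return {
        "domains": [d.strip() for d in env.get("DOMAINS", "").split(",") if d.strip()],
        "max_pages_per_domain": int(env.get("MAX_PAGES_PER_DOMAIN", "1000")),
        "index_path": env.get("INDEX_PATH", "indexer/search_index/"),
        "environment": env.get("ENVIRONMENT", "local"),
    }
```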

Without Docker

  1. Clone the Repository

    git clone https://github.com/__username__/mini_search_engine.git
    cd mini_search_engine
    
  2. Create & Activate a Virtual Environment (optional, but recommended)

    python -m venv env
    source env/bin/activate   # Mac/Linux
    env\Scripts\activate      # Windows

  3. Install dependencies

    pip install -r requirements.txt

  4. Create and populate .env (see the required variables above)

  5. Run the app

    flask run

  6. Build the index: call the /api/index endpoint to trigger a (re)build of the index
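
That last step can be scripted from the stdlib; a sketch (the base URL assumes Flask's default local port):

```python
from urllib.request import Request, urlopen

def index_request(base: str = "http://127.0.0.1:5000") -> Request:
    """Build the POST request that triggers an index rebuild."""
    return Request(f"{base}/api/index", method="POST")

def rebuild_index(base: str = "http://127.0.0.1:5000") -> int:
    # Requires the Flask app to be running; returns the HTTP status code.
    with urlopen(index_request(base)) as resp:
        return resp.status
```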

With Docker

  1. Clone the Repository

    git clone https://github.com/__username__/mini_search_engine.git
    cd mini_search_engine
    
  2. Ensure Dependencies

    • Docker is installed and running (Python 3.13+ runs inside the container, so no local Python environment is needed).
  3. Set Environment Variables

  4. Build and run the Docker image with the commands

    • docker build -t mini-search-engine .
    • docker run --env-file .env -p 5000:5000 mini-search-engine

    Your app will be accessible at the displayed URL

  5. Build the index: call the /api/index endpoint to trigger a (re)build of the index


2. Deploying to production

Heroku

Deployment to Heroku is simple and straightforward. You will need a Heroku account, plus Docker installed to test-run the container locally. The Dockerfile and other config files will take care of all dependencies and getting your app running.

  • Create a repo on GitHub

  • Create an app on Heroku

  • Change the stack to container through the dashboard, or via the CLI with heroku stack:set container -a mini-search-engine

  • Connect the app to the GitHub repo and turn on auto deploys

  • Fill in the necessary env vars

    P.S. If you run into any issues, you may need to add env variables on your Heroku dashboard. Go to the created app's dashboard > Settings > Config Vars; there you can add everything that should be in the .env. And, as always, remember to call /api/index to refresh the index.

Fly.io

You can also easily deploy the mini search engine app on Fly.io. Once a repo is connected, it will automatically pick up the already generated fly.toml file.

Usage

UI

If you navigate to http://127.0.0.1:5000/ or http://localhost:5000/, you will see a page that lets you type in a query and see results displayed.

APIs

  1. Crawling: POST /api/crawl triggers crawl_all_domains(). For each domain in .env DOMAINS, the crawler fetches up to MAX_PAGES_PER_DOMAIN pages, storing them in crawler/crawled_data.db.
  2. Indexing: POST /api/index calls index_data() to read from crawler/crawled_data.db and create or update the Tantivy index in indexer/search_index/.
  3. Searching: GET /api/search?q=Your+Query returns a JSON list of top relevant results (title, url, snippet). Under the hood, it uses BM25 with optional field boosting.
  4. Stats: GET /api/stats returns JSON stats like total pages and pages per domain. GET /stats_page (optional) renders an HTML page that calls /api/stats and displays a table of domain counts.
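
The JSON endpoints can be exercised from any HTTP client; here is a stdlib sketch using urllib (the base URL assumes a local Flask run, and calling search() requires the app to be up):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://127.0.0.1:5000"

def search_url(query: str, base: str = BASE) -> str:
    """Build the /api/search URL with a properly encoded query string."""
    return f"{base}/api/search?{urlencode({'q': query})}"

def search(query: str):
    # Requires the Flask app to be running locally.
    with urlopen(search_url(query)) as resp:
        return json.load(resp)

print(search_url("flask routing"))  # note the URL-encoded space
```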

Contributing

Happy Searching! If you have questions or ideas, feel free to open an issue or reach out.
