This project is a mini search engine designed to crawl a selection of programming documentation sites, index the content using Tantivy, and provide a fast, relevant search interface via Flask.
It showcases:
- Crawler that fetches pages from specified domains, stores them in SQLite.
- Indexer that builds a Tantivy index from crawled data.
- Flask App providing routes to search and view stats, plus an optional UI for querying results.
- Crawl documentation from a user-defined list of domains.
- Store crawled data (title, snippet, URL, domain) in a local SQLite database.
- Index using Tantivy for low-latency, full-text search.
- Flask routes for:
  - `POST /api/crawl` to trigger the crawler.
  - `POST /api/index` to rebuild the Tantivy index from the DB.
  - `GET /search?q=<query>` to retrieve top results (JSON).
  - `GET /stats` to view total pages and pages-per-domain stats (JSON).
- **Simple Frontend** to search and display results with a snippet, title, and link.
- Domain-based control and robots.txt adherence (if enabled) to respect site crawling policies.
- Python – chosen for requests (crawler) and BeautifulSoup (HTML parsing), plus it has straightforward bindings for Tantivy.
- Flask – lightweight web framework to serve search endpoints and optional HTML pages.
- SQLite – quick, file-based database for storing crawled pages (`url`, `title`, `snippet`, and optionally `domain`).
- Tantivy – a Rust-based search engine library with Python bindings, chosen for fast indexing and low-latency queries.
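The SQLite storage layer can be sketched with the standard library's `sqlite3` module. The table name (`pages`) and the sample row below are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema mirroring the fields listed above:
# url, title, snippet, and optionally domain.
conn = sqlite3.connect(":memory:")  # the real app uses crawler/crawled_data.db
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS pages (
        url     TEXT PRIMARY KEY,
        title   TEXT,
        snippet TEXT,
        domain  TEXT
    )
    """
)
conn.execute(
    "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
    ("https://angular.io/docs", "Angular Docs", "Introduction to Angular...", "angular.io"),
)
conn.commit()

row = conn.execute("SELECT title, domain FROM pages").fetchone()
print(row)  # ('Angular Docs', 'angular.io')
```

Using `url` as the primary key with `INSERT OR REPLACE` makes re-crawls idempotent: a page fetched twice simply overwrites its previous row.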
### Crawling
- Challenges:
  - Respecting domain/path restrictions and limiting pages (e.g., 10,000 max per domain).
  - Avoiding boilerplate content in docs that could reduce relevancy.
- Solutions:
  - Implemented a simple Python crawler with `requests` and `BeautifulSoup` to parse only allowed domains and store meaningful snippets.
  - Used domain-based filtering and (optionally) `robots.txt` checks to remain compliant.
  - Extracted the `<title>` and key body text while skipping script/style tags.
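The optional `robots.txt` check mentioned above can be done entirely with the standard library's `urllib.robotparser`. The policy and URLs below are made-up examples; the real crawler would fetch each domain's `robots.txt` instead of parsing a hard-coded one.

```python
from urllib.robotparser import RobotFileParser

# In the real crawler the rules would come from fetching
# https://<domain>/robots.txt; here we parse a made-up policy directly.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url, agent="*"):
    """Return True if the crawl policy permits fetching this URL."""
    return rp.can_fetch(agent, url)

print(allowed("https://example.com/docs/router"))    # True
print(allowed("https://example.com/private/notes"))  # False
```

Calling `allowed()` before each fetch keeps the crawler compliant with site policies at negligible cost.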
### Indexing (Choice Between Tantivy and Vespa)
- Decision: Tantivy
  - Why Not Vespa? Vespa is powerful for large-scale, distributed use cases, but it is more complex to set up for a "mini" search engine and felt like overkill for this project.
  - Why Tantivy? Tantivy is lightweight, fast (written in Rust), and easy to integrate via Python bindings. It can handle our sub-50ms latency requirement out of the box with proper tuning.
- Challenges:
  - Balancing stored vs. indexed fields for performance (only store what we display, index what we need to search).
  - Ensuring consistent reindexing when new pages are crawled.
- Solutions:
  - Implemented a lean schema (`title`, `snippet`, `url`) with the right tokenizer to improve recall, and thereby accuracy.
  - Provided an endpoint (`/api/index`) to recreate or update the index whenever needed.
### Ranking & Relevancy
- Challenges:
  - Risk of repeated boilerplate content across pages or irrelevant results.
- Solutions:
  - Tokenized the snippet field using `en_stem`, a slower tokenizer that is recommended for improving recall.
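One simple way to spot the repeated boilerplate mentioned above is to count how many pages share each snippet line and drop lines that appear on almost every page. This is an illustrative stdlib sketch with made-up pages, not the project's actual relevancy logic.

```python
from collections import Counter

# Hypothetical crawled snippets; the nav line repeats on every page.
pages = {
    "https://docs.example/a": ["Home | API | Guides", "How routing works"],
    "https://docs.example/b": ["Home | API | Guides", "Dependency injection"],
    "https://docs.example/c": ["Home | API | Guides", "Template syntax"],
}

# Count each distinct line once per page.
line_counts = Counter(line for lines in pages.values() for line in set(lines))
threshold = 0.8 * len(pages)  # a line on >80% of pages is likely boilerplate

def clean(lines):
    """Drop lines that occur on nearly every crawled page."""
    return [l for l in lines if line_counts[l] <= threshold]

print(clean(pages["https://docs.example/a"]))  # ['How routing works']
```

Stripping such lines before indexing keeps navigation chrome from dominating BM25 scores.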
### Proxy Use
- Usage:
  - We kept it simple by making direct requests, limited to allowed domains.
- How We Would Employ It:
  - For high-scale crawling or avoiding IP-based rate limits, we would integrate a rotating proxy strategy. Each request could route through a proxy pool to distribute load and reduce the chance of being blocked.
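The rotating-proxy strategy described would look roughly like this. The proxy addresses are placeholders, and the `requests` call is shown only in a comment, since this sketch demonstrates just the rotation itself.

```python
from itertools import cycle

# Placeholder proxy pool; a real deployment would load these from config.
PROXY_POOL = cycle([
    "http://proxy-1.internal:8080",
    "http://proxy-2.internal:8080",
    "http://proxy-3.internal:8080",
])

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Each fetch would then route through a different proxy, e.g.:
# requests.get(url, proxies=next_proxies(), timeout=10)
first = next_proxies()
second = next_proxies()
print(first["http"], second["http"])
```

Round-robin is the simplest policy; a production pool would also drop proxies that start failing or getting blocked.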
### Speed
- Challenge:
  - Keeping search speed under 50ms.
- Observations & Solutions:
  - This became increasingly challenging as the number of indexed pages grew. Locally, under 80ms was the norm; after deploying to production, depending on the underlying machine and personal network latency, under 500ms seems to be the norm.
  - For a production-ready site, the application would be deployed to a more robust environment, with multiple instances running and a cache to improve speed.
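The caching idea above can be sketched with `functools.lru_cache` in front of a (hypothetical) search call; repeated queries are then served from memory and skip the index entirely.

```python
from functools import lru_cache

calls = 0  # counts how often we actually hit the index

@lru_cache(maxsize=1024)
def cached_search(query: str):
    """Hypothetical wrapper around the Tantivy search; results for a
    repeated query are served from memory instead of the index."""
    global calls
    calls += 1
    return (f"results for {query!r}",)  # placeholder for real hits

cached_search("flask routing")
cached_search("flask routing")  # served from cache; no second index hit
print(calls)  # 1
```

A real deployment would likely use a shared cache (e.g., Redis) instead, so that all instances benefit, with invalidation on reindex.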
### Miscellaneous
Other engineering decisions (not limited to this list) that could improve the app:
- Implementing a cron job to crawl and index records periodically, keeping results fresh and relevant.
- `crawler/crawled_data.db` is generated automatically during crawling.
- `indexer/search_index/` is generated by Tantivy after indexing.
There are two ways to get the app up and running locally: in a container or directly on your machine; take your pick. Both need a `.env` file with the following variables:
```env
# Comma-separated list of domains to crawl
DOMAINS=angular.io,api.drupal.org,api.haxe.org
# Maximum number of pages per domain
MAX_PAGES_PER_DOMAIN=1000
# File path for the auto-generated index
INDEX_PATH=indexer/search_index/
# Environment specification: local, dev, production
ENVIRONMENT=local
```
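A small sketch of how the app might read these variables with the standard library; the parsing and fallback defaults here are assumptions, mirroring the sample values above.

```python
import os

# In the real app these come from the .env file; here we fall back to the
# sample values above when the variables are not set.
raw_domains = os.environ.get("DOMAINS", "angular.io,api.drupal.org,api.haxe.org")
domains = [d.strip() for d in raw_domains.split(",") if d.strip()]
max_pages = int(os.environ.get("MAX_PAGES_PER_DOMAIN", "1000"))

print(domains)
print(max_pages)
```

Stripping whitespace and skipping empty entries makes the parser forgiving of values like `DOMAINS=a.io, b.org,`.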
1. Clone the Repository
   ```bash
   git clone https://github.com/__username__/mini_search_engine.git
   cd mini_search_engine
   ```
2. Create & Activate a Virtual Environment (optional, but recommended)
   ```bash
   python -m venv env
   source env/bin/activate   # Mac/Linux
   env\Scripts\activate      # Windows
   ```
3. Install Dependencies
   ```bash
   pip install -r requirements.txt
   ```
4. Create and update `.env`
5. Run the App
   ```bash
   flask run
   ```
6. Reload the Index – call the `/api/index` endpoint to trigger a rebuild of the index.
1. Clone the Repository
   ```bash
   git clone https://github.com/__username__/mini_search_engine.git
   cd mini_search_engine
   ```
2. Ensure Dependencies
   - You have a working Python environment (3.13+ recommended).
3. Set Environment Variables (create and fill in `.env`)
4. Build and Run the Docker Image
   ```bash
   docker build -t mini-search-engine .
   docker run --env-file .env -p 5000:5000 mini-search-engine
   ```
   Your app will be accessible at the displayed URL.
5. Reload the Index – call the `/api/index` endpoint to trigger a rebuild of the index.
Deployment to Heroku is simple and straightforward; you will need a Heroku account, plus Docker installed if you want to test-run the container locally. The Dockerfile and other config files take care of all dependencies and getting your app running.

1. Create a repo on GitHub.
2. Create an app on Heroku.
3. Change the stack to container through the dashboard, or via the CLI with:
   ```bash
   heroku stack:set container -a mini-search-engine
   ```
4. Connect the app to the GitHub repo and turn on auto deploys.
5. Fill in the necessary env vars.

P.S. If you run into any issues, you might need to add env variables on your Heroku dashboard: go to the dashboard of the created app > Settings > Config Vars, and add everything that should be in the `.env`. As always, remember to call `/api/index` to refresh the index.
You can also easily deploy the mini search engine on Fly.io; once a repo is connected, it will automatically pick up the already generated `fly.toml` file.
If you navigate to http://127.0.0.1:5000/ or http://localhost:5000/, you will see a page that lets you type in a query and view the results.
- Crawling – `POST /api/crawl` triggers `crawl_all_domains()`. For each domain in `DOMAINS` (from `.env`), the crawler fetches up to `MAX_PAGES_PER_DOMAIN` pages, storing them in `crawler/crawled_data.db`.
- Indexing – `POST /api/index` calls `index_data()` to read from `crawler/crawled_data.db` and create or update the Tantivy index in `indexer/search_index/`.
- Searching – `GET /api/search?q=Your+Query` returns a JSON list of the top relevant results (title, url, snippet). Under the hood, it uses BM25 with optional field boosting.
- Stats – `GET /api/stats` returns JSON stats like total pages and pages per domain. `GET /stats_page` (optional) renders an HTML page that calls `/stats` and displays a table of domain counts.
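The numbers behind `GET /api/stats` can be computed from the SQLite store with a single `GROUP BY`. The table name (`pages`) and the sample rows below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for crawler/crawled_data.db
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, domain TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("https://angular.io/docs", "angular.io"),
    ("https://angular.io/guide/router", "angular.io"),
    ("https://api.haxe.org/", "api.haxe.org"),
])

total = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
per_domain = dict(conn.execute(
    "SELECT domain, COUNT(*) FROM pages GROUP BY domain"
))

# The shape a /api/stats handler might serialize to JSON.
stats = {"total_pages": total, "pages_per_domain": per_domain}
print(stats)
```

The Flask route would simply `jsonify(stats)`; keeping the aggregation in SQL avoids loading every row into Python.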
Happy Searching! If you have questions or ideas, feel free to open an issue or reach out.