# Search Engine

A high-performance, distributed search engine designed to crawl, index, and rank web content at scale. This project implements a Google-like search architecture using microservices, featuring a C++ crawler and indexer, a Python BM25 ranking service, and a Ruby on Rails web interface.
## Table of Contents

- Features
- Architecture
- Technology Stack
- Project Structure
- Getting Started
- Usage
- API Documentation
- Development
- Architecture Details
- Troubleshooting
- Contributing
- License
## Features

- **Distributed Web Crawler**: High-performance C++ crawler with politeness controls and robots.txt support
- **Efficient Indexing**: Inverted index built on RocksDB for fast lookups
- **BM25 Ranking**: Probabilistic relevance ranking of search results
- **Microservices Architecture**: Event-driven, queue-based design for scalability
- **WARC Storage**: Compact HTML storage format with random-access support
- **Docker Support**: Containerized deployment with Docker Compose
- **Redis Caching**: Fast responses for repeated queries
- **RESTful API**: Clean API interface for search queries
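The BM25 ranking mentioned above can be summarized in a few lines of Python. The sketch below is purely illustrative (it is not the project's `engine.py`), and the defaults `k1=1.5`, `b=0.75` are the common textbook values, not values taken from this codebase:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Score one document against a query with Okapi BM25.

    doc_freqs maps a term to the number of documents containing it;
    k1 and b are the standard BM25 free parameters.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # term absent from the corpus contributes nothing
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```

A real engine would read term frequencies and document statistics from the index store rather than recomputing them per query.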
## Architecture

This project follows a microservices-based, event-driven architecture with four main components: crawler, indexer, ranker, and web interface.
## Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| Crawler | C++ (C++17/20) | High concurrency, low memory footprint |
| Indexer | C++ | Fast string processing, I/O optimization |
| Ranker | Python (NumPy/Flask) | BM25 algorithm, matrix operations |
| Interface | Ruby on Rails 8.0 | MVC framework, API orchestration |
| Message Queue | Redis | Asynchronous task processing |
| Metadata DB | PostgreSQL 17.2 | Structured data storage |
| Index Store | RocksDB | High-performance key-value storage |
| Containerization | Docker & Docker Compose | Easy deployment and scaling |
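The queue-based hand-off between components can be sketched with redis-py-style calls. The queue name `index_jobs` and the job fields below are illustrative, not taken from the project's code; in the real system a `redis.Redis()` client would be passed as `client`, and any object exposing the same `lpush`/`brpop` interface works for local experimentation:

```python
import json

def enqueue_job(client, queue, doc_id, warc_path):
    """Producer side: the crawler pushes a JSON job description."""
    client.lpush(queue, json.dumps({"doc_id": doc_id, "warc_path": warc_path}))

def pop_job(client, queue, timeout=5):
    """Consumer side: the indexer blocks until a job arrives.

    redis-py's brpop returns a (queue_name, payload) tuple,
    or None when the timeout expires.
    """
    item = client.brpop(queue, timeout=timeout)
    if item is None:
        return None
    _, payload = item
    return json.loads(payload)
```

Pairing `lpush` with `brpop` gives FIFO delivery, so indexing jobs are processed in the order they were crawled.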
## Project Structure

```text
Search-Engine/
├── cpp/
│   ├── crawler/               # C++ web crawler
│   │   ├── src/
│   │   │   ├── main.cpp
│   │   │   ├── warc_writer.cpp
│   │   │   └── warc_writer.hpp
│   │   ├── tests/
│   │   └── Dockerfile
│   └── indexer/               # C++ indexer
│       ├── src/
│       ├── tests/
│       └── Dockerfile
├── python/
│   └── ranker/                # Python ranking service
│       ├── app.py             # Flask application
│       ├── engine.py          # BM25 ranking logic
│       ├── requirements.txt
│       └── Dockerfile
├── API/                       # Ruby on Rails interface
│   ├── app/
│   ├── config/
│   ├── Gemfile
│   └── README.md
├── data/
│   ├── init.sql               # Database initialization
│   └── crawled_pages/         # WARC storage
├── docker-compose.yml         # Service orchestration
├── .env.example               # Environment variables template
├── ARCHITECTURE.md            # Detailed architecture documentation
└── README.md                  # This file
```
## Getting Started

### Prerequisites

- Docker (version 20.10 or higher)
- Docker Compose (version 2.0 or higher)
- At least 4 GB of available RAM
- 10 GB of free disk space
### Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/Digvijay-x1/Search-Engine.git
   cd Search-Engine
   ```

2. **Set up environment variables**

   ```bash
   cp .env.example .env
   ```

3. **Edit the `.env` file with your configuration**

   ```env
   DB_USER=admin
   DB_PASS=your_secure_password
   DB_NAME=search_engine
   ```

   > ⚠️ **Important:** Change the default password in production!

4. **Build and start all services**

   ```bash
   docker-compose up --build
   ```
This will start all services:
- Rails Interface: http://localhost:3000
- Python Ranker API: http://localhost:5000
- PostgreSQL: localhost:5434
- Redis: localhost:6380
- Crawler and Indexer: Running in background
### Configuration

The project uses environment variables for configuration. Key variables include:

- `DB_USER`: Database username
- `DB_PASS`: Database password
- `DB_NAME`: Database name
- `DB_HOST`: Database host (defaults to `postgres_service` in Docker)
- `FLASK_ENV`: Flask environment (`development`/`production`)
- `ROCKSDB_PATH`: Path to RocksDB index files
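As a sketch of how a service might consume these variables, the helper below reads them with fallbacks; the defaults shown are assumptions based on the values documented above, not code from the project:

```python
import os

def load_db_config():
    """Read database settings from the environment.

    DB_HOST defaults to the Docker Compose service name, matching
    the behaviour described above; the other defaults are
    illustrative assumptions.
    """
    return {
        "user": os.environ.get("DB_USER", "admin"),
        "password": os.environ.get("DB_PASS", ""),
        "dbname": os.environ.get("DB_NAME", "search_engine"),
        "host": os.environ.get("DB_HOST", "postgres_service"),
    }
```

Reading configuration from the environment keeps the same image usable in Compose, CI, and local runs without code changes.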
## Usage

Once all services are running, navigate to:

```
http://localhost:3000
```
## API Documentation

The Python ranker service exposes a REST API.

**Health Check**

```bash
curl http://localhost:5000/health
```

**Search Query**

```bash
curl "http://localhost:5000/search?q=your+search+query"
```

**Response Format**

```json
{
  "query": "your search query",
  "results": [
    {
      "id": 1,
      "url": "https://example.com",
      "title": "Example Page",
      "snippet": "Relevant snippet from the page...",
      "score": 4.52
    }
  ],
  "meta": {
    "count": 10,
    "latency_ms": 23.45
  }
}
```

### `GET /health`

Check the health status of the ranker service.

Response:

```json
{
  "status": "healthy",
  "service": "ranker"
}
```

### `GET /search`

Execute a search query.

Query Parameters:

- `q` (required): Search query string

Response:

- `query`: The original search query
- `results`: Array of ranked search results
  - `id`: Document ID
  - `url`: Page URL
  - `title`: Page title
  - `snippet`: Text preview
  - `score`: BM25 relevance score
- `meta`: Metadata about the search
  - `count`: Number of results
  - `latency_ms`: Query processing time
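As an illustration, a small Python client for this API might look like the following. `search` requires the stack to be running (`docker-compose up`), while `parse_results` works on any payload in the documented shape; both function names are this sketch's own, not part of the project:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

RANKER_URL = "http://localhost:5000"  # ranker port from docker-compose

def search(query, base_url=RANKER_URL):
    """Query the ranker API; requires the services to be running."""
    with urlopen(f"{base_url}/search?{urlencode({'q': query})}") as resp:
        return parse_results(json.load(resp))

def parse_results(payload):
    """Extract (url, score) pairs from a response, highest score first."""
    results = payload.get("results", [])
    return sorted(((r["url"], r["score"]) for r in results),
                  key=lambda pair: pair[1], reverse=True)
```

Sorting client-side is defensive; the ranker already returns results in descending score order.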
## Development

### Running Individual Services

Start only the ranker service:

```bash
docker-compose up ranker_service postgres_service redis_service
```

Start only the crawler:

```bash
docker-compose up crawler_service redis_service postgres_service
```

### Running Locally

**Python Ranker:**

```bash
cd python/ranker
pip install -r requirements.txt
python app.py
```

**C++ Crawler:**

```bash
cd cpp/crawler
mkdir build && cd build
cmake ../src
make
./crawler
```

### Viewing Logs

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f ranker_service
docker-compose logs -f crawler_service
docker-compose logs -f rails_interface
```

### Database Access

Connect to PostgreSQL:

```bash
docker-compose exec postgres_service psql -U admin -d search_engine
```

View crawled documents:

```sql
SELECT id, url, title, status FROM documents LIMIT 10;
```

### Stopping Services

```bash
# Stop all services
docker-compose down

# Stop and remove volumes (clears database)
docker-compose down -v
```

## Architecture Details

For comprehensive architecture documentation, see ARCHITECTURE.md.
- **Web Crawler (C++)**
  - Implements a URL frontier with a Bloom filter for visited-URL checks
  - Respects robots.txt and rate limits
  - Stores content in WARC format
  - Handles DNS caching and connection pooling

- **Indexer (C++)**
  - Tokenizes and processes HTML content
  - Builds the inverted index in RocksDB
  - Calculates document statistics for BM25
  - Implements the Porter2 stemming algorithm

- **Ranker (Python)**
  - BM25 (Okapi) ranking algorithm
  - Vectorized operations with NumPy
  - Memory-mapped index access
  - Redis caching for frequent queries

- **Web Interface (Ruby on Rails)**
  - Query orchestration
  - Result formatting and snippet generation
  - Cache management
  - User interface
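The indexer's tokenize → stem → postings pipeline can be sketched in Python. Here a plain dict stands in for RocksDB and a crude suffix-stripper stands in for Porter2; this mirrors the index layout only, not the C++ implementation:

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")

def crude_stem(token):
    """Stand-in for Porter2: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_document(index, doc_id, text):
    """Add one document's postings to an in-memory inverted index.

    index maps term -> {doc_id: term_frequency}, mirroring the
    term -> postings-list layout a RocksDB-backed index would use.
    """
    for token in TOKEN_RE.findall(text.lower()):
        term = crude_stem(token)
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

index = defaultdict(dict)
index_document(index, 1, "Crawling and indexing web pages")
index_document(index, 2, "Ranked web search")
```

The per-term document frequencies and per-document lengths this structure yields are exactly the statistics BM25 consumes at query time.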
### Data Flow

1. **Crawling Phase (Offline)**
   - Crawler fetches pages and stores them in WARC files
   - Metadata saved to PostgreSQL
   - Jobs queued in Redis

2. **Indexing Phase (Offline)**
   - Indexer reads WARC files
   - Extracts and tokenizes content
   - Updates the inverted index in RocksDB
   - Updates document metadata

3. **Search Phase (Online)**
   - User submits a query via the Rails interface
   - Rails checks the Redis cache
   - On a miss, Rails calls the Python ranker API
   - Ranker queries the RocksDB index
   - Ranker returns ranked document IDs
   - Rails fetches metadata from PostgreSQL
   - Results are displayed to the user
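The online phase above amounts to a cache-aside lookup, which can be sketched as follows. A dict stands in for Redis, and plain callables stand in for the ranker API and the PostgreSQL metadata fetch; all names are this sketch's own:

```python
def handle_query(query, cache, rank_fn, fetch_metadata_fn):
    """Search-phase flow: cache check, then ranker, then metadata join.

    cache             -- dict standing in for Redis
    rank_fn           -- callable returning ranked doc IDs (ranker API)
    fetch_metadata_fn -- callable mapping IDs to rows (PostgreSQL)
    """
    if query in cache:                    # Rails checks the Redis cache
        return cache[query]
    doc_ids = rank_fn(query)              # miss: call the ranker API
    results = fetch_metadata_fn(doc_ids)  # join metadata from PostgreSQL
    cache[query] = results                # populate cache for next time
    return results
```

A production version would also set a TTL on cached entries so stale results age out as the index is refreshed.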
## Troubleshooting

**Issue: Services won't start**

```bash
# Check Docker is running
docker ps

# Check logs for errors
docker-compose logs

# Rebuild containers
docker-compose down
docker-compose up --build
```

**Issue: Database connection errors**

```bash
# Verify environment variables
cat .env

# Check PostgreSQL is running
docker-compose ps postgres_service

# Restart database service
docker-compose restart postgres_service
```

**Issue: Port already in use**

```bash
# Find the process using the port
lsof -i :3000
lsof -i :5000

# Kill the process, or change the port in docker-compose.yml
```

**Issue: Out of memory**

Increase the Docker memory limit in Docker Desktop settings, or reduce the number of running services.

**Health checks**

```bash
# Check all running containers
docker-compose ps

# Test ranker API
curl http://localhost:5000/health

# Test Rails interface
curl http://localhost:3000
```

## Contributing

Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Guidelines

- Follow existing code style and conventions
- Write meaningful commit messages
- Add tests for new features
- Update documentation as needed
- Ensure Docker builds succeed
## License

This project is open source and available under the MIT License.
## Acknowledgments

- Inspired by Google's original search engine architecture
- Built with modern microservices best practices
- Uses industry-standard algorithms (BM25, Porter2 stemmer)
## Support

For questions and support:
- Open an issue on GitHub
- Check the ARCHITECTURE.md for detailed technical information
- Review the API README: API/README.md
Built with ❤️ by the openSearch Team
