IGI by openSearch


A high-performance, distributed search engine designed to crawl, index, and rank web content at scale. This project implements a Google-like search architecture using microservices, featuring a C++ crawler and indexer, Python-based BM25 ranking algorithm, and Ruby on Rails web interface.

📋 Table of Contents

  • Features
  • Architecture
  • Technology Stack
  • Project Structure
  • Getting Started
  • Usage
  • API Documentation
  • Development
  • Architecture Details
  • Troubleshooting
  • Contributing
  • License
  • Acknowledgments
  • Support

✨ Features

  • Distributed Web Crawler: High-performance C++ crawler with politeness controls and robots.txt support
  • Efficient Indexing: Inverted index using RocksDB for fast lookups
  • BM25 Ranking Algorithm: Advanced probabilistic ranking for relevant search results
  • Microservices Architecture: Event-driven, queue-based system for scalability
  • WARC Storage: Efficient HTML storage format with random access support
  • Docker Support: Containerized deployment with Docker Compose
  • Redis Caching: Fast query response with intelligent caching
  • RESTful API: Clean API interface for search queries
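The inverted index behind "Efficient Indexing" maps each term to the documents that contain it. A rough in-memory sketch of the idea, with a plain dict standing in for the RocksDB key-value store used by the real C++ indexer:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it.

    `docs` is a dict of {doc_id: text}. A plain dict stands in
    for the RocksDB store the real indexer writes to.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization; the real indexer also stems terms.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "distributed search engine",
    2: "search ranking with BM25",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # → [1, 2]: both documents contain "search"
```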

πŸ—οΈ Architecture

This project follows a microservices-based, event-driven architecture with four main components:

[Architecture diagram]

🛠️ Technology Stack

Component         Technology               Purpose
---------         ----------               -------
Crawler           C++ (C++17/20)           High concurrency, low memory footprint
Indexer           C++                      Fast string processing, I/O optimization
Ranker            Python (NumPy/Flask)     BM25 algorithm, matrix operations
Interface         Ruby on Rails 8.0        MVC framework, API orchestration
Message Queue     Redis                    Asynchronous task processing
Metadata DB       PostgreSQL 17.2          Structured data storage
Index Store       RocksDB                  High-performance key-value storage
Containerization  Docker & Docker Compose  Easy deployment and scaling

📁 Project Structure

Search-Engine/
├── cpp/
│   ├── crawler/          # C++ web crawler
│   │   ├── src/
│   │   │   ├── main.cpp
│   │   │   ├── warc_writer.cpp
│   │   │   └── warc_writer.hpp
│   │   ├── tests/
│   │   └── Dockerfile
│   └── indexer/          # C++ indexer
│       ├── src/
│       ├── tests/
│       └── Dockerfile
├── python/
│   └── ranker/           # Python ranking service
│       ├── app.py        # Flask application
│       ├── engine.py     # BM25 ranking logic
│       ├── requirements.txt
│       └── Dockerfile
├── API/                  # Ruby on Rails interface
│   ├── app/
│   ├── config/
│   ├── Gemfile
│   └── README.md
├── data/
│   ├── init.sql          # Database initialization
│   └── crawled_pages/    # WARC storage
├── docker-compose.yml    # Service orchestration
├── .env.example          # Environment variables template
├── ARCHITECTURE.md       # Detailed architecture documentation
└── README.md             # This file

🚀 Getting Started

Prerequisites

  • Docker (version 20.10 or higher)
  • Docker Compose (version 2.0 or higher)
  • At least 4GB of available RAM
  • 10GB of free disk space

Installation

  1. Clone the repository

    git clone https://github.com/Digvijay-x1/Search-Engine.git
    cd Search-Engine
  2. Set up environment variables

    cp .env.example .env
  3. Edit .env file with your configuration

    DB_USER=admin
    DB_PASS=your_secure_password
    DB_NAME=search_engine

    ⚠️ Important: Change the default password in production!

  4. Build and start all services

    docker-compose up --build

    This builds and starts all services defined in docker-compose.yml: the crawler, indexer, ranker, Rails interface, PostgreSQL, and Redis.
Configuration

The project uses environment variables for configuration. Key variables include:

  • DB_USER: Database username
  • DB_PASS: Database password
  • DB_NAME: Database name
  • DB_HOST: Database host (defaults to postgres_service in Docker)
  • FLASK_ENV: Flask environment (development/production)
  • ROCKSDB_PATH: Path to RocksDB index files
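Inside the Python ranker, these variables would typically be read once at startup. A minimal sketch using only the standard library (the fallback values here are illustrative assumptions, not the project's actual defaults):

```python
import os

# Read configuration from the environment, falling back to
# placeholder defaults. The fallbacks below are illustrative
# assumptions, not values defined by this project.
config = {
    "db_user": os.environ.get("DB_USER", "admin"),
    "db_pass": os.environ.get("DB_PASS", ""),
    "db_name": os.environ.get("DB_NAME", "search_engine"),
    "db_host": os.environ.get("DB_HOST", "postgres_service"),
    "flask_env": os.environ.get("FLASK_ENV", "development"),
    "rocksdb_path": os.environ.get("ROCKSDB_PATH", "/data/index"),
}
print(config["db_host"])
```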

📖 Usage

Accessing the Search Interface

Once all services are running, navigate to:

http://localhost:3000

Using the Search API

The Python ranker service exposes a REST API:

Health Check

curl http://localhost:5000/health

Search Query

curl "http://localhost:5000/search?q=your+search+query"

Response Format

{
  "query": "your search query",
  "results": [
    {
      "id": 1,
      "url": "https://example.com",
      "title": "Example Page",
      "snippet": "Relevant snippet from the page...",
      "score": 4.52
    }
  ],
  "meta": {
    "count": 10,
    "latency_ms": 23.45
  }
}
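A client might consume this response as follows. The sketch uses only the standard library and parses a payload in the documented shape (the values are the illustrative ones from the example above):

```python
import json

# Sample payload in the shape documented above (values illustrative).
payload = """
{
  "query": "your search query",
  "results": [
    {"id": 1, "url": "https://example.com", "title": "Example Page",
     "snippet": "Relevant snippet from the page...", "score": 4.52}
  ],
  "meta": {"count": 10, "latency_ms": 23.45}
}
"""

response = json.loads(payload)
# Results arrive already ranked by BM25 score, highest first.
top = response["results"][0]
print(f'{top["title"]} ({top["score"]}) -> {top["url"]}')
# → Example Page (4.52) -> https://example.com
```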

📚 API Documentation

Ranker Service API

GET /health

Check the health status of the ranker service.

Response:

{
  "status": "healthy",
  "service": "ranker"
}

GET /search

Execute a search query.

Query Parameters:

  • q (required): Search query string

Response:

  • query: The original search query
  • results: Array of ranked search results
    • id: Document ID
    • url: Page URL
    • title: Page title
    • snippet: Text preview
    • score: BM25 relevance score
  • meta: Metadata about the search
    • count: Number of results
    • latency_ms: Query processing time

🔧 Development

Running Individual Services

Start only the ranker service:

docker-compose up ranker_service postgres_service redis_service

Start only the crawler:

docker-compose up crawler_service redis_service postgres_service

Building Components Locally

Python Ranker:

cd python/ranker
pip install -r requirements.txt
python app.py

C++ Crawler:

cd cpp/crawler
mkdir build && cd build
cmake ../src
make
./crawler

Viewing Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f ranker_service
docker-compose logs -f crawler_service
docker-compose logs -f rails_interface

Database Access

Connect to PostgreSQL:

docker-compose exec postgres_service psql -U admin -d search_engine

View crawled documents:

SELECT id, url, title, status FROM documents LIMIT 10;

Stopping Services

# Stop all services
docker-compose down

# Stop and remove volumes (clears database)
docker-compose down -v

πŸ›οΈ Architecture Details

For comprehensive architecture documentation, see ARCHITECTURE.md.

Key Components

  1. Web Crawler (C++)

    • Implements URL frontier with Bloom filter for visited check
    • Respects robots.txt and rate limiting
    • Stores content in WARC format
    • Handles DNS caching and connection pooling
  2. Indexer (C++)

    • Tokenizes and processes HTML content
    • Builds inverted index in RocksDB
    • Calculates document statistics for BM25
    • Implements Porter2 stemming algorithm
  3. Ranker (Python)

    • BM25 (Okapi) ranking algorithm
    • Vectorized operations with NumPy
    • Memory-mapped index access
    • Redis caching for frequent queries
  4. Web Interface (Ruby on Rails)

    • Query orchestration
    • Result formatting and snippet generation
    • Cache management
    • User interface
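The ranker's vectorized scoring step can be sketched in a few lines of NumPy. This is a generic Okapi BM25 implementation, not the project's actual engine.py; the k1 and b values are the commonly used defaults, assumed here:

```python
import numpy as np

def bm25_scores(tf, doc_len, df, n_docs, k1=1.5, b=0.75):
    """Okapi BM25 for a single query term across all documents.

    tf      : array of the term's frequency in each document
    doc_len : array of document lengths (in tokens)
    df      : number of documents containing the term
    n_docs  : total number of documents in the index
    """
    # Smoothed inverse document frequency: rare terms score higher.
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: penalize matches in long documents.
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    return idf * tf * (k1 + 1.0) / (tf + norm)

# Three documents; the query term appears 3, 0, and 1 times.
tf = np.array([3.0, 0.0, 1.0])
doc_len = np.array([100.0, 80.0, 120.0])
scores = bm25_scores(tf, doc_len, df=2, n_docs=1000)
print(scores.argsort()[::-1])  # → [0 2 1]: documents ranked best-first
```

Per-term scores are summed over all query terms to produce the final ranking.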

Data Flow

  1. Crawling Phase (Offline):

    • Crawler fetches pages β†’ Stores in WARC files
    • Metadata saved to PostgreSQL
    • Jobs queued in Redis
  2. Indexing Phase (Offline):

    • Indexer reads WARC files
    • Extracts and tokenizes content
    • Updates inverted index in RocksDB
    • Updates document metadata
  3. Search Phase (Online):

    • User submits query via Rails interface
    • Rails checks Redis cache
    • If miss: Calls Python ranker API
    • Ranker queries RocksDB index
    • Returns ranked document IDs
    • Rails fetches metadata from PostgreSQL
    • Results displayed to user
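The online phase above is a classic cache-aside pattern. A minimal Python sketch, with a plain dict standing in for Redis and a stub for the Rails-to-ranker HTTP call (all names here are illustrative, not the project's actual code):

```python
def search(query, cache, call_ranker):
    """Cache-aside lookup: serve from cache, else rank and store.

    `cache` stands in for Redis; `call_ranker` stands in for the
    HTTP call the Rails interface makes to the Python ranker.
    """
    key = f"search:{query}"
    cached = cache.get(key)
    if cached is not None:           # cache hit: skip the ranker entirely
        return cached, True
    results = call_ranker(query)     # cache miss: rank, then remember
    cache[key] = results
    return results, False

cache = {}
ranker = lambda q: [{"id": 1, "score": 4.52}]
first, hit1 = search("bm25", cache, ranker)
second, hit2 = search("bm25", cache, ranker)
print(hit1, hit2)  # → False True
```

In production the cached entry would also carry a TTL so stale results expire.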

πŸ› Troubleshooting

Common Issues

Issue: Services won't start

# Check Docker is running
docker ps

# Check logs for errors
docker-compose logs

# Rebuild containers
docker-compose down
docker-compose up --build

Issue: Database connection errors

# Verify environment variables
cat .env

# Check PostgreSQL is running
docker-compose ps postgres_service

# Restart database service
docker-compose restart postgres_service

Issue: Port already in use

# Find process using port
lsof -i :3000
lsof -i :5000

# Kill the process or change port in docker-compose.yml

Issue: Out of memory

# Increase Docker memory limit in Docker Desktop settings
# Or reduce number of running services

Checking Service Health

# Check all running containers
docker-compose ps

# Test ranker API
curl http://localhost:5000/health

# Test Rails interface
curl http://localhost:3000

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Development Guidelines

  • Follow existing code style and conventions
  • Write meaningful commit messages
  • Add tests for new features
  • Update documentation as needed
  • Ensure Docker builds succeed

📄 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

  • Inspired by Google's original search engine architecture
  • Built with modern microservices best practices
  • Uses industry-standard algorithms (BM25, Porter2 Stemmer)

📞 Support

For questions and support, please open an issue on the repository.

Built with ❤️ by the openSearch Team
