Scalable, high-performance web crawler built for fast, fault-tolerant crawling, parsing, and full-text search at scale.


🌐 Distributed Web Crawler

🚀 Overview

This is a high-performance distributed web crawler built using Java, Spring Boot, Kafka, Redis, and Elasticsearch. It efficiently fetches, parses, deduplicates, and indexes web pages for fast searching.

🔥 Features

✅ Microservices Architecture – Independently scalable services
✅ Asynchronous Processing – Uses Kafka & Redis for efficiency
✅ Elasticsearch Integration – Enables full-text search on crawled pages
✅ Scalability & Fault Tolerance – Designed to handle large-scale crawling
✅ Deduplication System – Prevents duplicate URL processing

🏗️ Architecture

   [Internet] → [Crawler] → [Parser] → [URL Extractor] → [Deduplicator] → [Queue] → [Crawlers]
                               ↓
                [Elasticsearch (Search Index)]
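
The fetch and parse stages are where raw HTML becomes searchable text plus new links to crawl. As a concrete illustration, here is a minimal fetch-and-parse sketch using Jsoup; the library choice and all names here are assumptions for illustration, not the repo's actual code.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the Crawler → Parser stages; not the repo's actual code.
public class FetchAndParseSketch {

    // A fetched page reduced to the fields the indexer and extractor need.
    record ParsedPage(String url, String title, String text, List<String> outLinks) {}

    static ParsedPage fetchAndParse(String url) throws Exception {
        // Crawler stage: download the raw HTML.
        Document doc = Jsoup.connect(url)
                .userAgent("webcrawler-demo")
                .timeout(10_000)
                .get();

        // Parser / URL-extractor stage: pull out text and outgoing links.
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.absUrl("href")); // resolves relative URLs against the page
        }
        return new ParsedPage(url, doc.title(), doc.body().text(), links);
    }
}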

🔹 Service Responsibilities

  • Crawler Service – Fetches web pages from the internet.
  • HTML Parser Service – Extracts text & metadata from raw HTML.
  • URL Extractor Service – Extracts and sends unique links for crawling.
  • Deduplicator Service – Removes duplicate URLs using Redis/Bloom filters (see the sketch after this list).
  • URL Queue Service – Manages the queue of URLs to be crawled.
  • Search Indexer (Elasticsearch) – Enables fast keyword-based searching.
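
As referenced in the Deduplicator item, here is a minimal exact-deduplication sketch using a Redis set via the Jedis client; the key name and client choice are assumptions. A plain set is exact but grows with the crawl, while a Bloom filter (e.g. RedisBloom) bounds memory at the cost of a small false-positive rate.

import redis.clients.jedis.JedisPooled;

// Illustrative Redis-backed deduplicator; the key name "seen:urls" is made up.
public class UrlDeduplicator {

    private final JedisPooled redis = new JedisPooled("localhost", 6379);

    /** Returns true the first time a URL is seen, false for duplicates. */
    public boolean markIfNew(String url) {
        // SADD returns 1 only when the member is newly added, making this an
        // atomic check-and-set: two concurrent workers cannot both claim a URL.
        return redis.sadd("seen:urls", url) == 1;
    }
}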

⚙️ Tech Stack

  • Backend: Java, Spring Boot
  • Message Queue: Kafka / RabbitMQ
  • Database: Redis (Deduplication), PostgreSQL (Metadata)
  • Search Engine: Elasticsearch
  • Containerization: Docker, Kubernetes (for deployment)

🛠️ Setup & Installation

1️⃣ Prerequisites

  • Java 17+
  • Docker & Docker Compose
  • Kafka & Zookeeper
  • Elasticsearch & Redis

2️⃣ Clone the Repository

git clone https://github.com/azimsh3r/webcrawler.git
cd webcrawler

3️⃣ Run Services with Docker Compose

docker-compose up -d

4️⃣ Start Each Microservice

Run each service in its own terminal, starting from the repository root:

cd crawler-service && mvn spring-boot:run
cd parser-service && mvn spring-boot:run
cd extractor-service && mvn spring-boot:run
...

🚀 Usage

1️⃣ Add a Seed URL to Start Crawling

curl -X POST http://localhost:<url-service-port> \
  -H "Content-Type: application/json" \
  -d '["https://example.com"]'
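
The port and endpoint path depend on how the URL queue service is configured; check that service's Spring configuration. As a hypothetical sketch, a Spring controller that accepts a JSON array of seed URLs and publishes them to Kafka might look like this (the /urls path and the urls.to-crawl topic are invented for illustration):

import java.util.List;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical seed endpoint; the real service's path and topic may differ.
@RestController
public class SeedUrlController {

    private final KafkaTemplate<String, String> kafka;

    public SeedUrlController(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    @PostMapping("/urls")
    public void addSeeds(@RequestBody List<String> urls) {
        // Each seed URL becomes one message on the crawl topic.
        urls.forEach(url -> kafka.send("urls.to-crawl", url));
    }
}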

2️⃣ Search Indexed Pages

curl -X GET 'http://localhost:9200/webpages/_search?q=your_keyword'
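
The same query can be issued from Java with the official Elasticsearch client. The index name matches the curl example above; the "content" field is an assumption about the document mapping.

import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.rest_client.RestClientTransport;

// Illustrative full-text query; assumes documents have a "content" field.
public class SearchSketch {

    public static void main(String[] args) throws Exception {
        RestClient rest = RestClient.builder(new HttpHost("localhost", 9200)).build();
        ElasticsearchClient client =
                new ElasticsearchClient(new RestClientTransport(rest, new JacksonJsonpMapper()));

        SearchResponse<Map> resp = client.search(s -> s
                        .index("webpages")
                        .query(q -> q.match(m -> m.field("content").query("your_keyword"))),
                Map.class);

        resp.hits().hits().forEach(hit -> System.out.println(hit.source()));
        rest.close();
    }
}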

📊 Performance Optimizations

  • Asynchronous processing with Kafka for handling high throughput
  • Redis caching for fast deduplication
  • Parallel crawling with multiple workers
  • Batch indexing in Elasticsearch for optimized search performance (see the sketch below)
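
For the batch-indexing point, here is a sketch using the Elasticsearch Java client's bulk API; the document shape, batch contents, and per-URL id are assumptions. One bulk call replaces hundreds of single-document HTTP round-trips, which is where the indexing speedup comes from.

import java.util.List;
import java.util.Map;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;

// Illustrative bulk indexing: one HTTP round-trip per batch instead of per page.
public class BatchIndexerSketch {

    static void indexBatch(ElasticsearchClient client,
                           List<Map<String, Object>> pages) throws Exception {
        BulkRequest.Builder bulk = new BulkRequest.Builder();
        for (Map<String, Object> page : pages) {
            bulk.operations(op -> op.index(idx -> idx
                    .index("webpages")
                    .id((String) page.get("url")) // re-crawls overwrite, not duplicate
                    .document(page)));
        }
        BulkResponse resp = client.bulk(bulk.build());
        if (resp.errors()) {
            System.err.println("Some documents failed to index");
        }
    }
}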

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


💡 Contributions are welcome! Feel free to submit PRs or issues. Happy Crawling! 🚀
