hitesh103/spider-man

🕷️ Web Event Scraper/Spider

A sophisticated Node.js web scraping application designed to crawl event websites, analyze URL patterns, and categorize event-related pages for data collection and analysis.

🎯 Overview

This project is a specialized web crawler that:

  • Discovers all URLs on event websites
  • Analyzes and categorizes URLs into listing pages vs event detail pages
  • Identifies patterns in event website structures
  • Stores results in MongoDB for further analysis
  • Provides insights for event discovery platforms

Key Features

  • 🔍 Intelligent URL Analysis: Automatically categorizes URLs based on patterns
  • 🚀 Concurrent Crawling: Efficient parallel processing with configurable concurrency
  • 📊 Pattern Recognition: Identifies common event website structures
  • 💾 Data Persistence: MongoDB storage with structured schemas
  • 📝 Comprehensive Logging: Winston-based logging system
  • ⚙️ Configurable: Flexible configuration for different crawling scenarios

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Web Event Scraper                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Entry     │    │  Controller │    │   Services  │      │
│  │   Point     │───▶│  Layer      │───▶│   Layer     │      │
│  │ (index.js)  │    │             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                   │                   │             │
│         ▼                   ▼                   ▼             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Utils     │    │   Models    │    │   Config    │      │
│  │   Layer     │    │   Layer     │    │   Layer     │      │
│  │             │    │             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  MongoDB    │    │  Puppeteer  │    │   Winston   │      │
│  │  Database   │    │   Browser   │    │   Logger    │      │
│  │             │    │             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
└─────────────────────────────────────────────────────────────────┘

Architecture Layers

1. Entry Point Layer

  • index.js: Main application entry point
  • Initializes scraping process
  • Handles top-level error management

2. Controller Layer

  • controllers/ScraperController.js: Orchestrates the entire scraping workflow
  • Manages service coordination
  • Handles data persistence

3. Service Layer

  • services/WebCrawler.js: Core crawling functionality
  • services/EventUrlAnalyzer.js: URL analysis and categorization
  • services/PageContentService.js: Page content retrieval

4. Model Layer

  • models/DomainDataModel.js: MongoDB schema definitions
  • Data structure definitions

5. Utility Layer

  • utils/ScraperLogger.js: Logging functionality
  • utils/ConcurrencyManager.js: Concurrency control
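The concurrency control utility can be pictured as a small promise pool. The sketch below is hypothetical — the actual `utils/ConcurrencyManager.js` may expose a different interface — but it illustrates the core idea of capping the number of in-flight tasks:

```javascript
// Hypothetical sketch of a concurrency manager: a minimal promise pool
// that never runs more than `maxConcurrency` tasks at once.
class ConcurrencyManager {
    constructor(maxConcurrency) {
        this.max = maxConcurrency;
        this.active = 0;
        this.queue = []; // resolvers for tasks waiting on a free slot
    }

    async run(task) {
        // If the pool is full, wait until a running task releases a slot.
        if (this.active >= this.max) {
            await new Promise(resolve => this.queue.push(resolve));
        }
        this.active++;
        try {
            return await task();
        } finally {
            this.active--;
            // Wake exactly one waiter per completed task.
            const next = this.queue.shift();
            if (next) next();
        }
    }
}
```

Because each completion wakes exactly one queued waiter, the number of active tasks never exceeds the configured maximum.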

6. Configuration Layer

  • config/scraper.config.js: Application configuration

🔄 Workflow Diagram

graph TD
    A[Start: index.js] --> B[Initialize MongoDB Connection]
    B --> C[Create Service Instances]
    C --> D[Start Web Crawling]
    D --> E[Process Domain URLs]
    E --> F[Extract URLs from Pages]
    F --> G[Validate URLs]
    G --> H[Add to URL Collection]
    H --> I{More URLs to Process?}
    I -->|Yes| E
    I -->|No| J[Analyze All URLs]
    J --> K[Categorize URLs]
    K --> L[Identify Patterns]
    L --> M[Save to MongoDB]
    M --> N[Return Results]
    N --> O[Close Connections]
    O --> P[End]
    
    style A fill:#e1f5fe
    style P fill:#e8f5e8
    style M fill:#fff3e0

Detailed Workflow Steps

1. Initialization Phase

// Entry point triggers the workflow
const result = await startScraping('https://indiarunning.com');

2. Service Setup

// Create service instances
const pageContentService = new PageContentService();
const webCrawler = new WebCrawler(pageContentService);
const eventUrlAnalyzer = new EventUrlAnalyzer();

3. Crawling Phase

// Crawl the domain and collect URLs
const urls = await webCrawler.crawl(domain);

4. Analysis Phase

// Analyze and categorize URLs
const analysis = await eventUrlAnalyzer.analyzeUrls(urls);

5. Data Persistence

// Save results to MongoDB
const domainData = new DomainDataModel({
    domain,
    urls: { all: urls, listing: analysis.listingPages, event: analysis.eventPages },
    patterns: analysis.patterns,
    metadata: { totalUrls: analysis.totalUrls, crawlDate: new Date() }
});
await domainData.save();

📊 Data Flow

graph LR
    A[Target Domain] --> B[WebCrawler]
    B --> C[PageContentService]
    C --> D[Puppeteer Browser]
    D --> E[HTML Content]
    E --> F[URL Extraction]
    F --> G[URL Collection]
    G --> H[EventUrlAnalyzer]
    H --> I[URL Categorization]
    I --> J[Pattern Analysis]
    J --> K[MongoDB Storage]
    
    subgraph "URL Types"
        L[Listing Pages]
        M[Event Detail Pages]
        N[Other Pages]
    end
    
    I --> L
    I --> M
    I --> N
    
    style A fill:#ffebee
    style K fill:#e8f5e8

🧩 Component Details

1. WebCrawler Service

Purpose: Core crawling engine that discovers URLs on websites

Key Features:

  • Depth-limited crawling (configurable)
  • Concurrent processing
  • URL validation and filtering
  • Duplicate prevention

Methods:

  • crawl(domain, maxDepth): Main crawling method
  • processUrl(url, depth, maxDepth): Process individual URLs
  • extractUrls(html, baseUrl): Extract URLs from HTML
  • isValidUrl(url, baseUrl): Validate URLs

2. EventUrlAnalyzer Service

Purpose: Analyzes and categorizes URLs into event-related patterns

URL Categories:

  • Listing Pages: /events, /calendar, /whats-on, etc.
  • Event Detail Pages: /event/, /show/, /concert/, etc.

Pattern Recognition:

// Listing page patterns
listingPatterns = [
    '/city/', '/distance/', '/events', '/calendar',
    '/whats-on', '/agenda', '/activities', '/programme'
    // ... 50+ patterns
];

// Event detail patterns
eventDetailPatterns = [
    '/e/', '/events/', '/event/', '/show/', '/gig/',
    '/concert/', '/performance/', '/exhibition/'
    // ... 30+ patterns
];

3. PageContentService

Purpose: Handles page content retrieval using Puppeteer

Features:

  • Browser instance management
  • Page content extraction
  • Error handling for failed requests

4. DomainDataModel

Purpose: MongoDB schema for storing crawl results

Schema Structure:

{
    domain: String,           // Target domain
    urls: {
        all: [String],        // All discovered URLs
        listing: [String],    // Listing page URLs
        event: [String]       // Event detail URLs
    },
    patterns: {
        listing: Mixed,       // Listing page patterns
        event: Mixed          // Event detail patterns
    },
    metadata: {
        totalUrls: Number,    // Total URLs found
        crawlDate: Date       // Crawl timestamp
    }
}
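The structure above maps directly onto a Mongoose schema. The following is a minimal sketch — field names follow the README, but the actual `models/DomainDataModel.js` may declare extra options, defaults, or indexes:

```javascript
// Hypothetical Mongoose schema matching the structure documented above.
const mongoose = require('mongoose');

const domainDataSchema = new mongoose.Schema({
    domain: String,              // target domain
    urls: {
        all: [String],           // all discovered URLs
        listing: [String],       // listing page URLs
        event: [String]          // event detail URLs
    },
    patterns: {
        listing: mongoose.Schema.Types.Mixed, // listing page patterns
        event: mongoose.Schema.Types.Mixed    // event detail patterns
    },
    metadata: {
        totalUrls: Number,       // total URLs found
        crawlDate: Date          // crawl timestamp
    }
});

module.exports = mongoose.model('DomainData', domainDataSchema);
```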

🚀 Installation & Setup

Prerequisites

  • Node.js (v14 or higher)
  • MongoDB (local or cloud instance)
  • npm or yarn

Installation Steps

  1. Clone the repository

git clone <repository-url>
cd spider-main

  2. Install dependencies

npm install

  3. Configure MongoDB

# Set MongoDB URI in environment variable
export MONGODB_URI="mongodb+srv://username:password@cluster.mongodb.net/spider"

# Or update config/scraper.config.js directly

  4. Verify installation

node index.js

📖 Usage

Basic Usage

const { startScraping } = require('./controllers/ScraperController');

(async () => {
    try {
        const result = await startScraping('https://example-event-site.com');
        console.log('Scraping completed:', result);
    } catch (error) {
        console.error('Scraping failed:', error);
    }
})();

Test Different Domains

// Test file: test.js
const urls = await webCrawler.crawl("https://choosechicago.com");
console.log(urls);

Expected Output

{
    success: true,
    urlsFound: 150,
    listingPages: 25,
    eventPages: 75
}

⚙️ Configuration

MongoDB Configuration

// config/scraper.config.js
mongodb: {
    uri: process.env.MONGODB_URI || 'mongodb://localhost:27017/spider'
}

Puppeteer Configuration

puppeteer: {
    headless: false,  // Set to true for production
    args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-gpu',
        '--disable-dev-shm-usage'
    ]
}

Crawling Configuration

crawling: {
    maxDepth: 1,        // How deep to crawl
    maxConcurrency: 1,  // Concurrent requests
    waitTime: 10000,    // Wait between requests (ms)
    timeout: 100000     // Request timeout (ms)
}

📚 API Reference

ScraperController

startScraping(domain)

Initiates the scraping process for a given domain.

Parameters:

  • domain (string): The target domain to crawl

Returns:

{
    success: boolean,
    urlsFound: number,
    listingPages: number,
    eventPages: number
}

WebCrawler

crawl(domain, maxDepth)

Crawls a domain and returns all discovered URLs.

Parameters:

  • domain (string): Target domain
  • maxDepth (number): Maximum crawl depth

Returns: Array of discovered URLs

EventUrlAnalyzer

analyzeUrls(urls)

Analyzes an array of URLs and categorizes them.

Parameters:

  • urls (Array): Array of URLs to analyze

Returns:

{
    listingPages: Array,
    eventPages: Array,
    patterns: Object,
    totalUrls: number
}

🔧 Troubleshooting

Common Issues

1. MongoDB Connection Failed

# Check MongoDB URI
echo $MONGODB_URI

# Verify network connectivity
ping your-mongodb-cluster

2. Puppeteer Launch Issues

// Add these args to puppeteer config
args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-gpu',
    '--disable-dev-shm-usage',
    '--disable-web-security',
    '--disable-features=VizDisplayCompositor'
]

3. Memory Issues

// Reduce concurrency in config
crawling: {
    maxConcurrency: 1,  // Reduce from higher values
    maxDepth: 1         // Reduce crawl depth
}

4. Rate Limiting

// Increase wait time between requests
crawling: {
    waitTime: 15000  // Increase from 10000
}

Debug Mode

// Enable detailed logging
const ScraperLogger = require('./utils/ScraperLogger');
ScraperLogger.setLevel('debug');

📈 Performance Optimization

1. Concurrency Tuning

// Adjust based on target server capacity
crawling: {
    maxConcurrency: 2,  // Increase for faster crawling
    waitTime: 5000      // Decrease for faster crawling
}

2. Memory Management

// Close browser instances properly
await pageContentService.close();
await mongoose.connection.close();

3. URL Filtering

// Add more exclusion patterns
urlPatterns: {
    excluded: ['.pdf', '.jpg', '.png', '.gif', '.css', '.js', '.xml']
}
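The exclusion config above implies a filter along these lines (the helper name is hypothetical; the real crawler may wire this in differently):

```javascript
// Hypothetical exclusion filter built from the config's extension list.
// Matching on the URL's pathname means query strings like ?v=1 do not
// defeat the check.
const excluded = ['.pdf', '.jpg', '.png', '.gif', '.css', '.js', '.xml'];

function isExcluded(url) {
    const path = new URL(url).pathname.toLowerCase();
    return excluded.some(ext => path.endsWith(ext));
}
```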

🔮 Future Enhancements

Planned Features

  • Distributed Crawling: Support for multiple crawler instances
  • Advanced Pattern Learning: ML-based pattern recognition
  • Real-time Monitoring: Web dashboard for crawl progress
  • API Endpoints: REST API for triggering crawls
  • Export Formats: CSV, JSON, XML export options
  • Scheduling: Automated crawl scheduling
  • Proxy Support: Rotating proxy support for large-scale crawling

Architecture Improvements

  • Microservices: Split into separate services
  • Message Queues: Redis/RabbitMQ for job distribution
  • Caching: Redis caching for frequently accessed data
  • Monitoring: Prometheus/Grafana integration

📞 Support

For issues and questions:

  • Review the logs in the logs/ directory

Happy Crawling! 🕷️

About

Central crawler and worker system for scraping, normalizing, and ingesting data at scale.
