hitesh103/spider-man

🕷️ Web Event Scraper/Spider

A sophisticated Node.js web scraping application designed to crawl event websites, analyze URL patterns, and categorize event-related pages for data collection and analysis.

🎯 Overview

This project is a specialized web crawler that:

  • Discovers all URLs on event websites
  • Analyzes and categorizes URLs into listing pages vs event detail pages
  • Identifies patterns in event website structures
  • Stores results in MongoDB for further analysis
  • Provides insights for event discovery platforms

Key Features

  • 🔍 Intelligent URL Analysis: Automatically categorizes URLs based on patterns
  • 🚀 Concurrent Crawling: Efficient parallel processing with configurable concurrency
  • 📊 Pattern Recognition: Identifies common event website structures
  • 💾 Data Persistence: MongoDB storage with structured schemas
  • 📝 Comprehensive Logging: Winston-based logging system
  • ⚙️ Configurable: Flexible configuration for different crawling scenarios

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Web Event Scraper                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Entry     │    │  Controller │    │   Services  │      │
│  │   Point     │───▶│  Layer      │───▶│   Layer     │      │
│  │ (index.js)  │    │             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                   │                   │             │
│         ▼                   ▼                   ▼             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Utils     │    │   Models    │    │   Config    │      │
│  │   Layer     │    │   Layer     │    │   Layer     │      │
│  │             │    │             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  MongoDB    │    │  Puppeteer  │    │   Winston   │      │
│  │  Database   │    │   Browser   │    │   Logger    │      │
│  │             │    │             │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
└─────────────────────────────────────────────────────────────────┘

Architecture Layers

1. Entry Point Layer

  • index.js: Main application entry point
  • Initializes scraping process
  • Handles top-level error management

2. Controller Layer

  • controllers/ScraperController.js: Orchestrates the entire scraping workflow
  • Manages service coordination
  • Handles data persistence

3. Service Layer

  • services/WebCrawler.js: Core crawling functionality
  • services/EventUrlAnalyzer.js: URL analysis and categorization
  • services/PageContentService.js: Page content retrieval

4. Model Layer

  • models/DomainDataModel.js: MongoDB schema definitions
  • Data structure definitions

5. Utility Layer

  • utils/ScraperLogger.js: Logging functionality
  • utils/ConcurrencyManager.js: Concurrency control
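The concurrency control utility can be pictured as a small promise pool. The sketch below is hypothetical — the actual `utils/ConcurrencyManager.js` may expose a different interface — but it illustrates the core idea of capping the number of in-flight tasks:

```javascript
// Hypothetical sketch of a concurrency manager: a minimal promise pool
// that never runs more than `maxConcurrency` tasks at once.
class ConcurrencyManager {
    constructor(maxConcurrency) {
        this.max = maxConcurrency;
        this.active = 0;
        this.queue = []; // resolvers for tasks waiting on a free slot
    }

    async run(task) {
        // If the pool is full, wait until a running task releases a slot.
        if (this.active >= this.max) {
            await new Promise(resolve => this.queue.push(resolve));
        }
        this.active++;
        try {
            return await task();
        } finally {
            this.active--;
            // Wake exactly one waiter per completed task.
            const next = this.queue.shift();
            if (next) next();
        }
    }
}
```

Because each completion wakes exactly one queued waiter, the number of active tasks never exceeds the configured maximum.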

6. Configuration Layer

  • config/scraper.config.js: Application configuration

🔄 Workflow Diagram

graph TD
    A[Start: index.js] --> B[Initialize MongoDB Connection]
    B --> C[Create Service Instances]
    C --> D[Start Web Crawling]
    D --> E[Process Domain URLs]
    E --> F[Extract URLs from Pages]
    F --> G[Validate URLs]
    G --> H[Add to URL Collection]
    H --> I{More URLs to Process?}
    I -->|Yes| E
    I -->|No| J[Analyze All URLs]
    J --> K[Categorize URLs]
    K --> L[Identify Patterns]
    L --> M[Save to MongoDB]
    M --> N[Return Results]
    N --> O[Close Connections]
    O --> P[End]
    
    style A fill:#e1f5fe
    style P fill:#e8f5e8
    style M fill:#fff3e0

Detailed Workflow Steps

1. Initialization Phase

// Entry point triggers the workflow
const result = await startScraping('https://indiarunning.com');

2. Service Setup

// Create service instances
const pageContentService = new PageContentService();
const webCrawler = new WebCrawler(pageContentService);
const eventUrlAnalyzer = new EventUrlAnalyzer();

3. Crawling Phase

// Crawl the domain and collect URLs
const urls = await webCrawler.crawl(domain);

4. Analysis Phase

// Analyze and categorize URLs
const analysis = await eventUrlAnalyzer.analyzeUrls(urls);

5. Data Persistence

// Save results to MongoDB
const domainData = new DomainDataModel({
    domain,
    urls: { all: urls, listing: analysis.listingPages, event: analysis.eventPages },
    patterns: analysis.patterns,
    metadata: { totalUrls: analysis.totalUrls, crawlDate: new Date() }
});
await domainData.save();

📊 Data Flow

graph LR
    A[Target Domain] --> B[WebCrawler]
    B --> C[PageContentService]
    C --> D[Puppeteer Browser]
    D --> E[HTML Content]
    E --> F[URL Extraction]
    F --> G[URL Collection]
    G --> H[EventUrlAnalyzer]
    H --> I[URL Categorization]
    I --> J[Pattern Analysis]
    J --> K[MongoDB Storage]
    
    subgraph "URL Types"
        L[Listing Pages]
        M[Event Detail Pages]
        N[Other Pages]
    end
    
    I --> L
    I --> M
    I --> N
    
    style A fill:#ffebee
    style K fill:#e8f5e8

🧩 Component Details

1. WebCrawler Service

Purpose: Core crawling engine that discovers URLs on websites

Key Features:

  • Depth-limited crawling (configurable)
  • Concurrent processing
  • URL validation and filtering
  • Duplicate prevention

Methods:

  • crawl(domain, maxDepth): Main crawling method
  • processUrl(url, depth, maxDepth): Process individual URLs
  • extractUrls(html, baseUrl): Extract URLs from HTML
  • isValidUrl(url, baseUrl): Validate URLs

2. EventUrlAnalyzer Service

Purpose: Analyzes and categorizes URLs into event-related patterns

URL Categories:

  • Listing Pages: /events, /calendar, /whats-on, etc.
  • Event Detail Pages: /event/, /show/, /concert/, etc.

Pattern Recognition:

// Listing page patterns
listingPatterns = [
    '/city/', '/distance/', '/events', '/calendar',
    '/whats-on', '/agenda', '/activities', '/programme'
    // ... 50+ patterns
];

// Event detail patterns
eventDetailPatterns = [
    '/e/', '/events/', '/event/', '/show/', '/gig/',
    '/concert/', '/performance/', '/exhibition/'
    // ... 30+ patterns
];

3. PageContentService

Purpose: Handles page content retrieval using Puppeteer

Features:

  • Browser instance management
  • Page content extraction
  • Error handling for failed requests

4. DomainDataModel

Purpose: MongoDB schema for storing crawl results

Schema Structure:

{
    domain: String,           // Target domain
    urls: {
        all: [String],        // All discovered URLs
        listing: [String],    // Listing page URLs
        event: [String]       // Event detail URLs
    },
    patterns: {
        listing: Mixed,       // Listing page patterns
        event: Mixed          // Event detail patterns
    },
    metadata: {
        totalUrls: Number,    // Total URLs found
        crawlDate: Date       // Crawl timestamp
    }
}
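The structure above maps directly onto a Mongoose schema. The following is a minimal sketch — field names follow the README, but the actual `models/DomainDataModel.js` may declare extra options, defaults, or indexes:

```javascript
// Hypothetical Mongoose schema matching the structure documented above.
const mongoose = require('mongoose');

const domainDataSchema = new mongoose.Schema({
    domain: String,              // target domain
    urls: {
        all: [String],           // all discovered URLs
        listing: [String],       // listing page URLs
        event: [String]          // event detail URLs
    },
    patterns: {
        listing: mongoose.Schema.Types.Mixed, // listing page patterns
        event: mongoose.Schema.Types.Mixed    // event detail patterns
    },
    metadata: {
        totalUrls: Number,       // total URLs found
        crawlDate: Date          // crawl timestamp
    }
});

module.exports = mongoose.model('DomainData', domainDataSchema);
```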

🚀 Installation & Setup

Prerequisites

  • Node.js (v14 or higher)
  • MongoDB (local or cloud instance)
  • npm or yarn

Installation Steps

  1. Clone the repository

git clone <repository-url>
cd spider-main

  2. Install dependencies

npm install

  3. Configure MongoDB

# Set MongoDB URI in environment variable
export MONGODB_URI="mongodb+srv://username:password@cluster.mongodb.net/spider"

# Or update config/scraper.config.js directly

  4. Verify installation

node index.js

📖 Usage

Basic Usage

const { startScraping } = require('./controllers/ScraperController');

(async () => {
    try {
        const result = await startScraping('https://example-event-site.com');
        console.log('Scraping completed:', result);
    } catch (error) {
        console.error('Scraping failed:', error);
    }
})();

Test Different Domains

// Test file: test.js
const urls = await webCrawler.crawl("https://choosechicago.com");
console.log(urls);

Expected Output

{
    success: true,
    urlsFound: 150,
    listingPages: 25,
    eventPages: 75
}

⚙️ Configuration

MongoDB Configuration

// config/scraper.config.js
mongodb: {
    uri: process.env.MONGODB_URI || 'mongodb://localhost:27017/spider'
}

Puppeteer Configuration

puppeteer: {
    headless: false,  // Set to true for production
    args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-gpu',
        '--disable-dev-shm-usage'
    ]
}

Crawling Configuration

crawling: {
    maxDepth: 1,        // How deep to crawl
    maxConcurrency: 1,  // Concurrent requests
    waitTime: 10000,    // Wait between requests (ms)
    timeout: 100000     // Request timeout (ms)
}

📚 API Reference

ScraperController

startScraping(domain)

Initiates the scraping process for a given domain.

Parameters:

  • domain (string): The target domain to crawl

Returns:

{
    success: boolean,
    urlsFound: number,
    listingPages: number,
    eventPages: number
}

WebCrawler

crawl(domain, maxDepth)

Crawls a domain and returns all discovered URLs.

Parameters:

  • domain (string): Target domain
  • maxDepth (number): Maximum crawl depth

Returns: Array of discovered URLs

EventUrlAnalyzer

analyzeUrls(urls)

Analyzes an array of URLs and categorizes them.

Parameters:

  • urls (Array): Array of URLs to analyze

Returns:

{
    listingPages: Array,
    eventPages: Array,
    patterns: Object,
    totalUrls: number
}

🔧 Troubleshooting

Common Issues

1. MongoDB Connection Failed

# Check MongoDB URI
echo $MONGODB_URI

# Verify network connectivity
ping your-mongodb-cluster

2. Puppeteer Launch Issues

// Add these args to puppeteer config
args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-gpu',
    '--disable-dev-shm-usage',
    '--disable-web-security',
    '--disable-features=VizDisplayCompositor'
]

3. Memory Issues

// Reduce concurrency in config
crawling: {
    maxConcurrency: 1,  // Reduce from higher values
    maxDepth: 1         // Reduce crawl depth
}

4. Rate Limiting

// Increase wait time between requests
crawling: {
    waitTime: 15000  // Increase from 10000
}

Debug Mode

// Enable detailed logging
const ScraperLogger = require('./utils/ScraperLogger');
ScraperLogger.setLevel('debug');

📈 Performance Optimization

1. Concurrency Tuning

// Adjust based on target server capacity
crawling: {
    maxConcurrency: 2,  // Increase for faster crawling
    waitTime: 5000      // Decrease for faster crawling
}

2. Memory Management

// Close browser instances properly
await pageContentService.close();
await mongoose.connection.close();

3. URL Filtering

// Add more exclusion patterns
urlPatterns: {
    excluded: ['.pdf', '.jpg', '.png', '.gif', '.css', '.js', '.xml']
}
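The exclusion config above implies a filter along these lines (the helper name is hypothetical; the real crawler may wire this in differently):

```javascript
// Hypothetical exclusion filter built from the config's extension list.
// Matching on the URL's pathname means query strings like ?v=1 do not
// defeat the check.
const excluded = ['.pdf', '.jpg', '.png', '.gif', '.css', '.js', '.xml'];

function isExcluded(url) {
    const path = new URL(url).pathname.toLowerCase();
    return excluded.some(ext => path.endsWith(ext));
}
```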

🔮 Future Enhancements

Planned Features

  • Distributed Crawling: Support for multiple crawler instances
  • Advanced Pattern Learning: ML-based pattern recognition
  • Real-time Monitoring: Web dashboard for crawl progress
  • API Endpoints: REST API for triggering crawls
  • Export Formats: CSV, JSON, XML export options
  • Scheduling: Automated crawl scheduling
  • Proxy Support: Rotating proxy support for large-scale crawling

Architecture Improvements

  • Microservices: Split into separate services
  • Message Queues: Redis/RabbitMQ for job distribution
  • Caching: Redis caching for frequently accessed data
  • Monitoring: Prometheus/Grafana integration

📞 Support

For issues and questions:

  • Review the logs in the logs/ directory

Happy Crawling! 🕷️

About

Central crawler and worker system for scraping, normalizing, and ingesting data at scale.
