A sophisticated Node.js web scraping application designed to crawl event websites, analyze URL patterns, and categorize event-related pages for data collection and analysis.
- Overview
- System Architecture
- Workflow Diagram
- Data Flow
- Component Details
- Installation & Setup
- Usage
- Configuration
- API Reference
- Troubleshooting
This project is a specialized web crawler that:
- Discovers all URLs on event websites
- Analyzes and categorizes URLs into listing pages vs event detail pages
- Identifies patterns in event website structures
- Stores results in MongoDB for further analysis
- Provides insights for event discovery platforms
- 🔍 Intelligent URL Analysis: Automatically categorizes URLs based on patterns
- 🚀 Concurrent Crawling: Efficient parallel processing with configurable concurrency
- 📊 Pattern Recognition: Identifies common event website structures
- 💾 Data Persistence: MongoDB storage with structured schemas
- 📝 Comprehensive Logging: Winston-based logging system
- ⚙️ Configurable: Flexible configuration for different crawling scenarios
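The concurrency control lives in `utils/ConcurrencyManager.js`; as a hedged sketch (illustrative names, not the project's actual implementation), the idea is a simple promise pool that caps how many crawl tasks run at once:

```javascript
// Minimal promise-pool sketch: run async tasks with a concurrency cap.
// Illustrates the idea behind ConcurrencyManager; names are illustrative.
async function runWithConcurrency(tasks, maxConcurrency) {
  const results = [];
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next task (synchronous, so no race)
      results[index] = await tasks[index]();
    }
  }

  // Start up to maxConcurrency workers and wait for all of them to drain.
  const workers = Array.from(
    { length: Math.min(maxConcurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With `maxConcurrency: 1` (the default in `config/scraper.config.js`) this degenerates to sequential crawling.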
```
┌─────────────────────────────────────────────────────────────────┐
│                        Web Event Scraper                        │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Entry     │    │ Controller  │    │  Services   │          │
│  │   Point     │───▶│   Layer     │───▶│   Layer     │          │
│  │ (index.js)  │    │             │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Utils     │    │   Models    │    │   Config    │          │
│  │   Layer     │    │   Layer     │    │   Layer     │          │
│  │             │    │             │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │  MongoDB    │    │  Puppeteer  │    │  Winston    │          │
│  │  Database   │    │  Browser    │    │  Logger     │          │
│  │             │    │             │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
└─────────────────────────────────────────────────────────────────┘
```
- `index.js`: Main application entry point
  - Initializes the scraping process
  - Handles top-level error management
- `controllers/ScraperController.js`: Orchestrates the entire scraping workflow
  - Manages service coordination
  - Handles data persistence
- `services/WebCrawler.js`: Core crawling functionality
- `services/EventUrlAnalyzer.js`: URL analysis and categorization
- `services/PageContentService.js`: Page content retrieval
- `models/DomainDataModel.js`: MongoDB schema and data structure definitions
- `utils/ScraperLogger.js`: Logging functionality
- `utils/ConcurrencyManager.js`: Concurrency control
- `config/scraper.config.js`: Application configuration
```mermaid
graph TD
    A[Start: index.js] --> B[Initialize MongoDB Connection]
    B --> C[Create Service Instances]
    C --> D[Start Web Crawling]
    D --> E[Process Domain URLs]
    E --> F[Extract URLs from Pages]
    F --> G[Validate URLs]
    G --> H[Add to URL Collection]
    H --> I{More URLs to Process?}
    I -->|Yes| E
    I -->|No| J[Analyze All URLs]
    J --> K[Categorize URLs]
    K --> L[Identify Patterns]
    L --> M[Save to MongoDB]
    M --> N[Return Results]
    N --> O[Close Connections]
    O --> P[End]

    style A fill:#e1f5fe
    style P fill:#e8f5e8
    style M fill:#fff3e0
```
```javascript
// Entry point triggers the workflow
const result = await startScraping('https://indiarunning.com');

// Create service instances
const pageContentService = new PageContentService();
const webCrawler = new WebCrawler(pageContentService);
const eventUrlAnalyzer = new EventUrlAnalyzer();

// Crawl the domain and collect URLs
const urls = await webCrawler.crawl(domain);

// Analyze and categorize URLs
const analysis = await eventUrlAnalyzer.analyzeUrls(urls);

// Save results to MongoDB
const domainData = new DomainDataModel({
  domain,
  urls: { all: urls, listing: analysis.listingPages, event: analysis.eventPages },
  patterns: analysis.patterns,
  metadata: { totalUrls: analysis.totalUrls, crawlDate: new Date() }
});
await domainData.save();
```

```mermaid
graph LR
    A[Target Domain] --> B[WebCrawler]
    B --> C[PageContentService]
    C --> D[Puppeteer Browser]
    D --> E[HTML Content]
    E --> F[URL Extraction]
    F --> G[URL Collection]
    G --> H[EventUrlAnalyzer]
    H --> I[URL Categorization]
    I --> J[Pattern Analysis]
    J --> K[MongoDB Storage]

    subgraph "URL Types"
        L[Listing Pages]
        M[Event Detail Pages]
        N[Other Pages]
    end

    I --> L
    I --> M
    I --> N

    style A fill:#ffebee
    style K fill:#e8f5e8
```
Purpose: Core crawling engine that discovers URLs on websites
Key Features:
- Depth-limited crawling (configurable)
- Concurrent processing
- URL validation and filtering
- Duplicate prevention
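Depth-limited crawling with duplicate prevention boils down to a visited set plus a breadth-first queue. A minimal synchronous sketch (illustrative only; the real crawler is async and fetches pages via PageContentService, and `fetchLinks` is a hypothetical stand-in):

```javascript
// Sketch of depth-limited, duplicate-free traversal.
// fetchLinks stands in for the real page-fetch + URL-extraction step.
function crawlSketch(start, maxDepth, fetchLinks) {
  const visited = new Set();                 // duplicate prevention
  const queue = [{ url: start, depth: 0 }];  // breadth-first frontier

  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url) || depth > maxDepth) continue; // depth limiting
    visited.add(url);
    for (const link of fetchLinks(url)) {
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return [...visited];
}
```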
Methods:
- `crawl(domain, maxDepth)`: Main crawling method
- `processUrl(url, depth, maxDepth)`: Process individual URLs
- `extractUrls(html, baseUrl)`: Extract URLs from HTML
- `isValidUrl(url, baseUrl)`: Validate URLs
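A plausible shape for `isValidUrl`, built on Node's built-in `URL` class (an assumption about the approach, not the project's exact code): keep same-host http(s) links and drop common asset files.

```javascript
// Hedged sketch of URL validation: same-host http(s) links only,
// skipping common asset extensions. Not the project's exact code.
const EXCLUDED_EXTENSIONS = ['.pdf', '.jpg', '.png', '.gif', '.css', '.js', '.xml'];

function isValidUrl(url, baseUrl) {
  let parsed;
  try {
    parsed = new URL(url, baseUrl); // resolves relative links against the base
  } catch {
    return false; // malformed URL
  }
  if (!['http:', 'https:'].includes(parsed.protocol)) return false;
  if (parsed.hostname !== new URL(baseUrl).hostname) return false; // stay on-domain
  return !EXCLUDED_EXTENSIONS.some(ext => parsed.pathname.toLowerCase().endsWith(ext));
}
```

For example, `isValidUrl('/events/run-2024', 'https://example.com')` passes, while a `mailto:` link, an off-domain URL, or an image path does not.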
Purpose: Analyzes and categorizes URLs into event-related patterns
URL Categories:
- Listing Pages: `/events`, `/calendar`, `/whats-on`, etc.
- Event Detail Pages: `/event/`, `/show/`, `/concert/`, etc.
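A hedged sketch of how this substring-based categorization might work (illustrative helper, not the project's actual `EventUrlAnalyzer` code; pattern lists abbreviated):

```javascript
// Illustrative categorizer: a URL whose path contains an event-detail
// fragment is a detail page; otherwise a listing fragment marks a listing.
// Pattern lists are abbreviated; the real analyzer uses many more.
const listingPatterns = ['/events', '/calendar', '/whats-on'];
const eventDetailPatterns = ['/event/', '/show/', '/concert/'];

function categorizeUrl(url) {
  const path = new URL(url).pathname.toLowerCase();
  // Check detail patterns first: '/event/abc' also contains '/event',
  // so detail matching must take precedence over listing matching.
  if (eventDetailPatterns.some(p => path.includes(p))) return 'event';
  if (listingPatterns.some(p => path.includes(p))) return 'listing';
  return 'other';
}
```

So `/events` categorizes as a listing page, `/event/marathon-2024` as an event detail page, and `/about` as neither.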
Pattern Recognition:
```javascript
// Listing page patterns
listingPatterns = [
  '/city/', '/distance/', '/events', '/calendar',
  '/whats-on', '/agenda', '/activities', '/programme'
  // ... 50+ patterns
];

// Event detail patterns
eventDetailPatterns = [
  '/e/', '/events/', '/event/', '/show/', '/gig/',
  '/concert/', '/performance/', '/exhibition/'
  // ... 30+ patterns
];
```

Purpose: Handles page content retrieval using Puppeteer
Features:
- Browser instance management
- Page content extraction
- Error handling for failed requests
Purpose: MongoDB schema for storing crawl results
Schema Structure:
```javascript
{
  domain: String,        // Target domain
  urls: {
    all: [String],       // All discovered URLs
    listing: [String],   // Listing page URLs
    event: [String]      // Event detail URLs
  },
  patterns: {
    listing: Mixed,      // Listing page patterns
    event: Mixed         // Event detail patterns
  },
  metadata: {
    totalUrls: Number,   // Total URLs found
    crawlDate: Date      // Crawl timestamp
  }
}
```

- Node.js (v14 or higher)
- MongoDB (local or cloud instance)
- npm or yarn
- Clone the repository
```bash
git clone <repository-url>
cd spider-main
```

- Install dependencies

```bash
npm install
```

- Configure MongoDB

```bash
# Set MongoDB URI in environment variable
export MONGODB_URI="mongodb+srv://username:password@cluster.mongodb.net/spider"
# Or update config/scraper.config.js directly
```

- Verify installation

```bash
node index.js
```

Basic usage:

```javascript
const { startScraping } = require('./controllers/ScraperController');

(async () => {
  try {
    const result = await startScraping('https://example-event-site.com');
    console.log('Scraping completed:', result);
  } catch (error) {
    console.error('Scraping failed:', error);
  }
})();
```

```javascript
// Test file: test.js
const urls = await webCrawler.crawl("https://choosechicago.com");
console.log(urls);
```

Example result:

```javascript
{
  success: true,
  urlsFound: 150,
  listingPages: 25,
  eventPages: 75
}
```

Database configuration:

```javascript
// config/scraper.config.js
mongodb: {
  uri: process.env.MONGODB_URI || 'mongodb://localhost:27017/spider'
}
```

Puppeteer configuration:

```javascript
puppeteer: {
  headless: false, // Set to true for production
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-gpu',
    '--disable-dev-shm-usage'
  ]
}
```

Crawling configuration:

```javascript
crawling: {
  maxDepth: 1,       // How deep to crawl
  maxConcurrency: 1, // Concurrent requests
  waitTime: 10000,   // Wait between requests (ms)
  timeout: 100000    // Request timeout (ms)
}
```

`startScraping(domain)`

Initiates the scraping process for a given domain.
Parameters:
- `domain` (string): The target domain to crawl
Returns:
```javascript
{
  success: boolean,
  urlsFound: number,
  listingPages: number,
  eventPages: number
}
```

`crawl(domain, maxDepth)`

Crawls a domain and returns all discovered URLs.
Parameters:
- `domain` (string): Target domain
- `maxDepth` (number): Maximum crawl depth
Returns: Array of discovered URLs
`analyzeUrls(urls)`

Analyzes an array of URLs and categorizes them.
Parameters:
- `urls` (Array): Array of URLs to analyze
Returns:
```javascript
{
  listingPages: Array,
  eventPages: Array,
  patterns: Object,
  totalUrls: number
}
```

```bash
# Check MongoDB URI
echo $MONGODB_URI
# Verify network connectivity
ping your-mongodb-cluster
```

```javascript
// Add these args to puppeteer config
args: [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-gpu',
  '--disable-dev-shm-usage',
  '--disable-web-security',
  '--disable-features=VizDisplayCompositor'
]
```

```javascript
// Reduce concurrency in config
crawling: {
  maxConcurrency: 1, // Reduce from higher values
  maxDepth: 1        // Reduce crawl depth
}
```

```javascript
// Increase wait time between requests
crawling: {
  waitTime: 15000 // Increase from 10000
}
```

```javascript
// Enable detailed logging
const ScraperLogger = require('./utils/ScraperLogger');
ScraperLogger.setLevel('debug');
```

```javascript
// Adjust based on target server capacity
crawling: {
  maxConcurrency: 2, // Increase for faster crawling
  waitTime: 5000     // Decrease for faster crawling
}
```

```javascript
// Close browser instances properly
await pageContentService.close();
await mongoose.connection.close();
```

```javascript
// Add more exclusion patterns
urlPatterns: {
  excluded: ['.pdf', '.jpg', '.png', '.gif', '.css', '.js', '.xml']
}
```

- Distributed Crawling: Support for multiple crawler instances
- Advanced Pattern Learning: ML-based pattern recognition
- Real-time Monitoring: Web dashboard for crawl progress
- API Endpoints: REST API for triggering crawls
- Export Formats: CSV, JSON, XML export options
- Scheduling: Automated crawl scheduling
- Proxy Support: Rotating proxy support for large-scale crawling
- Microservices: Split into separate services
- Message Queues: Redis/RabbitMQ for job distribution
- Caching: Redis caching for frequently accessed data
- Monitoring: Prometheus/Grafana integration
For issues and questions:
- Review the logs in the `logs/` directory
Happy Crawling! 🕷️