Advanced Model Context Protocol toolkit for automated web scraper maintenance, selector generation, and seamless integration with scraping workflows
- Visual Element Inspection: Interactive browser-based element selection with real-time highlighting
- AI-Powered Auto-Detection: Intelligent field detection using advanced heuristic algorithms
- Robust Selector Generation: Multiple selector strategies with reliability scoring and stability analysis
- Automated Validation: Comprehensive selector testing and performance metrics
- Scraping Integration: Seamless integration with existing scraper scripts or standalone generation
- Maintenance Monitoring: Automated scraper health checks and proactive issue detection
- Multi-Language Code Generation: TypeScript, JavaScript, and Python extractor generation
- Configuration Management: Easy config updates and intelligent selector mapping
- Integrated Mode: Automatically enhance existing scraper scripts with MCP-generated selectors
- Standalone Mode: Generate complete, ready-to-use extractor scripts from scratch
- Fallback Mode: Create baseline configurations when no scraper script is available
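As a rough sketch of how these three modes relate, the snippet below models the decision as a small function. The names `IntegrationMode`, `IntegrationConfig`, and `resolveMode` are illustrative only, not the toolkit's actual API; the field names mirror the `integration` block shown in the configuration examples further down.

```typescript
// Illustrative model of the three operating modes (hypothetical names,
// not the toolkit's real API).
type IntegrationMode = 'integrated' | 'standalone' | 'fallback';

interface IntegrationConfig {
  integration_mode: IntegrationMode;
  output_format: 'typescript' | 'javascript' | 'python';
  auto_generate_fallback: boolean;
}

function resolveMode(hasScraperScript: boolean, hasConfig: boolean): IntegrationMode {
  if (hasScraperScript) return 'integrated'; // enhance the existing script
  if (hasConfig) return 'standalone';        // generate a fresh extractor
  return 'fallback';                         // create a baseline config first
}
```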
```bash
# Clone the repository
git clone https://github.com/datahen/mcp-scraper-toolkit.git
cd mcp-scraper-toolkit

# Install dependencies
npm install

# Build the project
npm run build

# Start development server
npm run dev
```

- Node.js 18+
- Playwright browsers
- MCP-compatible client (Claude Desktop, etc.)
```bash
npx playwright install
```

Add the server to your MCP client configuration (Claude Desktop's config file, Cursor's `~/.cursor/mcp.json`, or similar):
```json
{
  "mcpServers": {
    "mcp-scraper-toolkit": {
      "command": "node",
      "args": ["path/to/mcp-scraper-toolkit/dist/server.js"],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}
```

```mermaid
graph TD
    A[Load/Create Config] --> B[Initialize Browser]
    B --> C[Navigate to Page]
    C --> D{Scraper Script Exists?}
    D -->|Yes| E[Integrated Mode]
    D -->|No| F[Standalone Mode]
    E --> G[Inspect Fields]
    F --> G
    G --> H[Generate Selectors]
    H --> I[Validate & Test]
    I --> J[Update Config]
    J --> K{Integration Mode?}
    K -->|Integrated| L[Integrate with Script]
    K -->|Standalone| M[Generate Extractor]
    L --> N[Done]
    M --> N
```
```typescript
// 1. Load your existing scraper configuration
await loadScraperConfig({ configPath: "./my-scraper-config.json" });

// 2. Initialize browser for inspection
await initializeBrowser({
  headless: false,
  viewport: { width: 1280, height: 720 }
});

// 3. Navigate to target page
await navigateToPage({
  url: "https://example-store.com/product/123",
  waitForSelector: ".product-details"
});

// 4. Inspect fields (manual or auto)
await inspectFieldManually({ fieldName: "title" });
await inspectFieldManually({ fieldName: "price" });

// 5. Integrate with existing scraper script
await integrateWithScraper({
  scraperScriptPath: "./existing-scraper.py",
  backupOriginal: true
});
```

```typescript
// 1. Create or load configuration
await generateFallbackConfig({
  url: "https://example-store.com",
  outputPath: "./new-scraper-config.json"
});

// 2. Load the generated config
await loadScraperConfig({ configPath: "./new-scraper-config.json" });

// 3. Browser setup and navigation
await initializeBrowser({ headless: false });
await navigateToPage({ url: "https://example-store.com/product/123" });

// 4. Auto-detect or manually inspect fields
await autoDetectField({ fieldName: "title" });
await autoDetectField({ fieldName: "price" });

// 5. Generate standalone extractor
await generateExtractorCode({
  outputPath: "./extractors/product-extractor.ts",
  language: "typescript",
  includeTests: true,
  includeDocumentation: true
});
```

- `load_scraper_config` - Load an existing scraper configuration
- `generate_fallback_config` - Create a baseline configuration for new scrapers
- `update_config` - Update a configuration with new selector mappings
- `initialize_browser` - Start a browser with custom options
- `navigate_to_page` - Navigate to a target URL with optional wait conditions
- `take_screenshot` - Capture page screenshots for documentation
- `close_browser` - Clean up browser resources
- `inspect_field_manually` - Visual element selection with an interactive overlay
- `auto_detect_field` - AI-powered element detection using heuristics
- `generate_selectors` - Create multiple selector variations for elements
- `validate_selectors` - Test selector reliability and performance
- `test_extraction` - Validate data extraction with current selectors
- `run_maintenance_check` - Comprehensive scraper health analysis
- `integrate_with_scraper` - Enhance existing scraper scripts
- `generate_extractor_code` - Create standalone extractor code
```typescript
// Complete workflow for e-commerce product scraping
const workflow = async () => {
  // Initialize
  await initializeBrowser({ headless: false });
  await navigateToPage({
    url: "https://shop.example.com/products/laptop-xyz",
    waitForSelector: ".product-container"
  });

  // Auto-detect common e-commerce fields
  const fields = ['title', 'price', 'description', 'image', 'rating', 'availability'];
  for (const field of fields) {
    const candidates = await autoDetectField({ fieldName: field });
    console.log(`Detected ${candidates.length} candidates for ${field}`);
  }

  // Manual refinement for complex fields
  await inspectFieldManually({ fieldName: "specifications" });
  await inspectFieldManually({ fieldName: "reviews_count" });

  // Generate Python extractor with tests
  await generateExtractorCode({
    outputPath: "./ecommerce_extractor.py",
    language: "python",
    includeTests: true,
    includeDocumentation: true
  });

  // Run maintenance check
  const report = await runMaintenanceCheck({
    url: "https://shop.example.com/products/laptop-xyz"
  });
  console.log(`Extraction success rate: ${report.successRate}%`);
};
```

```typescript
// Enhance existing news scraper with MCP-generated selectors
const enhanceNewsScraper = async () => {
  // Load existing configuration
  await loadScraperConfig({ configPath: "./news-scraper-config.json" });

  // Browser setup
  await initializeBrowser({ headless: true });
  await navigateToPage({
    url: "https://news.example.com/article/12345"
  });

  // Detect article components
  await autoDetectField({ fieldName: "headline" });
  await autoDetectField({ fieldName: "author" });
  await autoDetectField({ fieldName: "publish_date" });
  await autoDetectField({ fieldName: "content" });
  await autoDetectField({ fieldName: "tags" });

  // Validate all selectors
  const validation = await validateSelectors({
    selectors: [
      "h1.article-title",
      ".author-name",
      "time.publish-date",
      ".article-content",
      ".tag-list a"
    ]
  });

  // Integrate with existing TypeScript scraper
  await integrateWithScraper({
    scraperScriptPath: "./src/news-scraper.ts",
    backupOriginal: true
  });

  console.log("News scraper enhanced successfully!");
};
```

```python
# Python script for automated scraper maintenance
import asyncio
from mcp_toolkit import MCPScraperToolkit

async def daily_maintenance_check():
    toolkit = MCPScraperToolkit()

    # List of scrapers to check
    scrapers = [
        {"config": "./configs/ecommerce.json", "url": "https://shop.example.com"},
        {"config": "./configs/news.json", "url": "https://news.example.com"},
        {"config": "./configs/jobs.json", "url": "https://jobs.example.com"}
    ]

    for scraper in scrapers:
        print(f"Checking {scraper['config']}...")
        report = await toolkit.run_maintenance_check(
            config_path=scraper['config'],
            url=scraper['url']
        )

        success_rate = report['successRate']
        if success_rate < 70:
            print(f"⚠️ {scraper['config']} needs attention! Success rate: {success_rate}%")
            # Send alert to monitoring system
            await send_alert(f"Scraper maintenance required: {scraper['config']}")
        else:
            print(f"✅ {scraper['config']} is healthy. Success rate: {success_rate}%")

if __name__ == "__main__":
    asyncio.run(daily_maintenance_check())
```

```json
{
  "scraper_config": {
    "name": "E-commerce Product Scraper",
    "website_url": "https://example-store.com",
    "schema": {
      "title": {
        "type": "string",
        "description": "Product title or name",
        "required": true
      },
      "price": {
        "type": "string",
        "description": "Current product price",
        "required": true
      },
      "image_url": {
        "type": "string",
        "description": "Main product image URL",
        "required": true
      }
    }
  },
  "integration": {
    "integration_mode": "standalone",
    "output_format": "typescript",
    "auto_generate_fallback": true
  }
}
```

```json
{
  "scraper_config": {
    "name": "E-commerce Product Scraper",
    "website_url": "https://example-store.com",
    "schema": { "..." }
  },
  "data_extraction": {
    "selector_mappings": {
      "title": {
        "field": "title",
        "selector": "h1.product-title",
        "method": "CSS",
        "description": "Product title selector",
        "extraction_method": "text",
        "confidence_score": 95
      },
      "price": {
        "field": "price",
        "selector": ".price-current .price-value",
        "method": "CSS",
        "description": "Product price selector",
        "extraction_method": "text",
        "confidence_score": 88
      },
      "image_url": {
        "field": "image_url",
        "selector": "#product-image img",
        "method": "CSS",
        "description": "Product image selector",
        "extraction_method": "attribute",
        "attribute_name": "src",
        "confidence_score": 92
      }
    },
    "extraction_strategy": "CSS_SELECTORS",
    "created_at": "2024-01-15T10:30:00Z",
    "total_fields": 3
  },
  "integration": {
    "integration_mode": "standalone",
    "output_format": "typescript",
    "auto_generate_fallback": true
  }
}
```

The toolkit uses a sophisticated scoring system to rank selector reliability:
| Selector Type | Base Score | Description |
|---|---|---|
| ID selectors | 100 | Highest reliability |
| Data attributes (`data-test`, `data-cy`) | 90 | Very reliable for testing |
| ARIA attributes | 80 | Semantic and stable |
| Semantic classes | 60 | Meaningful class names |
| Tag + class combinations | 40 | Moderately reliable |
| Text selectors | 30 | Content-dependent |
| Generic tags | 10 | Lowest reliability |
Penalties:
- Generated/dynamic classes: -30 points
- Non-unique selectors: -10 to -50 points based on match count
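To make the scheme concrete, the scoring table and penalties above can be sketched as a small function. This is an illustration, not the toolkit's internal implementation; `scoreSelector`, `SelectorKind`, and the generated-class regex are hypothetical names.

```typescript
// Hypothetical sketch of the reliability scoring described above.
type SelectorKind = 'id' | 'dataAttr' | 'aria' | 'semanticClass' | 'tagClass' | 'text' | 'genericTag';

const BASE_SCORES: Record<SelectorKind, number> = {
  id: 100, dataAttr: 90, aria: 80, semanticClass: 60,
  tagClass: 40, text: 30, genericTag: 10,
};

// Classes like ".css-1a2b3c4" are usually build-generated and unstable.
const GENERATED_CLASS = /(^|\.)((css|sc|jss|emotion)-[a-z0-9]+|[a-z]+-[0-9a-f]{5,})/i;

function scoreSelector(selector: string, kind: SelectorKind, matchCount: number): number {
  let score = BASE_SCORES[kind];
  if (GENERATED_CLASS.test(selector)) score -= 30;                  // dynamic-class penalty
  if (matchCount > 1) score -= Math.min(50, 10 * (matchCount - 1)); // non-unique penalty
  return Math.max(0, score);
}
```

For example, a unique ID selector keeps its full base score, while a generated class like `.css-1a2b3c4` drops from 60 to 30.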
- Smart Detection: Field name-based heuristics
  - Analyzes field names and maps them to common patterns
  - Uses domain knowledge for e-commerce, news, jobs, etc.
- Text Matching: Content-based element detection
  - Searches for text patterns matching field semantics
  - Handles multi-language content
- Structural Matching: Class/ID pattern matching
  - Identifies semantic HTML structures
  - Recognizes common web component patterns
- Contextual Analysis: Surrounding element analysis
  - Considers element positioning and relationships
  - Analyzes parent-child hierarchies
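The first strategy, field-name heuristics, can be pictured as a lookup from field names to likely selector candidates. The pattern table and function below are illustrative assumptions, not the toolkit's real detection logic:

```typescript
// Hypothetical field-name heuristics: map a requested field name to
// plausible CSS selector candidates using domain patterns.
const FIELD_PATTERNS: Record<string, string[]> = {
  title: ['h1', '[itemprop="name"]', '.product-title', '.article-title'],
  price: ['[itemprop="price"]', '.price', '[data-price]'],
  author: ['[rel="author"]', '.author', '[itemprop="author"]'],
};

function candidateSelectors(fieldName: string): string[] {
  const key = fieldName.toLowerCase().replace(/[^a-z]/g, '');
  // Fall back to an attribute/class guess when no pattern is known.
  return FIELD_PATTERNS[key] ?? [`[data-${fieldName}]`, `.${fieldName}`];
}
```

A known field like `title` yields its curated patterns first; an unknown field like `rating` falls back to attribute and class guesses.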
Automatic detection of potentially unstable selectors:
- Dynamic/Generated Classes: CSS classes with random hashes
- Positional Selectors: `:nth-child` and `:first-child` dependencies
- Overly Specific Selectors: Long, brittle selector chains
- Content-Dependent Selectors: Text-based selectors with dynamic values
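A minimal sketch of how such checks might look, using simple heuristic regexes; `checkSelectorStability` and its thresholds are assumptions for illustration, not the toolkit's actual detector:

```typescript
// Heuristic stability checks matching the warning categories above
// (illustrative, not exhaustive).
interface StabilityWarning { selector: string; reason: string; }

function checkSelectorStability(selector: string): StabilityWarning[] {
  const warnings: StabilityWarning[] = [];
  if (/(css|sc|jss)-[a-z0-9]{4,}/i.test(selector))
    warnings.push({ selector, reason: 'dynamic/generated class' });
  if (/:nth-child|:first-child|:last-child/.test(selector))
    warnings.push({ selector, reason: 'positional dependency' });
  if (selector.split(/\s*[>\s]\s*/).length > 4)
    warnings.push({ selector, reason: 'overly specific chain' });
  return warnings;
}
```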
```typescript
// Start with basic config
const basicConfig = await generateFallbackConfig({
  url: "https://example.com",
  outputPath: "./basic-config.json"
});

// Iteratively add fields
for (const field of ['title', 'content', 'author', 'date']) {
  await autoDetectField({ fieldName: field });
  await updateConfig({
    mappings: [/* detected mappings */],
    outputPath: "./enhanced-config.json"
  });
}

// Generate final extractor
await generateExtractorCode({
  outputPath: "./final-extractor.ts",
  language: "typescript",
  includeTests: true
});
```

```python
# Migrate multiple existing scrapers to MCP
async def migrate_scrapers():
    scraper_paths = [
        "./legacy/scraper1.py",
        "./legacy/scraper2.py",
        "./legacy/scraper3.py"
    ]

    for path in scraper_paths:
        # Create config from legacy scraper
        config = await analyze_legacy_scraper(path)

        # Generate MCP configuration
        await load_scraper_config(config)

        # Auto-detect all fields
        for field in config.fields:
            await auto_detect_field(field)

        # Integrate with existing script
        await integrate_with_scraper(path)

        print(f"Migrated {path} successfully!")
```

```typescript
// Generate extractor with comprehensive tests
await generateExtractorCode({
  outputPath: "./extractor.ts",
  language: "typescript",
  includeTests: true,
  includeDocumentation: true
});

// This creates:
// - extractor.ts (main extractor)
// - extractor.test.ts (test suite)
// - README.md (documentation)
```

```bash
# Run the generated test suite
npm test

# Test specific extractor
node extractor.js https://example.com/test-page

# Run with debugging
DEBUG=1 node extractor.js https://example.com/test-page
```

```yaml
# .github/workflows/scraper-health.yml
name: Scraper Health Check

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run Health Check
        run: |
          npm install
          npm run health-check
```

```typescript
// Prefer semantic selectors
// ✅ Good: '[data-testid="product-title"]'
// ✅ Good: 'h1.product-name'
// ❌ Avoid: '.css-1a2b3c4'
// ❌ Avoid: 'div:nth-child(3) > span:nth-child(2)'
```

```typescript
// Always include error handling in extractors
try {
  const price = await page.locator('.price').textContent();
  data.price = price?.trim() || null;
} catch (error) {
  console.warn('Failed to extract price:', error);
  data.price = null;
}
```

```typescript
// Use environment-specific configurations
const config = process.env.NODE_ENV === 'production'
  ? './configs/prod-config.json'
  : './configs/dev-config.json';

await loadScraperConfig({ configPath: config });
```

```typescript
// Schedule regular maintenance checks
const schedule = {
  daily: ['high-priority-scrapers'],
  weekly: ['medium-priority-scrapers'],
  monthly: ['low-priority-scrapers']
};

for (const [frequency, scrapers] of Object.entries(schedule)) {
  scheduleMaintenance(frequency, scrapers);
}
```

Problem: Browser fails to start
```
Error: Failed to launch browser
```
Solutions:
```bash
# Install browser dependencies
npx playwright install-deps
```

```typescript
// Use a different browser
await initializeBrowser({
  browser: 'firefox',  // Try firefox instead of chromium
  headless: true
});

// Check system resources
await initializeBrowser({
  args: ['--no-sandbox', '--disable-dev-shm-usage']
});
```

Problem: Previously working selectors stop functioning

```
❌ Selector validation failed: Element not found
```

Solutions:

```typescript
// Run maintenance check to identify issues
const report = await runMaintenanceCheck({
  url: "https://target-site.com",
  configPath: "./config.json"
});

// Re-inspect problematic fields
for (const issue of report.issues) {
  if (issue.severity === 'high') {
    await inspectFieldManually({ fieldName: issue.field });
  }
}
```

Problem: Failed to integrate with an existing scraper

```
Error: Could not parse existing scraper script
```

Solutions:

```typescript
// Check file permissions and syntax
await fs.access(scraperPath, fs.constants.R_OK | fs.constants.W_OK);

// Use backup and manual integration
await integrateWithScraper({
  scraperScriptPath: "./scraper.py",
  backupOriginal: true  // Always create a backup
});

// Manual integration as fallback
const extractorCode = await generateExtractorCode({
  outputPath: "./new-extractor.py",
  language: "python"
});
```

```bash
# Enable detailed logging
DEBUG=mcp:* npm run dev
```

```typescript
// Browser debugging (visible browser)
await initializeBrowser({
  headless: false,
  devtools: true
});
```

```typescript
// Optimize for performance
await initializeBrowser({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-renderer-backgrounding'
  ]
});

// Use faster selectors
const fastSelectors = selectors.filter(s =>
  s.startsWith('#') || s.startsWith('[data-test')
);
```

```typescript
// Optimized browser configuration
await initializeBrowser({
  headless: true,
  timeout: 15000,
  viewport: { width: 1280, height: 720 },
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-background-timer-throttling'
  ]
});
```

- Use specific selectors to reduce DOM search time
- Avoid complex XPath expressions - prefer CSS selectors
- Minimize DOM traversal depth - target elements directly
- Cache validation results for repeated selector checks
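The caching tip above can be sketched as a thin memoizing wrapper around whatever validation call you use. `validateCached` and the TTL are illustrative assumptions, not part of the toolkit's API:

```typescript
// Hypothetical cache for repeated selector validation checks.
const validationCache = new Map<string, { valid: boolean; checkedAt: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // revalidate after 5 minutes

async function validateCached(
  selector: string,
  validate: (s: string) => Promise<boolean>,
): Promise<boolean> {
  const hit = validationCache.get(selector);
  if (hit && Date.now() - hit.checkedAt < CACHE_TTL_MS) return hit.valid;
  const valid = await validate(selector);
  validationCache.set(selector, { valid, checkedAt: Date.now() });
  return valid;
}
```

Repeated checks on the same selector within the TTL then skip the expensive browser round trip.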
```typescript
// Process multiple fields concurrently
const fieldPromises = fields.map(async (field) => {
  return await autoDetectField({ fieldName: field });
});

const results = await Promise.all(fieldPromises);
```

We welcome contributions! Please see our Contributing Guidelines for details.
```bash
# Clone the repository
git clone https://github.com/datahen/mcp-scraper-toolkit.git

# Install dependencies
npm install

# Run in development mode
npm run dev

# Run tests
npm test

# Build for production
npm run build
```

- TypeScript: Strict type checking enabled
- ESLint: Code linting and formatting
- Prettier: Consistent code formatting
- Jest: Unit and integration testing
MIT License - see LICENSE file for details.
- GitHub Issues: Report bugs and request features
- Documentation: Full documentation and guides
- Community: Join discussions
- Playwright Team: For the excellent browser automation framework
- MCP Community: For the Model Context Protocol standard
- DataHen Team: For scraping expertise and best practices
- Contributors: All developers who have contributed to this project
Built with ❤️ for the web scraping community
MCP Scraper Toolkit - Making web scraping maintenance intelligent, automated, and enjoyable.