Advanced Model Context Protocol toolkit for automated web scraper maintenance, selector generation, and seamless integration with scraping workflows
- Visual Element Inspection: Interactive browser-based element selection with real-time highlighting
- AI-Powered Auto-Detection: Intelligent field detection using advanced heuristic algorithms
- Robust Selector Generation: Multiple selector strategies with reliability scoring and stability analysis
- Automated Validation: Comprehensive selector testing and performance metrics
- Scraping Integration: Seamless integration with existing scraper scripts or standalone generation
- Maintenance Monitoring: Automated scraper health checks and proactive issue detection
- Multi-Language Code Generation: TypeScript, JavaScript, and Python extractor generation
- Configuration Management: Easy config updates and intelligent selector mapping
- Integrated Mode: Automatically enhance existing scraper scripts with MCP-generated selectors
- Standalone Mode: Generate complete, ready-to-use extractor scripts from scratch
- Fallback Mode: Create baseline configurations when no scraper script is available
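As a rough sketch of how these three modes relate, the snippet below models the decision as a small function. The names `IntegrationMode`, `IntegrationConfig`, and `resolveMode` are illustrative only, not the toolkit's actual API; the field names mirror the `integration` block shown in the configuration examples further down.

```typescript
// Illustrative model of the three operating modes (hypothetical names,
// not the toolkit's real API).
type IntegrationMode = 'integrated' | 'standalone' | 'fallback';

interface IntegrationConfig {
  integration_mode: IntegrationMode;
  output_format: 'typescript' | 'javascript' | 'python';
  auto_generate_fallback: boolean;
}

function resolveMode(hasScraperScript: boolean, hasConfig: boolean): IntegrationMode {
  if (hasScraperScript) return 'integrated'; // enhance the existing script
  if (hasConfig) return 'standalone';        // generate a fresh extractor
  return 'fallback';                         // create a baseline config first
}
```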
```bash
# Clone the repository
git clone https://github.com/datahen/mcp-scraper-toolkit.git
cd mcp-scraper-toolkit

# Install dependencies
npm install

# Build the project
npm run build

# Start development server
npm run dev
```

- Node.js 18+
- Playwright browsers
- MCP-compatible client (Claude Desktop, etc.)
```bash
npx playwright install
```

Add the server to your MCP client configuration (Claude Desktop's config file, Cursor's `~/.cursor/mcp.json`, or similar):
```json
{
  "mcpServers": {
    "mcp-scraper-toolkit": {
      "command": "node",
      "args": ["path/to/mcp-scraper-toolkit/dist/server.js"],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}
```

```mermaid
graph TD
    A[Load/Create Config] --> B[Initialize Browser]
    B --> C[Navigate to Page]
    C --> D{Scraper Script Exists?}
    D -->|Yes| E[Integrated Mode]
    D -->|No| F[Standalone Mode]
    E --> G[Inspect Fields]
    F --> G
    G --> H[Generate Selectors]
    H --> I[Validate & Test]
    I --> J[Update Config]
    J --> K{Integration Mode?}
    K -->|Integrated| L[Integrate with Script]
    K -->|Standalone| M[Generate Extractor]
    L --> N[Done]
    M --> N
```
```typescript
// 1. Load your existing scraper configuration
await loadScraperConfig({ configPath: "./my-scraper-config.json" });

// 2. Initialize browser for inspection
await initializeBrowser({
  headless: false,
  viewport: { width: 1280, height: 720 }
});

// 3. Navigate to target page
await navigateToPage({
  url: "https://example-store.com/product/123",
  waitForSelector: ".product-details"
});

// 4. Inspect fields (manual or auto)
await inspectFieldManually({ fieldName: "title" });
await inspectFieldManually({ fieldName: "price" });

// 5. Integrate with existing scraper script
await integrateWithScraper({
  scraperScriptPath: "./existing-scraper.py",
  backupOriginal: true
});
```

```typescript
// 1. Create or load configuration
await generateFallbackConfig({
  url: "https://example-store.com",
  outputPath: "./new-scraper-config.json"
});

// 2. Load the generated config
await loadScraperConfig({ configPath: "./new-scraper-config.json" });

// 3. Browser setup and navigation
await initializeBrowser({ headless: false });
await navigateToPage({ url: "https://example-store.com/product/123" });

// 4. Auto-detect or manually inspect fields
await autoDetectField({ fieldName: "title" });
await autoDetectField({ fieldName: "price" });

// 5. Generate standalone extractor
await generateExtractorCode({
  outputPath: "./extractors/product-extractor.ts",
  language: "typescript",
  includeTests: true,
  includeDocumentation: true
});
```

- `load_scraper_config` - Load an existing scraper configuration
- `generate_fallback_config` - Create a baseline configuration for new scrapers
- `update_config` - Update a configuration with new selector mappings
- `initialize_browser` - Start a browser with custom options
- `navigate_to_page` - Navigate to a target URL with optional wait conditions
- `take_screenshot` - Capture page screenshots for documentation
- `close_browser` - Clean up browser resources
- `inspect_field_manually` - Visual element selection with an interactive overlay
- `auto_detect_field` - AI-powered element detection using heuristics
- `generate_selectors` - Create multiple selector variations for elements
- `validate_selectors` - Test selector reliability and performance
- `test_extraction` - Validate data extraction with current selectors
- `run_maintenance_check` - Comprehensive scraper health analysis
- `integrate_with_scraper` - Enhance existing scraper scripts
- `generate_extractor_code` - Create standalone extractor code
```typescript
// Complete workflow for e-commerce product scraping
const workflow = async () => {
  // Initialize
  await initializeBrowser({ headless: false });
  await navigateToPage({
    url: "https://shop.example.com/products/laptop-xyz",
    waitForSelector: ".product-container"
  });

  // Auto-detect common e-commerce fields
  const fields = ['title', 'price', 'description', 'image', 'rating', 'availability'];
  for (const field of fields) {
    const candidates = await autoDetectField({ fieldName: field });
    console.log(`Detected ${candidates.length} candidates for ${field}`);
  }

  // Manual refinement for complex fields
  await inspectFieldManually({ fieldName: "specifications" });
  await inspectFieldManually({ fieldName: "reviews_count" });

  // Generate Python extractor with tests
  await generateExtractorCode({
    outputPath: "./ecommerce_extractor.py",
    language: "python",
    includeTests: true,
    includeDocumentation: true
  });

  // Run maintenance check
  const report = await runMaintenanceCheck({
    url: "https://shop.example.com/products/laptop-xyz"
  });
  console.log(`Extraction success rate: ${report.successRate}%`);
};
```

```typescript
// Enhance existing news scraper with MCP-generated selectors
const enhanceNewsScraper = async () => {
  // Load existing configuration
  await loadScraperConfig({ configPath: "./news-scraper-config.json" });

  // Browser setup
  await initializeBrowser({ headless: true });
  await navigateToPage({
    url: "https://news.example.com/article/12345"
  });

  // Detect article components
  await autoDetectField({ fieldName: "headline" });
  await autoDetectField({ fieldName: "author" });
  await autoDetectField({ fieldName: "publish_date" });
  await autoDetectField({ fieldName: "content" });
  await autoDetectField({ fieldName: "tags" });

  // Validate all selectors
  const validation = await validateSelectors({
    selectors: [
      "h1.article-title",
      ".author-name",
      "time.publish-date",
      ".article-content",
      ".tag-list a"
    ]
  });

  // Integrate with existing TypeScript scraper
  await integrateWithScraper({
    scraperScriptPath: "./src/news-scraper.ts",
    backupOriginal: true
  });

  console.log("News scraper enhanced successfully!");
};
```

```python
# Python script for automated scraper maintenance
import asyncio
from mcp_toolkit import MCPScraperToolkit

async def daily_maintenance_check():
    toolkit = MCPScraperToolkit()

    # List of scrapers to check
    scrapers = [
        {"config": "./configs/ecommerce.json", "url": "https://shop.example.com"},
        {"config": "./configs/news.json", "url": "https://news.example.com"},
        {"config": "./configs/jobs.json", "url": "https://jobs.example.com"}
    ]

    for scraper in scrapers:
        print(f"Checking {scraper['config']}...")
        report = await toolkit.run_maintenance_check(
            config_path=scraper['config'],
            url=scraper['url']
        )

        success_rate = report['successRate']
        if success_rate < 70:
            print(f"⚠️ {scraper['config']} needs attention! Success rate: {success_rate}%")
            # Send alert to monitoring system
            await send_alert(f"Scraper maintenance required: {scraper['config']}")
        else:
            print(f"✅ {scraper['config']} is healthy. Success rate: {success_rate}%")

if __name__ == "__main__":
    asyncio.run(daily_maintenance_check())
```

```json
{
  "scraper_config": {
    "name": "E-commerce Product Scraper",
    "website_url": "https://example-store.com",
    "schema": {
      "title": {
        "type": "string",
        "description": "Product title or name",
        "required": true
      },
      "price": {
        "type": "string",
        "description": "Current product price",
        "required": true
      },
      "image_url": {
        "type": "string",
        "description": "Main product image URL",
        "required": true
      }
    }
  },
  "integration": {
    "integration_mode": "standalone",
    "output_format": "typescript",
    "auto_generate_fallback": true
  }
}
```

```json
{
  "scraper_config": {
    "name": "E-commerce Product Scraper",
    "website_url": "https://example-store.com",
    "schema": { "..." }
  },
  "data_extraction": {
    "selector_mappings": {
      "title": {
        "field": "title",
        "selector": "h1.product-title",
        "method": "CSS",
        "description": "Product title selector",
        "extraction_method": "text",
        "confidence_score": 95
      },
      "price": {
        "field": "price",
        "selector": ".price-current .price-value",
        "method": "CSS",
        "description": "Product price selector",
        "extraction_method": "text",
        "confidence_score": 88
      },
      "image_url": {
        "field": "image_url",
        "selector": "#product-image img",
        "method": "CSS",
        "description": "Product image selector",
        "extraction_method": "attribute",
        "attribute_name": "src",
        "confidence_score": 92
      }
    },
    "extraction_strategy": "CSS_SELECTORS",
    "created_at": "2024-01-15T10:30:00Z",
    "total_fields": 3
  },
  "integration": {
    "integration_mode": "standalone",
    "output_format": "typescript",
    "auto_generate_fallback": true
  }
}
```

The toolkit uses a sophisticated scoring system to rank selector reliability:
| Selector Type | Base Score | Description |
|---|---|---|
| ID selectors | 100 | Highest reliability |
| Data attributes (`data-test`, `data-cy`) | 90 | Very reliable for testing |
| ARIA attributes | 80 | Semantic and stable |
| Semantic classes | 60 | Meaningful class names |
| Tag + class combinations | 40 | Moderately reliable |
| Text selectors | 30 | Content-dependent |
| Generic tags | 10 | Lowest reliability |
Penalties:
- Generated/dynamic classes: -30 points
- Non-unique selectors: -10 to -50 points based on match count
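To make the scheme concrete, the scoring table and penalties above can be sketched as a small function. This is an illustration, not the toolkit's internal implementation; `scoreSelector`, `SelectorKind`, and the generated-class regex are hypothetical names.

```typescript
// Hypothetical sketch of the reliability scoring described above.
type SelectorKind = 'id' | 'dataAttr' | 'aria' | 'semanticClass' | 'tagClass' | 'text' | 'genericTag';

const BASE_SCORES: Record<SelectorKind, number> = {
  id: 100, dataAttr: 90, aria: 80, semanticClass: 60,
  tagClass: 40, text: 30, genericTag: 10,
};

// Classes like ".css-1a2b3c4" are usually build-generated and unstable.
const GENERATED_CLASS = /(^|\.)((css|sc|jss|emotion)-[a-z0-9]+|[a-z]+-[0-9a-f]{5,})/i;

function scoreSelector(selector: string, kind: SelectorKind, matchCount: number): number {
  let score = BASE_SCORES[kind];
  if (GENERATED_CLASS.test(selector)) score -= 30;                  // dynamic-class penalty
  if (matchCount > 1) score -= Math.min(50, 10 * (matchCount - 1)); // non-unique penalty
  return Math.max(0, score);
}
```

For example, a unique ID selector keeps its full base score, while a generated class like `.css-1a2b3c4` drops from 60 to 30.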
- Smart Detection: Field name-based heuristics
  - Analyzes field names and maps them to common patterns
  - Uses domain knowledge for e-commerce, news, jobs, etc.
- Text Matching: Content-based element detection
  - Searches for text patterns matching field semantics
  - Handles multi-language content
- Structural Matching: Class/ID pattern matching
  - Identifies semantic HTML structures
  - Recognizes common web component patterns
- Contextual Analysis: Surrounding element analysis
  - Considers element positioning and relationships
  - Analyzes parent-child hierarchies
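The first strategy, field-name heuristics, can be pictured as a lookup from field names to likely selector candidates. The pattern table and function below are illustrative assumptions, not the toolkit's real detection logic:

```typescript
// Hypothetical field-name heuristics: map a requested field name to
// plausible CSS selector candidates using domain patterns.
const FIELD_PATTERNS: Record<string, string[]> = {
  title: ['h1', '[itemprop="name"]', '.product-title', '.article-title'],
  price: ['[itemprop="price"]', '.price', '[data-price]'],
  author: ['[rel="author"]', '.author', '[itemprop="author"]'],
};

function candidateSelectors(fieldName: string): string[] {
  const key = fieldName.toLowerCase().replace(/[^a-z]/g, '');
  // Fall back to an attribute/class guess when no pattern is known.
  return FIELD_PATTERNS[key] ?? [`[data-${fieldName}]`, `.${fieldName}`];
}
```

A known field like `title` yields its curated patterns first; an unknown field like `rating` falls back to attribute and class guesses.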
Automatic detection of potentially unstable selectors:
- Dynamic/Generated Classes: CSS classes with random hashes
- Positional Selectors: `:nth-child` and `:first-child` dependencies
- Overly Specific Selectors: Long, brittle selector chains
- Content-Dependent Selectors: Text-based selectors with dynamic values
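A minimal sketch of how such checks might look, using simple heuristic regexes; `checkSelectorStability` and its thresholds are assumptions for illustration, not the toolkit's actual detector:

```typescript
// Heuristic stability checks matching the warning categories above
// (illustrative, not exhaustive).
interface StabilityWarning { selector: string; reason: string; }

function checkSelectorStability(selector: string): StabilityWarning[] {
  const warnings: StabilityWarning[] = [];
  if (/(css|sc|jss)-[a-z0-9]{4,}/i.test(selector))
    warnings.push({ selector, reason: 'dynamic/generated class' });
  if (/:nth-child|:first-child|:last-child/.test(selector))
    warnings.push({ selector, reason: 'positional dependency' });
  if (selector.split(/\s*[>\s]\s*/).length > 4)
    warnings.push({ selector, reason: 'overly specific chain' });
  return warnings;
}
```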
```typescript
// Start with basic config
const basicConfig = await generateFallbackConfig({
  url: "https://example.com",
  outputPath: "./basic-config.json"
});

// Iteratively add fields
for (const field of ['title', 'content', 'author', 'date']) {
  await autoDetectField({ fieldName: field });
  await updateConfig({
    mappings: [/* detected mappings */],
    outputPath: "./enhanced-config.json"
  });
}

// Generate final extractor
await generateExtractorCode({
  outputPath: "./final-extractor.ts",
  language: "typescript",
  includeTests: true
});
```

```python
# Migrate multiple existing scrapers to MCP
async def migrate_scrapers():
    scraper_paths = [
        "./legacy/scraper1.py",
        "./legacy/scraper2.py",
        "./legacy/scraper3.py"
    ]

    for path in scraper_paths:
        # Create config from legacy scraper
        config = await analyze_legacy_scraper(path)

        # Generate MCP configuration
        await load_scraper_config(config)

        # Auto-detect all fields
        for field in config.fields:
            await auto_detect_field(field)

        # Integrate with existing script
        await integrate_with_scraper(path)

        print(f"Migrated {path} successfully!")
```

```typescript
// Generate extractor with comprehensive tests
await generateExtractorCode({
  outputPath: "./extractor.ts",
  language: "typescript",
  includeTests: true,
  includeDocumentation: true
});

// This creates:
// - extractor.ts (main extractor)
// - extractor.test.ts (test suite)
// - README.md (documentation)
```

```bash
# Run the generated test suite
npm test

# Test specific extractor
node extractor.js https://example.com/test-page

# Run with debugging
DEBUG=1 node extractor.js https://example.com/test-page
```

```yaml
# .github/workflows/scraper-health.yml
name: Scraper Health Check

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run Health Check
        run: |
          npm install
          npm run health-check
```

```typescript
// Prefer semantic selectors
// ✅ Good: '[data-testid="product-title"]'
// ✅ Good: 'h1.product-name'
// ❌ Avoid: '.css-1a2b3c4'
// ❌ Avoid: 'div:nth-child(3) > span:nth-child(2)'
```

```typescript
// Always include error handling in extractors
try {
  const price = await page.locator('.price').textContent();
  data.price = price?.trim() || null;
} catch (error) {
  console.warn('Failed to extract price:', error);
  data.price = null;
}
```

```typescript
// Use environment-specific configurations
const config = process.env.NODE_ENV === 'production'
  ? './configs/prod-config.json'
  : './configs/dev-config.json';

await loadScraperConfig({ configPath: config });
```

```typescript
// Schedule regular maintenance checks
const schedule = {
  daily: ['high-priority-scrapers'],
  weekly: ['medium-priority-scrapers'],
  monthly: ['low-priority-scrapers']
};

for (const [frequency, scrapers] of Object.entries(schedule)) {
  scheduleMaintenance(frequency, scrapers);
}
```

Problem: Browser fails to start
```
Error: Failed to launch browser
```
Solutions:
```bash
# Install browser dependencies
npx playwright install-deps
```

```typescript
// Use a different browser
await initializeBrowser({
  browser: 'firefox',  // Try firefox instead of chromium
  headless: true
});

// Check system resources
await initializeBrowser({
  args: ['--no-sandbox', '--disable-dev-shm-usage']
});
```

Problem: Previously working selectors stop functioning

```
❌ Selector validation failed: Element not found
```

Solutions:

```typescript
// Run maintenance check to identify issues
const report = await runMaintenanceCheck({
  url: "https://target-site.com",
  configPath: "./config.json"
});

// Re-inspect problematic fields
for (const issue of report.issues) {
  if (issue.severity === 'high') {
    await inspectFieldManually({ fieldName: issue.field });
  }
}
```

Problem: Failed to integrate with an existing scraper

```
Error: Could not parse existing scraper script
```

Solutions:

```typescript
// Check file permissions and syntax
await fs.access(scraperPath, fs.constants.R_OK | fs.constants.W_OK);

// Use backup and manual integration
await integrateWithScraper({
  scraperScriptPath: "./scraper.py",
  backupOriginal: true  // Always create a backup
});

// Manual integration as fallback
const extractorCode = await generateExtractorCode({
  outputPath: "./new-extractor.py",
  language: "python"
});
```

```bash
# Enable detailed logging
DEBUG=mcp:* npm run dev
```

```typescript
// Browser debugging (visible browser)
await initializeBrowser({
  headless: false,
  devtools: true
});
```

```typescript
// Optimize for performance
await initializeBrowser({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-renderer-backgrounding'
  ]
});

// Use faster selectors
const fastSelectors = selectors.filter(s =>
  s.startsWith('#') || s.startsWith('[data-test')
);
```

```typescript
// Optimized browser configuration
await initializeBrowser({
  headless: true,
  timeout: 15000,
  viewport: { width: 1280, height: 720 },
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-background-timer-throttling'
  ]
});
```

- Use specific selectors to reduce DOM search time
- Avoid complex XPath expressions - prefer CSS selectors
- Minimize DOM traversal depth - target elements directly
- Cache validation results for repeated selector checks
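The caching tip above can be sketched as a thin memoizing wrapper around whatever validation call you use. `validateCached` and the TTL are illustrative assumptions, not part of the toolkit's API:

```typescript
// Hypothetical cache for repeated selector validation checks.
const validationCache = new Map<string, { valid: boolean; checkedAt: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // revalidate after 5 minutes

async function validateCached(
  selector: string,
  validate: (s: string) => Promise<boolean>,
): Promise<boolean> {
  const hit = validationCache.get(selector);
  if (hit && Date.now() - hit.checkedAt < CACHE_TTL_MS) return hit.valid;
  const valid = await validate(selector);
  validationCache.set(selector, { valid, checkedAt: Date.now() });
  return valid;
}
```

Repeated checks on the same selector within the TTL then skip the expensive browser round trip.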
```typescript
// Process multiple fields concurrently
const fieldPromises = fields.map(async (field) => {
  return await autoDetectField({ fieldName: field });
});

const results = await Promise.all(fieldPromises);
```

We welcome contributions! Please see our Contributing Guidelines for details.
```bash
# Clone the repository
git clone https://github.com/datahen/mcp-scraper-toolkit.git

# Install dependencies
npm install

# Run in development mode
npm run dev

# Run tests
npm test

# Build for production
npm run build
```

- TypeScript: Strict type checking enabled
- ESLint: Code linting and formatting
- Prettier: Consistent code formatting
- Jest: Unit and integration testing
MIT License - see LICENSE file for details.
- GitHub Issues: Report bugs and request features
- Documentation: Full documentation and guides
- Community: Join discussions
- Playwright Team: For the excellent browser automation framework
- MCP Community: For the Model Context Protocol standard
- DataHen Team: For scraping expertise and best practices
- Contributors: All developers who have contributed to this project
Built with ❤️ for the web scraping community
MCP Scraper Toolkit - Making web scraping maintenance intelligent, automated, and enjoyable.