A powerful, comprehensive web scraping tool built with Crawl4AI that extracts ALL data from websites including text, images, tables, forms, and interactive elements with intelligent content analysis.
- ALL text content: Paragraphs, headings, lists, links
- ALL table data: Headers, cells, rows with precise coordinates
- ALL images: Photos, icons, logos with metadata
- ALL forms: Input fields, buttons, options
- ALL interactive elements: Navigation, menus, social media links
- Content classification: Automatic categorization of content types
- Sentiment analysis: Rating estimation from text content
- Smart tagging: Auto-generated tags based on content analysis
- Author extraction: Intelligent user identification
- Position tracking: Exact location of each element
- JSON format: Clean, organized data structure
- Multiple formats: JSON, CSV, Markdown, HTML
- Rich metadata: Word counts, timestamps, source attribution
- Performance metrics: Scraping time tracking
- Infinite scroll handling: Automatic scrolling for dynamic content
- JavaScript execution: Custom JS code execution
- Proxy support: Built-in proxy configuration
- Async processing: High-performance concurrent scraping (see the sketch after this list)
- Error handling: Robust error recovery and reporting
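As a rough illustration of the async pipeline, here is a minimal sketch of concurrent crawling with Crawl4AI's `AsyncWebCrawler`; this is not the project's own code, and the result fields used (`success`, `markdown`, `error_message`) are assumed to match the current Crawl4AI API.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_all(urls):
    # One shared browser instance; all URLs are crawled concurrently.
    async with AsyncWebCrawler() as crawler:
        results = await asyncio.gather(*(crawler.arun(url=u) for u in urls))
    for url, result in zip(urls, results):
        if result.success:
            print(f"{url}: {len(str(result.markdown))} characters of markdown")
        else:
            print(f"{url}: failed ({result.error_message})")

if __name__ == "__main__":
    asyncio.run(crawl_all(["https://example.com", "https://news.ycombinator.com"]))
```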
- Python 3.8+
- Chrome/Chromium browser
# Clone the repository
git clone https://github.com/asish231/CRAWAL4AI_based_scrapper.git
cd CRAWAL4AI_based_scrapper
# Install dependencies
pip install -r requirements.txt
# The chromedriver.exe is included in the repository

# Scrape a single URL
python advanced_scraper.py -u https://example.com
# Scrape multiple URLs
python advanced_scraper.py -u https://example.com https://news.ycombinator.com
# Scrape URLs from file
python advanced_scraper.py -f urls.txt

# Custom settings
python advanced_scraper.py -u https://example.com --headless false --infinite-scroll true
# With custom JavaScript
python advanced_scraper.py -u https://example.com --js-code "window.scrollTo(0, document.body.scrollHeight);"
# Custom output directory
python advanced_scraper.py -u https://example.com --output-dir my_data --filename-prefix custom_scrape

| Option | Description | Default |
|---|---|---|
| -u, --urls | URLs to scrape | Required |
| -f, --file | File containing URLs | Alternative to -u |
| --headless | Run browser in headless mode | true |
| --infinite-scroll | Enable infinite scroll | true |
| --delay | Delay between requests (seconds) | 1.0 |
| --js-code | Custom JavaScript code | None |
| --css-selector | CSS selector for extraction | None |
| --output-dir | Output directory | scraped_data |
| --verbose | Enable verbose logging | false |
The scraper outputs data in a structured JSON format:
{
"title": "Page Title",
"photos": [
"https://example.com/image1.jpg",
"https://example.com/image2.jpg"
],
"reviews": [
{
"user": "Content Block 1",
"text": "Main content text...",
"rating": 4,
"date": "2025-08-06",
"source": "paragraph_content",
"content_type": "text",
"word_count": 25
}
],
"tags": ["Technology", "Web Development", "Open Source"],
"scraping_info": {
"url": "https://example.com",
"scraped_at": "2025-08-06T21:46:59.157",
"scraping_time_seconds": 5.17,
"success": true,
"status_code": 200
}
}

The scraper generates multiple output formats:
scraped_data/
├── json/ # Structured JSON data
├── csv/ # Tabular CSV format
├── markdown/ # Individual markdown files
├── html/ # Raw HTML files
└── logs/ # Scraping logs
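Once a run finishes, the JSON output can be consumed directly. A minimal sketch, assuming the structure shown above; the file name under `scraped_data/json/` is hypothetical and depends on the prefix and timestamp used:

```python
import json
from pathlib import Path

# Hypothetical file name; actual names depend on --filename-prefix and the run timestamp.
path = Path("scraped_data/json/example_com.json")
data = json.loads(path.read_text(encoding="utf-8"))

info = data["scraping_info"]
print(f"{data['title']} ({info['url']}) scraped in {info['scraping_time_seconds']}s")
print(f"{len(data['photos'])} photos, {len(data['reviews'])} content blocks, tags: {data['tags']}")

# Show the longest extracted text blocks first.
for block in sorted(data["reviews"], key=lambda r: r["word_count"], reverse=True)[:5]:
    print(f"[{block['content_type']}] {block['text'][:80]}")
```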
- Competitor analysis
- Market research
- Price monitoring
- Content analysis
- Dataset creation
- Content analysis
- Sentiment analysis
- Web data mining
- Training data collection
- Content classification
- Natural language processing
- Computer vision datasets
- News aggregation
- Content curation
- Social media monitoring
- Blog content extraction
Create a config.json file:
{
"output_dir": "scraped_data",
"max_retries": 3,
"delay_between_requests": 1,
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"enable_infinite_scroll": true,
"scroll_delay": 2,
"max_scroll_attempts": 5
}

Create a urls.txt file:
https://example.com
https://news.ycombinator.com
https://github.com
# Comments start with #
https://stackoverflow.com
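If you drive the scraper from your own script, both files are easy to parse. A minimal sketch, assuming the formats shown above (one URL per line with # comments, and the config keys from the example):

```python
import json
from pathlib import Path

def load_urls(path="urls.txt"):
    """Return non-empty, non-comment lines from a URL list file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines
            if line.strip() and not line.strip().startswith("#")]

def load_config(path="config.json"):
    """Merge config.json over a few defaults taken from the example above."""
    defaults = {
        "output_dir": "scraped_data",
        "max_retries": 3,
        "delay_between_requests": 1,
        "enable_infinite_scroll": True,
    }
    return {**defaults, **json.loads(Path(path).read_text(encoding="utf-8"))}

print(load_urls())
print(load_config())
```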
- Speed: 3-5 seconds per page
- Accuracy: Captures all detected text, image, table, and form elements on a page
- Memory: Efficient async processing
- Scalability: Handles multiple URLs concurrently
==================================================
SCRAPING STATISTICS
==================================================
Total Urls: 1
Successful: 1
Failed: 0
Success Rate: 100.00%
Total Photos: 3
Total Reviews: 322
Total Tags: 3
Average Scraping Time: 5.17 seconds
The scraper includes robust error handling (see the retry sketch after this list):
- Network errors: Automatic retries
- JavaScript errors: Graceful fallbacks
- Parsing errors: Detailed error reporting
- Timeout handling: Configurable timeouts
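The retry idea can be sketched as bounded attempts with a growing pause between them; this is not the project's internal implementation, just the general pattern, reusing the max_retries and delay_between_requests settings from the example config:

```python
import asyncio

async def fetch_with_retries(crawler, url, max_retries=3, delay=1.0):
    """Retry a crawl a bounded number of times, waiting longer after each failure."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            result = await crawler.arun(url=url)
            if result.success:
                return result
            last_error = result.error_message
        except Exception as exc:  # network, timeout, or JavaScript errors
            last_error = exc
        # Simple linear backoff before the next attempt.
        await asyncio.sleep(delay * attempt)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts: {last_error}")
```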
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project includes:
- Main code: MIT License
- ChromeDriver: See `LICENSE.chromedriver` and `THIRD_PARTY_NOTICES.chromedriver`
- crawl4ai: Advanced web crawling framework
- beautifulsoup4: HTML parsing and extraction
- requests: HTTP library for web requests
- asyncio: Asynchronous programming support
For issues and questions:
- Check the Issues page
- Create a new issue with detailed information
- Include error logs and example URLs
- Crawl4AI: Powerful web crawling framework
- BeautifulSoup: HTML parsing library
- Chromium: Web browser engine
Built with ❤️ for comprehensive web data extraction