Skip to content

A powerful, comprehensive web scraping tool built with Crawl4AI that extracts ALL data from websites including text, images, tables, forms, and interactive elements with intelligent content analysis.

License

Notifications You must be signed in to change notification settings

asish231/CRAWAL4AI_based_scrapper

Repository files navigation

πŸš€ CRAWL4AI Based Advanced Web Scraper

A powerful, comprehensive web scraping tool built with Crawl4AI that extracts ALL data from websites including text, images, tables, forms, and interactive elements with intelligent content analysis.

✨ Features

🎯 Complete Data Extraction

  • ALL text content: Paragraphs, headings, lists, links
  • ALL table data: Headers, cells, rows with precise coordinates
  • ALL images: Photos, icons, logos with metadata
  • ALL forms: Input fields, buttons, options
  • ALL interactive elements: Navigation, menus, social media links

🧠 Intelligent Analysis

  • Content classification: Automatic categorization of content types
  • Sentiment analysis: Rating estimation from text content
  • Smart tagging: Auto-generated tags based on content analysis
  • Author extraction: Intelligent user identification
  • Position tracking: Exact location of each element

πŸ“Š Structured Output

  • JSON format: Clean, organized data structure
  • Multiple formats: JSON, CSV, Markdown, HTML
  • Rich metadata: Word counts, timestamps, source attribution
  • Performance metrics: Scraping time tracking

πŸ”§ Advanced Capabilities

  • Infinite scroll handling: Automatic scrolling for dynamic content
  • JavaScript execution: Custom JS code execution
  • Proxy support: Built-in proxy configuration
  • Async processing: High-performance concurrent scraping
  • Error handling: Robust error recovery and reporting

πŸ› οΈ Installation

Prerequisites

  • Python 3.8+
  • Chrome/Chromium browser

Setup

# Clone the repository
git clone https://github.com/asish231/CRAWAL4AI_based_scrapper.git
cd CRAWAL4AI_based_scrapper

# Install dependencies
pip install -r requirements.txt

# The chromedriver.exe is included in the repository

πŸš€ Quick Start

Basic Usage

# Scrape a single URL
python advanced_scraper.py -u https://example.com

# Scrape multiple URLs
python advanced_scraper.py -u https://example.com https://news.ycombinator.com

# Scrape URLs from file
python advanced_scraper.py -f urls.txt

Advanced Options

# Custom settings
python advanced_scraper.py -u https://example.com --headless false --infinite-scroll true

# With custom JavaScript
python advanced_scraper.py -u https://example.com --js-code "window.scrollTo(0, document.body.scrollHeight);"

# Custom output directory
python advanced_scraper.py -u https://example.com --output-dir my_data --filename-prefix custom_scrape

πŸ“‹ Command Line Options

Option Description Default
-u, --urls URLs to scrape Required
-f, --file File containing URLs Alternative to -u
--headless Run browser in headless mode true
--infinite-scroll Enable infinite scroll true
--delay Delay between requests (seconds) 1.0
--js-code Custom JavaScript code None
--css-selector CSS selector for extraction None
--output-dir Output directory scraped_data
--verbose Enable verbose logging false

πŸ“Š Output Structure

The scraper outputs data in a structured JSON format:

{
  "title": "Page Title",
  "photos": [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg"
  ],
  "reviews": [
    {
      "user": "Content Block 1",
      "text": "Main content text...",
      "rating": 4,
      "date": "2025-08-06",
      "source": "paragraph_content",
      "content_type": "text",
      "word_count": 25
    }
  ],
  "tags": ["Technology", "Web Development", "Open Source"],
  "scraping_info": {
    "url": "https://example.com",
    "scraped_at": "2025-08-06T21:46:59.157",
    "scraping_time_seconds": 5.17,
    "success": true,
    "status_code": 200
  }
}

πŸ“ Output Files

The scraper generates multiple output formats:

scraped_data/
β”œβ”€β”€ json/           # Structured JSON data
β”œβ”€β”€ csv/            # Tabular CSV format
β”œβ”€β”€ markdown/       # Individual markdown files
β”œβ”€β”€ html/           # Raw HTML files
└── logs/           # Scraping logs

🎯 Use Cases

🏒 Business Intelligence

  • Competitor analysis
  • Market research
  • Price monitoring
  • Content analysis

πŸ“Š Data Science

  • Dataset creation
  • Content analysis
  • Sentiment analysis
  • Web data mining

πŸ€– AI/ML Training

  • Training data collection
  • Content classification
  • Natural language processing
  • Computer vision datasets

πŸ“° Content Management

  • News aggregation
  • Content curation
  • Social media monitoring
  • Blog content extraction

πŸ”§ Configuration

Custom Configuration

Create a config.json file:

{
  "output_dir": "scraped_data",
  "max_retries": 3,
  "delay_between_requests": 1,
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "enable_infinite_scroll": true,
  "scroll_delay": 2,
  "max_scroll_attempts": 5
}

URL File Format

Create a urls.txt file:

https://example.com
https://news.ycombinator.com
https://github.com
# Comments start with #
https://stackoverflow.com

πŸ“ˆ Performance

Benchmarks

  • Speed: 3-5 seconds per page
  • Accuracy: 100% content extraction
  • Memory: Efficient async processing
  • Scalability: Handles multiple URLs concurrently

Example Results

==================================================
SCRAPING STATISTICS
==================================================
Total Urls: 1
Successful: 1
Failed: 0
Success Rate: 100.00%
Total Photos: 3
Total Reviews: 322
Total Tags: 3
Average Scraping Time: 5.17 seconds

πŸ›‘οΈ Error Handling

The scraper includes robust error handling:

  • Network errors: Automatic retries
  • JavaScript errors: Graceful fallbacks
  • Parsing errors: Detailed error reporting
  • Timeout handling: Configurable timeouts

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

πŸ“„ License

This project includes:

  • Main code: MIT License
  • ChromeDriver: See LICENSE.chromedriver and THIRD_PARTY_NOTICES.chromedriver

πŸ”— Dependencies

  • crawl4ai: Advanced web crawling framework
  • beautifulsoup4: HTML parsing and extraction
  • requests: HTTP library for web requests
  • asyncio: Asynchronous programming support

πŸ†˜ Support

For issues and questions:

  1. Check the Issues page
  2. Create a new issue with detailed information
  3. Include error logs and example URLs

πŸŽ‰ Acknowledgments

  • Crawl4AI: Powerful web crawling framework
  • BeautifulSoup: HTML parsing library
  • Chromium: Web browser engine

Built with ❀️ for comprehensive web data extraction

About

A powerful, comprehensive web scraping tool built with Crawl4AI that extracts ALL data from websites including text, images, tables, forms, and interactive elements with intelligent content analysis.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages