🌐 Universal AI Scraper

A Python-based web scraping toolkit that crawls websites and converts pages to clean Markdown format, perfect for building knowledge bases or training LLMs.

✨ Features

Scraper (scraper.py)

  • πŸ” BFS-based web crawling - Intelligently discovers all pages on a website
  • 🎯 Interactive configuration - User-friendly prompts for all settings
  • πŸ“Š Progress tracking - Real-time progress bars with tqdm
  • πŸ’Ύ Resume capability - Skip already-scraped pages automatically
  • πŸŽ›οΈ Depth control - Limit how deep to crawl (useful for large sites)
  • πŸ”Ž URL filtering - Only scrape pages matching specific patterns
  • πŸ“ Native HTML-to-Markdown - No external dependencies like crwl
  • ⚑ Polite scraping - Built-in delays between requests
  • πŸ“ˆ Metadata tracking - JSON file with crawl statistics and errors
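
Conceptually, the crawl is a breadth-first search: start from one URL, collect same-domain links, and visit them level by level while honoring the depth limit, the optional URL filter, and a polite delay. Below is a minimal sketch of that idea - illustrative only, not the actual scraper.py code - assuming requests is installed and leaving the HTML-to-Markdown conversion to the caller:

# Minimal BFS crawl sketch (illustrative only, not the actual scraper.py code).
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth=0, url_filter=""):
    """Yield (url, html) pairs for every same-domain page discovered via BFS."""
    base_netloc = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])        # each entry is (url, depth)
    visited = set()

    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, timeout=10)
        response.raise_for_status()
        time.sleep(0.5)                    # polite delay between requests

        # The optional pattern filter decides which pages are kept;
        # links are still followed so filtered sections can be reached.
        if not url_filter or url_filter in url:
            yield url, response.text

        # Stop descending once the depth limit is reached (0 = unlimited).
        if max_depth and depth >= max_depth:
            continue

        extractor = LinkExtractor()
        extractor.feed(response.text)
        for href in extractor.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == base_netloc and absolute not in visited:
                queue.append((absolute, depth + 1))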

Cleaner (cleaner.py)

  • 🧹 Smart cleaning - Removes navigation, headers, footers automatically (sketched below)
  • 🎨 Two modes - In-place or copy to separate directory
  • 📊 Statistics - Shows size reduction and file counts
  • 🎯 Interactive selection - Choose from scraped domains
  • 📁 Custom paths - Clean any directory of markdown files
  • 💡 LLM-optimized - Removes unnecessary content for AI processing
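
In practice, "smart cleaning" boils down to a line-level heuristic pass. The sketch below is illustrative only - the noise patterns are assumptions, not the actual cleaner.py rules:

# Illustrative cleaning pass (not the actual cleaner.py logic).
import re

# Assumed examples of navigation/footer noise; the real script has its own rules.
NOISE_PATTERNS = [
    r"^skip to (main )?content$",
    r"^(home|menu|navigation)$",
    r"^copyright .*$",
    r"^(privacy policy|terms of service)\b.*$",
]


def clean_markdown(text):
    noise = [re.compile(p, re.IGNORECASE) for p in NOISE_PATTERNS]
    kept = [line for line in text.splitlines()
            if not any(p.match(line.strip()) for p in noise)]
    # Collapse the runs of blank lines left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip() + "\n"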

📦 Installation

  1. Clone or download this repository

  2. Create a virtual environment:

    python3 -m venv venv
  3. Activate the virtual environment:

    source venv/bin/activate
  4. Install dependencies:

    pip install -r requirements.txt
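
The only third-party packages this README mentions are requests (HTTP fetching) and tqdm (progress bars), so requirements.txt should contain at least something like the following; the file in the repository is the authoritative list:

# Approximate contents - see requirements.txt in the repository.
requests
tqdm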

🚀 Usage

Scraping a Website

  1. Activate the virtual environment (if not already active):

    source venv/bin/activate
  2. Run the scraper:

    ./scraper.py
    # or
    python3 scraper.py
  3. Follow the interactive prompts:

    • Enter the URL to scrape (e.g., https://example.com)
    • Set maximum crawl depth (0 for unlimited)
    • Optionally filter URLs by pattern
    • Confirm and start scraping
  4. Optional: Clean the files after scraping when prompted

Example Scraper Session

🌐 Universal AI Scraper
======================================================================

Enter the URL to scrape: sitename.com

Output directory: /path/to/scrapes/sitename.com

──────────────────────────────────────────────────────────────────────
Maximum crawl depth (0 for unlimited) [0]: 2

──────────────────────────────────────────────────────────────────────
Optional: Filter URLs by pattern (e.g., '/docs/' to only scrape documentation)
URL pattern filter (press Enter to skip) []: /guides/

🔧 Configuration
======================================================================

URL: https://sitename.com
Output: /path/to/scrapes/sitename.com
Max depth: 2
URL filter: /guides/

Proceed with scraping? (y/n) [y]: y

🚀 Starting crawl of https://sitename.com
...

Cleaning Markdown Files

  1. Activate the virtual environment (if not already active):

    source venv/bin/activate
  2. Run the cleaner:

    ./cleaner.py
    # or
    python3 cleaner.py
  3. Follow the interactive prompts:

    • Select from available scraped domains, or specify custom path
    • Choose cleaning mode (copy or in-place)
    • Confirm and start cleaning

Example Cleaner Session

🧹 Markdown Cleaner
======================================================================

Scanning for scraped content...

Found 2 scraped domain(s):
  1. sitename.com (45 files)
  2. example.com (12 files)

Select domain (1-2) [1]: 1

──────────────────────────────────────────────────────────────────────
Cleaning modes:
  1. Copy - Create cleaned copies in a 'cleaned' subdirectory (recommended)
  2. In-place - Overwrite original files

Clean in-place (overwrite originals)? (y/n) [n]: n

🔧 Configuration
======================================================================

Input directory: /path/to/scrapes/sitename.com
Mode: COPY

Proceed with cleaning? (y/n) [y]: y
...

πŸ“ Directory Structure

After scraping and cleaning, your files will be organized like this:

project/
├── scraper.py              # Main scraping script
├── cleaner.py              # Standalone cleaning script
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── venv/                   # Virtual environment (created)
└── scrapes/                # Scraped content (created)
    ├── example.com/
    │   ├── index.md
    │   ├── about.md
    │   ├── contact.md
    │   ├── _metadata.json  # Crawl statistics
    │   └── cleaned/        # Cleaned copies (if using copy mode)
    │       ├── index.md
    │       ├── about.md
    │       └── contact.md
    └── another-site.com/
        └── ...
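
The layout above implies that each URL path maps to one flat Markdown file: the site root becomes index.md, /about becomes about.md, and so on. Here is a minimal sketch of that kind of mapping, using a hypothetical url_to_filename() helper - the real naming scheme in scraper.py may differ:

# Hypothetical URL-to-filename mapping; scraper.py's real scheme may differ.
import re
from urllib.parse import urlparse


def url_to_filename(url):
    path = urlparse(url).path.strip("/")
    if not path:
        return "index.md"                                 # site root -> index.md
    return re.sub(r"[^a-zA-Z0-9_.-]+", "-", path).strip("-") + ".md"


print(url_to_filename("https://example.com/"))            # index.md
print(url_to_filename("https://example.com/about"))       # about.md
print(url_to_filename("https://example.com/docs/setup"))  # docs-setup.md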

🎯 Common Use Cases

1. Scrape an Entire Website

./scraper.py
# Enter: https://docs.example.com
# Depth: 0 (unlimited)
# Filter: (leave empty)

2. Scrape Only Documentation Pages

./scraper.py
# Enter: https://example.com
# Depth: 0
# Filter: /docs/

3. Shallow Scrape (Max 2 Levels Deep)

./scraper.py
# Enter: https://example.com
# Depth: 2
# Filter: (leave empty)

4. Resume an Interrupted Scrape

Just run the scraper again with the same URL - it will automatically skip already-scraped pages!
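
Resuming works because every page maps to a predictable output file, so the crawler can simply skip URLs whose Markdown file already exists. A minimal sketch, reusing the hypothetical url_to_filename() helper from the directory-structure section:

# Resume check sketch: skip URLs whose output file already exists.
# url_to_filename() is the hypothetical helper sketched earlier.
from pathlib import Path


def should_scrape(url, output_dir):
    return not (Path(output_dir) / url_to_filename(url)).exists()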

5. Clean Multiple Domains

./cleaner.py
# Select domain 1, clean in copy mode
./cleaner.py
# Select domain 2, clean in copy mode

🔧 Configuration Tips

Crawl Depth

  • 0 (unlimited): Crawls the entire website
  • 1: The homepage plus pages linked directly from it
  • 2: Everything within two clicks of the homepage
  • 3+: Deeper levels (useful for large sites)

URL Filtering

Use URL patterns to scrape specific sections:

  • /blog/ - Only blog posts
  • /docs/ - Only documentation
  • /api/ - Only API reference
  • Leave empty to scrape everything
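
The examples above suggest the filter is a plain substring match against the full URL (the real scraper.py may be more elaborate):

# Assumed substring-style filter check (illustrative only).
def matches_filter(url, pattern):
    return not pattern or pattern in url


matches_filter("https://example.com/docs/intro", "/docs/")   # True
matches_filter("https://example.com/blog/post", "/docs/")    # False
matches_filter("https://example.com/blog/post", "")          # True (no filter)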

Cleaning Modes

  • Copy mode (recommended): Keeps originals, creates cleaned versions in cleaned/ subdirectory
  • In-place mode: Overwrites original files (saves disk space)
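
In terms of file handling, the only difference between the two modes is where the cleaned text is written. A rough sketch (not the actual cleaner.py code):

# Where cleaned output goes in each mode (illustrative only).
from pathlib import Path


def output_path(md_file, in_place=False):
    md_file = Path(md_file)
    if in_place:
        return md_file                            # overwrite the original
    cleaned_dir = md_file.parent / "cleaned"      # copy mode: cleaned/ subdirectory
    cleaned_dir.mkdir(exist_ok=True)
    return cleaned_dir / md_file.name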

📊 Metadata

Each scrape creates a _metadata.json file with:

  • Timestamp of the crawl
  • Starting URL and base URL
  • Configuration (depth, filters)
  • Statistics (successful, failed, skipped)
  • List of all scraped pages
  • List of failed pages with error messages

Example:

{
  "timestamp": "2024-10-09T08:30:00",
  "start_url": "https://example.com",
  "total_discovered": 150,
  "successful": 145,
  "failed": 5,
  "successful_pages": [
    {"url": "https://example.com/page1", "filename": "page1.md"},
    ...
  ]
}
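
The metadata file makes it easy to audit a crawl afterwards. For example, a few lines of Python can summarize it and list failures (field names beyond those shown above, such as failed_pages, are assumptions - adjust to the actual file):

# Quick crawl audit from _metadata.json.
# "failed_pages" is an assumed field name; check the actual file.
import json

with open("scrapes/example.com/_metadata.json") as f:
    metadata = json.load(f)

print(f"Scraped {metadata['successful']} of {metadata['total_discovered']} discovered pages")
for page in metadata.get("failed_pages", []):
    print("FAILED:", page)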

πŸ› οΈ Troubleshooting

ModuleNotFoundError: No module named 'requests'

Make sure you've activated the virtual environment:

source venv/bin/activate

Permission denied when running scripts

Make scripts executable:

chmod +x scraper.py cleaner.py

Scraper misses some pages

Try increasing the crawl depth or removing URL filters.

Too many files scraped

Use URL filtering to limit to specific sections, or reduce crawl depth.

Script hangs or is very slow

Some websites have many pages. The scraper adds delays between requests (0.5-1s) to be polite. This is intentional!

🤝 Contributing

Feel free to modify these scripts for your specific needs! Some ideas:

  • Add sitemap.xml parsing for faster discovery
  • Implement robots.txt support (see the sketch below)
  • Add concurrent requests (with rate limiting)
  • Custom cleaning rules for specific websites
  • Export to different formats (JSON, TXT, etc.)
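
For example, robots.txt support could be added with nothing but the standard library's urllib.robotparser. A minimal sketch (the user-agent string is an arbitrary example):

# robots.txt check using only the standard library.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_checker(start_url, user_agent="UniversalAIScraper"):
    parts = urlparse(start_url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return lambda url: parser.can_fetch(user_agent, url)


allowed = robots_checker("https://example.com")
if not allowed("https://example.com/private/page"):
    print("Skipping disallowed URL")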

πŸ“ License

These scripts are provided as-is for educational and personal use.

πŸ™ Credits

Built with Python, using requests for HTTP and tqdm for progress bars.


Happy Scraping! 🎉
