A Python-based web scraping toolkit that crawls websites and converts pages to clean Markdown format, perfect for building knowledge bases or training LLMs.
- BFS-based web crawling - Intelligently discovers all pages on a website (see the crawl sketch after this list)
- Interactive configuration - User-friendly prompts for all settings
- Progress tracking - Real-time progress bars with tqdm
- Resume capability - Skip already-scraped pages automatically
- Depth control - Limit how deep to crawl (useful for large sites)
- URL filtering - Only scrape pages matching specific patterns
- Native HTML-to-Markdown - No external dependencies like crwl
- Polite scraping - Built-in delays between requests
- Metadata tracking - JSON file with crawl statistics and errors
- Smart cleaning - Removes navigation, headers, footers automatically
- Two modes - In-place or copy to separate directory
- Statistics - Shows size reduction and file counts
- Interactive selection - Choose from scraped domains
- Custom paths - Clean any directory of markdown files
- LLM-optimized - Removes unnecessary content for AI processing
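The crawl itself is a plain breadth-first traversal. Below is a minimal sketch of that loop, assuming the libraries from the "Built with" list at the end of this README (requests, BeautifulSoup, html2text); the function name and details are illustrative, and the real scraper.py also adds progress bars, polite delays, resume logic, and metadata tracking:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import html2text
import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_depth=0, url_filter=""):
    """Hypothetical sketch of a BFS crawl: yields (url, markdown) pairs."""
    base_netloc = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # (url, depth)
    visited = {start_url}
    converter = html2text.HTML2Text()
    converter.ignore_images = True

    while queue:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # the real script records failures in _metadata.json

        soup = BeautifulSoup(response.text, "html.parser")
        yield url, converter.handle(response.text)

        # Stop expanding once the depth limit is reached (0 = unlimited).
        if max_depth and depth >= max_depth:
            continue

        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"]).split("#")[0]
            same_site = urlparse(next_url).netloc == base_netloc
            matches = (not url_filter) or (url_filter in next_url)
            if same_site and matches and next_url not in visited:
                visited.add(next_url)
                queue.append((next_url, depth + 1))
```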
- Clone or download this repository
- Create a virtual environment:

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment:

  ```bash
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
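If you ever need to recreate requirements.txt, the "Built with" list at the end of this README suggests it contains roughly these four packages (versions unpinned here as an assumption):

```
requests
beautifulsoup4
html2text
tqdm
```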
- Activate the virtual environment (if not already active):

  ```bash
  source venv/bin/activate
  ```

- Run the scraper:

  ```bash
  ./scraper.py   # or python3 scraper.py
  ```

- Follow the interactive prompts:
  - Enter the URL to scrape (e.g., https://example.com)
  - Set maximum crawl depth (0 for unlimited)
  - Optionally filter URLs by pattern
  - Confirm and start scraping
- Optional: Clean the files after scraping when prompted
```
Universal AI Scraper
======================================================================
Enter the URL to scrape: sitename.com
Output directory: /path/to/scrapes/sitename.com
──────────────────────────────────────────────────────────────────────
Maximum crawl depth (0 for unlimited) [0]: 2
──────────────────────────────────────────────────────────────────────
Optional: Filter URLs by pattern (e.g., '/docs/' to only scrape documentation)
URL pattern filter (press Enter to skip) []: /guides/

Configuration
======================================================================
URL: https://sitename.com
Output: /path/to/scrapes/sitename.com
Max depth: 2
URL filter: /guides/
Proceed with scraping? (y/n) [y]: y

Starting crawl of https://sitename.com
...
```
- Activate the virtual environment (if not already active):

  ```bash
  source venv/bin/activate
  ```

- Run the cleaner:

  ```bash
  ./cleaner.py   # or python3 cleaner.py
  ```

- Follow the interactive prompts:
  - Select from available scraped domains, or specify a custom path
  - Choose cleaning mode (copy or in-place)
  - Confirm and start cleaning
```
Markdown Cleaner
======================================================================
Scanning for scraped content...
Found 2 scraped domain(s):
  1. sitename.com (45 files)
  2. example.com (12 files)
Select domain (1-2) [1]: 1
──────────────────────────────────────────────────────────────────────
Cleaning modes:
  1. Copy - Create cleaned copies in a 'cleaned' subdirectory (recommended)
  2. In-place - Overwrite original files
Clean in-place (overwrite originals)? (y/n) [n]: n

Configuration
======================================================================
Input directory: /path/to/scrapes/sitename.com
Mode: COPY
Proceed with cleaning? (y/n) [y]: y
...
```
After scraping and cleaning, your files will be organized like this:
```
project/
├── scraper.py           # Main scraping script
├── cleaner.py           # Standalone cleaning script
├── requirements.txt     # Python dependencies
├── README.md            # This file
├── venv/                # Virtual environment (created)
└── scrapes/             # Scraped content (created)
    ├── example.com/
    │   ├── index.md
    │   ├── about.md
    │   ├── contact.md
    │   ├── _metadata.json   # Crawl statistics
    │   └── cleaned/         # Cleaned copies (if using copy mode)
    │       ├── index.md
    │       ├── about.md
    │       └── contact.md
    └── another-site.com/
        └── ...
```
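Because the cleaned output is just Markdown on disk, building an LLM-ready corpus is a matter of globbing the cleaned/ directories. A minimal sketch that follows the layout above (the corpus.md output name is an assumption):

```python
from pathlib import Path

scrapes = Path("scrapes")

# Collect every cleaned Markdown file across all scraped domains.
documents = []
for md_file in sorted(scrapes.glob("*/cleaned/*.md")):
    documents.append(
        f"<!-- source: {md_file.parent.parent.name}/{md_file.name} -->\n"
        + md_file.read_text(encoding="utf-8")
    )

# Write one combined corpus file (hypothetical name) for downstream processing.
Path("corpus.md").write_text("\n\n".join(documents), encoding="utf-8")
print(f"Combined {len(documents)} cleaned pages into corpus.md")
```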
```bash
./scraper.py
# Enter: https://docs.example.com
# Depth: 0 (unlimited)
# Filter: (leave empty)
```

```bash
./scraper.py
# Enter: https://example.com
# Depth: 0
# Filter: /docs/
```

```bash
./scraper.py
# Enter: https://example.com
# Depth: 2
# Filter: (leave empty)
```

Just run the scraper again with the same URL - it will automatically skip already-scraped pages!
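The resume behaviour is essentially a filesystem check before each fetch. A minimal sketch of the idea, assuming filenames are derived from URL paths as in the layout above (the helper names here are hypothetical, not scraper.py's actual functions):

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path(output_dir: Path, url: str) -> Path:
    """Hypothetical mapping from URL to Markdown filename."""
    slug = urlparse(url).path.strip("/").replace("/", "_") or "index"
    return output_dir / f"{slug}.md"


def should_skip(output_dir: Path, url: str) -> bool:
    """Skip URLs whose Markdown file already exists from a previous run."""
    return output_path(output_dir, url).exists()


# Example: only fetch pages that have not been scraped yet.
# urls_to_fetch = [u for u in discovered if not should_skip(Path("scrapes/example.com"), u)]
```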
```bash
./cleaner.py
# Select domain 1, clean in copy mode

./cleaner.py
# Select domain 2, clean in copy mode
```

- 0 (unlimited): Crawls the entire website
- 1: Only pages linked from the homepage
- 2: Homepage + pages 2 clicks away
- 3+: Deeper levels (use for large sites)
Use URL patterns to scrape specific sections:
- /blog/ - Only blog posts
- /docs/ - Only documentation
- /api/ - Only API reference
- Leave empty to scrape everything
- Copy mode (recommended): Keeps originals, creates cleaned versions in a cleaned/ subdirectory
- In-place mode: Overwrites original files (saves disk space)
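A rough sketch of how the two modes and a line-based cleaning pass fit together, assuming simple pattern heuristics for navigation/footer noise (the patterns and function names are illustrative, not cleaner.py's actual rules):

```python
import re
from pathlib import Path

# Illustrative patterns for lines that are usually navigation or footer noise.
NOISE_PATTERNS = [
    re.compile(r"^\s*\*\s*\[(Home|About|Contact|Login)\]", re.IGNORECASE),
    re.compile(r"(?i)^\s*(©|copyright)\b"),
    re.compile(r"(?i)^\s*Skip to (main )?content\s*$"),
]


def clean_markdown(text: str) -> str:
    """Drop lines matching the noise patterns above."""
    kept = [line for line in text.splitlines()
            if not any(p.search(line) for p in NOISE_PATTERNS)]
    return "\n".join(kept).strip() + "\n"


def clean_directory(input_dir: Path, in_place: bool = False) -> None:
    """Copy mode writes to input_dir/cleaned/; in-place mode overwrites originals."""
    out_dir = input_dir if in_place else input_dir / "cleaned"
    out_dir.mkdir(exist_ok=True)
    for md_file in input_dir.glob("*.md"):
        cleaned = clean_markdown(md_file.read_text(encoding="utf-8"))
        (out_dir / md_file.name).write_text(cleaned, encoding="utf-8")
```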
Each scrape creates a _metadata.json file with:
- Timestamp of the crawl
- Starting URL and base URL
- Configuration (depth, filters)
- Statistics (successful, failed, skipped)
- List of all scraped pages
- List of failed pages with error messages
Example:
```json
{
  "timestamp": "2024-10-09T08:30:00",
  "start_url": "https://example.com",
  "total_discovered": 150,
  "successful": 145,
  "failed": 5,
  "successful_pages": [
    {"url": "https://example.com/page1", "filename": "page1.md"},
    ...
  ]
}
```
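Since it is plain JSON, the metadata can be inspected programmatically with the standard json module. A minimal sketch, assuming the fields shown in the example above:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to your scraped domain.
meta = json.loads(Path("scrapes/example.com/_metadata.json").read_text(encoding="utf-8"))

print(f"Crawled {meta['start_url']} at {meta['timestamp']}")
print(f"{meta['successful']} pages succeeded, {meta['failed']} failed")
for page in meta["successful_pages"][:5]:
    print(f"  {page['filename']}  <-  {page['url']}")
```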
Make sure you've activated the virtual environment:

```bash
source venv/bin/activate
```

If you see a "permission denied" error, make the scripts executable:

```bash
chmod +x scraper.py cleaner.py
```

If fewer pages were scraped than expected, try increasing the crawl depth or removing URL filters.
If the crawl is picking up too many pages, use URL filtering to limit it to specific sections, or reduce the crawl depth.
Some websites have many pages. The scraper adds delays between requests (0.5-1s) to be polite. This is intentional!
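A jittered delay in the 0.5-1s range mentioned above is only a couple of lines; a sketch of what such a pause typically looks like:

```python
import random
import time


def polite_pause(min_seconds: float = 0.5, max_seconds: float = 1.0) -> None:
    """Sleep for a random interval between requests to avoid hammering the server."""
    time.sleep(random.uniform(min_seconds, max_seconds))
```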
Feel free to modify these scripts for your specific needs! Some ideas:
- Add sitemap.xml parsing for faster discovery
- Respect robots.txt (see the sketch below)
- Add concurrent requests (with rate limiting)
- Custom cleaning rules for specific websites
- Export to different formats (JSON, TXT, etc.)
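As an example of the robots.txt idea, the standard library's urllib.robotparser needs no extra dependency; a minimal sketch (the user-agent string and integration points are assumptions, since the current scripts do not do this):

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "UniversalAIScraper"  # hypothetical user-agent string


def build_robot_parser(base_url: str) -> RobotFileParser:
    """Fetch and parse robots.txt for the site being crawled."""
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()
    return parser


# Usage inside the crawl loop (sketch):
# robots = build_robot_parser("https://example.com")
# if robots.can_fetch(USER_AGENT, url):
#     ...fetch the page...
```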
These scripts are provided as-is for educational and personal use.
Built with:
- requests - HTTP library
- BeautifulSoup4 - HTML parsing
- html2text - HTML to Markdown conversion
- tqdm - Progress bars
Happy Scraping!