A Python-based web scraping toolkit that crawls websites and converts pages to clean Markdown format, perfect for building knowledge bases or training LLMs.
- BFS-based web crawling - Intelligently discovers all pages on a website (see the crawl sketch after this list)
- Interactive configuration - User-friendly prompts for all settings
- Progress tracking - Real-time progress bars with tqdm
- Resume capability - Skip already-scraped pages automatically
- Depth control - Limit how deep to crawl (useful for large sites)
- URL filtering - Only scrape pages matching specific patterns
- Native HTML-to-Markdown - No external dependencies like crwl
- Polite scraping - Built-in delays between requests
- Metadata tracking - JSON file with crawl statistics and errors
- Smart cleaning - Removes navigation, headers, footers automatically
- Two modes - In-place or copy to separate directory
- Statistics - Shows size reduction and file counts
- Interactive selection - Choose from scraped domains
- Custom paths - Clean any directory of markdown files
- LLM-optimized - Removes unnecessary content for AI processing
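The crawl itself is a plain breadth-first traversal. Below is a minimal sketch of that loop, assuming the libraries from the "Built with" list at the end of this README (requests, BeautifulSoup, html2text); the function name and details are illustrative, and the real scraper.py also adds progress bars, polite delays, resume logic, and metadata tracking:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import html2text
import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_depth=0, url_filter=""):
    """Hypothetical sketch of a BFS crawl: yields (url, markdown) pairs."""
    base_netloc = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # (url, depth)
    visited = {start_url}
    converter = html2text.HTML2Text()
    converter.ignore_images = True

    while queue:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # the real script records failures in _metadata.json

        soup = BeautifulSoup(response.text, "html.parser")
        yield url, converter.handle(response.text)

        # Stop expanding once the depth limit is reached (0 = unlimited).
        if max_depth and depth >= max_depth:
            continue

        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"]).split("#")[0]
            same_site = urlparse(next_url).netloc == base_netloc
            matches = (not url_filter) or (url_filter in next_url)
            if same_site and matches and next_url not in visited:
                visited.add(next_url)
                queue.append((next_url, depth + 1))
```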
- Clone or download this repository
- Create a virtual environment:

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment:

  ```bash
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
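If you ever need to recreate requirements.txt, the "Built with" list at the end of this README suggests it contains roughly these four packages (versions unpinned here as an assumption):

```
requests
beautifulsoup4
html2text
tqdm
```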
- Activate the virtual environment (if not already active):

  ```bash
  source venv/bin/activate
  ```

- Run the scraper:

  ```bash
  ./scraper.py   # or python3 scraper.py
  ```

- Follow the interactive prompts:
  - Enter the URL to scrape (e.g., https://example.com)
  - Set maximum crawl depth (0 for unlimited)
  - Optionally filter URLs by pattern
  - Confirm and start scraping
- Optional: Clean the files after scraping when prompted
```
Universal AI Scraper
======================================================================
Enter the URL to scrape: sitename.com
Output directory: /path/to/scrapes/sitename.com
──────────────────────────────────────────────────────────────────────
Maximum crawl depth (0 for unlimited) [0]: 2
──────────────────────────────────────────────────────────────────────
Optional: Filter URLs by pattern (e.g., '/docs/' to only scrape documentation)
URL pattern filter (press Enter to skip) []: /guides/

Configuration
======================================================================
URL: https://sitename.com
Output: /path/to/scrapes/sitename.com
Max depth: 2
URL filter: /guides/
Proceed with scraping? (y/n) [y]: y

Starting crawl of https://sitename.com
...
```
- Activate the virtual environment (if not already active):

  ```bash
  source venv/bin/activate
  ```

- Run the cleaner:

  ```bash
  ./cleaner.py   # or python3 cleaner.py
  ```

- Follow the interactive prompts:
  - Select from available scraped domains, or specify a custom path
  - Choose cleaning mode (copy or in-place)
  - Confirm and start cleaning
```
Markdown Cleaner
======================================================================
Scanning for scraped content...
Found 2 scraped domain(s):
  1. sitename.com (45 files)
  2. example.com (12 files)
Select domain (1-2) [1]: 1
──────────────────────────────────────────────────────────────────────
Cleaning modes:
  1. Copy - Create cleaned copies in a 'cleaned' subdirectory (recommended)
  2. In-place - Overwrite original files
Clean in-place (overwrite originals)? (y/n) [n]: n

Configuration
======================================================================
Input directory: /path/to/scrapes/sitename.com
Mode: COPY
Proceed with cleaning? (y/n) [y]: y
...
```
After scraping and cleaning, your files will be organized like this:
```
project/
├── scraper.py           # Main scraping script
├── cleaner.py           # Standalone cleaning script
├── requirements.txt     # Python dependencies
├── README.md            # This file
├── venv/                # Virtual environment (created)
└── scrapes/             # Scraped content (created)
    ├── example.com/
    │   ├── index.md
    │   ├── about.md
    │   ├── contact.md
    │   ├── _metadata.json   # Crawl statistics
    │   └── cleaned/         # Cleaned copies (if using copy mode)
    │       ├── index.md
    │       ├── about.md
    │       └── contact.md
    └── another-site.com/
        └── ...
```
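Because the cleaned output is just Markdown on disk, building an LLM-ready corpus is a matter of globbing the cleaned/ directories. A minimal sketch that follows the layout above (the corpus.md output name is an assumption):

```python
from pathlib import Path

scrapes = Path("scrapes")

# Collect every cleaned Markdown file across all scraped domains.
documents = []
for md_file in sorted(scrapes.glob("*/cleaned/*.md")):
    documents.append(
        f"<!-- source: {md_file.parent.parent.name}/{md_file.name} -->\n"
        + md_file.read_text(encoding="utf-8")
    )

# Write one combined corpus file (hypothetical name) for downstream processing.
Path("corpus.md").write_text("\n\n".join(documents), encoding="utf-8")
print(f"Combined {len(documents)} cleaned pages into corpus.md")
```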
```bash
./scraper.py
# Enter: https://docs.example.com
# Depth: 0 (unlimited)
# Filter: (leave empty)
```

```bash
./scraper.py
# Enter: https://example.com
# Depth: 0
# Filter: /docs/
```

```bash
./scraper.py
# Enter: https://example.com
# Depth: 2
# Filter: (leave empty)
```

Just run the scraper again with the same URL - it will automatically skip already-scraped pages!
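The resume behaviour is essentially a filesystem check before each fetch. A minimal sketch of the idea, assuming filenames are derived from URL paths as in the layout above (the helper names here are hypothetical, not scraper.py's actual functions):

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path(output_dir: Path, url: str) -> Path:
    """Hypothetical mapping from URL to Markdown filename."""
    slug = urlparse(url).path.strip("/").replace("/", "_") or "index"
    return output_dir / f"{slug}.md"


def should_skip(output_dir: Path, url: str) -> bool:
    """Skip URLs whose Markdown file already exists from a previous run."""
    return output_path(output_dir, url).exists()


# Example: only fetch pages that have not been scraped yet.
# urls_to_fetch = [u for u in discovered if not should_skip(Path("scrapes/example.com"), u)]
```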
```bash
./cleaner.py
# Select domain 1, clean in copy mode

./cleaner.py
# Select domain 2, clean in copy mode
```

- 0 (unlimited): Crawls the entire website
- 1: Only pages linked from the homepage
- 2: Homepage + pages 2 clicks away
- 3+: Deeper levels (use for large sites)
Use URL patterns to scrape specific sections:
- /blog/ - Only blog posts
- /docs/ - Only documentation
- /api/ - Only API reference
- Leave empty to scrape everything
- Copy mode (recommended): Keeps originals, creates cleaned versions in a cleaned/ subdirectory
- In-place mode: Overwrites original files (saves disk space)
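A rough sketch of how the two modes and a line-based cleaning pass fit together, assuming simple pattern heuristics for navigation/footer noise (the patterns and function names are illustrative, not cleaner.py's actual rules):

```python
import re
from pathlib import Path

# Illustrative patterns for lines that are usually navigation or footer noise.
NOISE_PATTERNS = [
    re.compile(r"^\s*\*\s*\[(Home|About|Contact|Login)\]", re.IGNORECASE),
    re.compile(r"(?i)^\s*(©|copyright)\b"),
    re.compile(r"(?i)^\s*Skip to (main )?content\s*$"),
]


def clean_markdown(text: str) -> str:
    """Drop lines matching the noise patterns above."""
    kept = [line for line in text.splitlines()
            if not any(p.search(line) for p in NOISE_PATTERNS)]
    return "\n".join(kept).strip() + "\n"


def clean_directory(input_dir: Path, in_place: bool = False) -> None:
    """Copy mode writes to input_dir/cleaned/; in-place mode overwrites originals."""
    out_dir = input_dir if in_place else input_dir / "cleaned"
    out_dir.mkdir(exist_ok=True)
    for md_file in input_dir.glob("*.md"):
        cleaned = clean_markdown(md_file.read_text(encoding="utf-8"))
        (out_dir / md_file.name).write_text(cleaned, encoding="utf-8")
```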
Each scrape creates a _metadata.json file with:
- Timestamp of the crawl
- Starting URL and base URL
- Configuration (depth, filters)
- Statistics (successful, failed, skipped)
- List of all scraped pages
- List of failed pages with error messages
Example:
```json
{
  "timestamp": "2024-10-09T08:30:00",
  "start_url": "https://example.com",
  "total_discovered": 150,
  "successful": 145,
  "failed": 5,
  "successful_pages": [
    {"url": "https://example.com/page1", "filename": "page1.md"},
    ...
  ]
}
```
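Since it is plain JSON, the metadata can be inspected programmatically with the standard json module. A minimal sketch, assuming the fields shown in the example above:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to your scraped domain.
meta = json.loads(Path("scrapes/example.com/_metadata.json").read_text(encoding="utf-8"))

print(f"Crawled {meta['start_url']} at {meta['timestamp']}")
print(f"{meta['successful']} pages succeeded, {meta['failed']} failed")
for page in meta["successful_pages"][:5]:
    print(f"  {page['filename']}  <-  {page['url']}")
```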
Make sure you've activated the virtual environment:

```bash
source venv/bin/activate
```

If you see a "permission denied" error, make the scripts executable:

```bash
chmod +x scraper.py cleaner.py
```

If fewer pages were scraped than expected, try increasing the crawl depth or removing URL filters.
If the crawl is picking up too many pages, use URL filtering to limit it to specific sections, or reduce the crawl depth.
Some websites have many pages. The scraper adds delays between requests (0.5-1s) to be polite. This is intentional!
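A jittered delay in the 0.5-1s range mentioned above is only a couple of lines; a sketch of what such a pause typically looks like:

```python
import random
import time


def polite_pause(min_seconds: float = 0.5, max_seconds: float = 1.0) -> None:
    """Sleep for a random interval between requests to avoid hammering the server."""
    time.sleep(random.uniform(min_seconds, max_seconds))
```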
Feel free to modify these scripts for your specific needs! Some ideas:
- Add sitemap.xml parsing for faster discovery
- Respect robots.txt (see the sketch below)
- Add concurrent requests (with rate limiting)
- Custom cleaning rules for specific websites
- Export to different formats (JSON, TXT, etc.)
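As an example of the robots.txt idea, the standard library's urllib.robotparser needs no extra dependency; a minimal sketch (the user-agent string and integration points are assumptions, since the current scripts do not do this):

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "UniversalAIScraper"  # hypothetical user-agent string


def build_robot_parser(base_url: str) -> RobotFileParser:
    """Fetch and parse robots.txt for the site being crawled."""
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()
    return parser


# Usage inside the crawl loop (sketch):
# robots = build_robot_parser("https://example.com")
# if robots.can_fetch(USER_AGENT, url):
#     ...fetch the page...
```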
These scripts are provided as-is for educational and personal use.
Built with:
- requests - HTTP library
- BeautifulSoup4 - HTML parsing
- html2text - HTML to Markdown conversion
- tqdm - Progress bars
Happy Scraping!