A Python web scraper designed to extract all content from BetterDocs-powered documentation websites. Automatically discovers categories and articles, then exports data in JSON, Markdown, and CSV formats.
- **Automatic Discovery** - Automatically finds all categories and articles
- **Progress Tracking** - Real-time progress bars for scraping status
- **Error Handling** - Retry logic with exponential backoff
- **Rate Limiting** - Respectful delays between requests
- **Multiple Export Formats** - JSON, Markdown, and CSV outputs
- **Clean Data** - Extracts structured content without HTML clutter
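The retry-with-exponential-backoff behaviour can be sketched roughly like this (a simplified, standard-library illustration, not the project's exact implementation; `fetch_with_retry` and `backoff_delay` are hypothetical helper names):

```python
import time
import urllib.error
import urllib.request


def backoff_delay(attempt, base=1.0):
    """Delay before retry `attempt`: 1s, 2s, 4s, ... (exponential backoff)."""
    return base * (2 ** attempt)


def fetch_with_retry(url, retries=3, base_delay=1.0):
    """Fetch a URL, retrying with exponential backoff on network errors."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(backoff_delay(attempt, base_delay))
```

The rate limiting mentioned above is simpler still: a fixed `time.sleep()` between requests so consecutive fetches never hammer the server.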
- Python 3.7 or higher
- Internet connection
- Clone this repository:

```bash
git clone https://github.com/yourusername/knowledge-base-scraper.git
cd knowledge-base-scraper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure your target site by editing `scraper.py`:

```python
# Open scraper.py and change the base_url in the main() function:
def main():
    scraper = KnowledgeBaseScraper(base_url="https://your-docs-site.com")
    scraper.scrape_all()
    scraper.export_all()
```

- Run the scraper:

```bash
python scraper.py
```

The scraper will:
- Discover all categories from the main docs page
- Extract article URLs from each category
- Scrape content from each article
- Export data in multiple formats to the `output/` folder
Example run:

```
Starting scrape of https://your-docs-site.com/docs/
Discovering categories...
Found 13 categories
Processing categories: 100%|████████████| 13/13
Category Name: 100%|████████████| 6/6
Scraping complete!
   Categories: 13
   Total Articles: 39
Exporting data...
JSON exported to: output\knowledge_base.json
Markdown files exported to: output\markdown
CSV exported to: output\knowledge_base.csv
All exports complete!
Done! Check the 'output' folder for results.
```
Structured JSON containing all categories and articles (URLs excluded by default):

```json
{
  "categories": [
    {
      "name": "Category Name",
      "articles": [
        {
          "title": "Article Title",
          "parent_category": "Category Name",
          "content": "Full article content..."
        }
      ]
    }
  ]
}
```

Note: URLs are excluded by default. To include them, use:
```python
scraper.export_all(include_urls=True)
```

Individual Markdown files organized by category:
```
output/markdown/
├── policy/
│   ├── student-ambassador.md
│   └── device-policy.md
├── extensions/
│   └── ...
└── ...
```
Each markdown file contains:
- Article title
- Parent category name
- Full article content
Note: URLs are excluded by default. To include them in Markdown files, use `include_urls=True`.
Tabular format with columns (URLs excluded by default):
- Parent Category
- Title
- Content
With URLs enabled (`include_urls=True`):
- Parent Category
- Title
- URL
- Content
Perfect for importing into spreadsheets or databases.
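As a sketch of the database route, the CSV can be loaded into SQLite with the standard library (assuming the three default column headers above; `csv_to_sqlite` is a hypothetical helper, not part of the scraper):

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path="knowledge_base.db"):
    """Load the exported CSV into a SQLite table for querying."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(parent_category TEXT, title TEXT, content TEXT)"
    )
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = [
            (r["Parent Category"], r["Title"], r["Content"])
            for r in reader
        ]
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn
```

From there, queries like `SELECT title FROM articles WHERE parent_category = 'policy'` work as usual.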
You can customize the scraper by editing `scraper.py`:

```python
# Change the base URL
scraper = KnowledgeBaseScraper(base_url="https://your-site.com")

# Adjust rate limiting (in seconds)
time.sleep(1)    # Between categories
time.sleep(0.5)  # Between articles

# Change output directory
scraper.export_all(output_dir="custom_output")

# Include URLs in exports (excluded by default)
scraper.export_all(include_urls=True)

# Modify retry attempts
html = self.get_page(url, retries=5)
```

Project structure:

```
knowledge-base-scraper/
├── scraper.py           # Main scraper implementation
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output (created on first run)
    ├── knowledge_base.json
    ├── knowledge_base.csv
    └── markdown/
```
1. **Category Discovery**: Scrapes the main docs page to find all category links
2. **Article Extraction**: For each category, extracts all article URLs
3. **Content Scraping**: Visits each article page and extracts:
   - Title
   - Full text content
   - Parent category information
   - URL (optional)
4. **Export**: Saves data in JSON, Markdown, and CSV formats (URLs excluded by default)
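In skeleton form, the pipeline above fits together roughly like this (method and attribute names follow those used elsewhere in this README; the bodies are elided, so this is an outline, not the actual implementation):

```python
class KnowledgeBaseScraper:
    """Condensed outline of the scraping pipeline (bodies elided)."""

    def __init__(self, base_url):
        self.base_url = base_url
        # BetterDocs sites typically serve docs under /docs/
        self.docs_url = f"{base_url}/docs/"
        self.categories = []

    def scrape_all(self):
        # Steps 1-3: discover categories, extract article URLs,
        # then scrape each article's title, content, and parent category
        ...

    def export_all(self, output_dir="output", include_urls=False):
        # Step 4: write JSON, Markdown, and CSV into output_dir
        ...
```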
Open `scraper.py` and modify the `main()` function:

```python
def main():
    # Change this to your documentation site
    scraper = KnowledgeBaseScraper(base_url="https://your-docs-site.com")
    scraper.scrape_all()
    scraper.export_all()
```

Run the scraper to see if it works with your site:

```bash
python scraper.py
```

If the scraper doesn't work perfectly, you may need to adjust:
**URL Patterns** - If your site uses custom URLs:

```python
# In the extract_categories() method, change:
category_links = soup.find_all('a', href=lambda x: x and '/your-custom-path/' in x)
```

**CSS Selectors** - If your site uses custom HTML classes:

```python
# In the extract_article_content() method, add your custom class:
content_div = (
    soup.find('div', class_='your-custom-content-class') or
    soup.find('div', class_='betterdocs-content') or
    soup.find('article')
)
```

**Docs Path** - If your docs aren't at `/docs/`:

```python
# In the __init__ method, change:
self.docs_url = f"{base_url}/documentation/"  # or your custom path
```

See `ADAPTATION_GUIDE.md` for detailed instructions on:
- Testing compatibility with your site
- Common issues and solutions
- Advanced customization options
- Step-by-step debugging process
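A minimal compatibility probe along those lines: fetch your docs index page, then count how many anchors match the category URL pattern. `count_category_links` is a hypothetical helper, and `/docs-category/` is a guess at the BetterDocs category path; substitute whatever your site uses.

```python
from bs4 import BeautifulSoup


def count_category_links(html, path_fragment="/docs-category/"):
    """Count anchors whose href contains the category path fragment.

    A quick sanity check: if this returns 0 for your docs index page,
    the URL pattern in extract_categories() needs adjusting.
    """
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a", href=lambda h: h and path_fragment in h))
```

Pass in the HTML of your docs index page; a result of 0 means the `path_fragment` guess is wrong for your site and is usually the first thing to fix.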
- **Scraping Speed**: Varies by site size (e.g., ~1-2 minutes for 39 articles)
- **Memory Usage**: Minimal - processes articles sequentially
- **Output Size**: Depends on content volume
  - JSON: Compact without URLs
  - CSV: Lightweight tabular format
  - Markdown: Individual files per article
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This scraper is intended for personal use and educational purposes. Always respect the website's robots.txt and terms of service. The scraper includes rate limiting to be respectful to the server.
If you find this tool useful, consider buying me a coffee!
Your support helps me create more open-source tools!
- Built with BeautifulSoup for HTML parsing
- Progress bars powered by tqdm
- Designed for BetterDocs documentation sites
For questions or suggestions, please open an issue on GitHub.
Made with ❤️ for documentation enthusiasts