gunjanjaswal/BetterDocs-Scraper
Knowledge Base Web Scraper

A Python web scraper designed to extract all content from BetterDocs-powered documentation websites. Automatically discovers categories and articles, then exports data in JSON, Markdown, and CSV formats.

🚀 Features

  • 🔍 Automatic Discovery - Automatically finds all categories and articles
  • 📊 Progress Tracking - Real-time progress bars for scraping status
  • 🛡️ Error Handling - Retry logic with exponential backoff
  • ⏱️ Rate Limiting - Respectful delays between requests
  • 📦 Multiple Export Formats - JSON, Markdown, and CSV outputs
  • 🎯 Clean Data - Extracts structured content without HTML clutter
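The retry-with-backoff behavior can be sketched as follows. This is a minimal stdlib-only illustration, not the scraper's actual get_page implementation; the fetch callable is a stand-in for a real HTTP request:

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=0.5):
    """Call fetch(url), retrying on failure with exponentially growing delays.

    fetch is any callable that returns page HTML or raises on error.
    Waits base_delay, then 2x, then 4x that between attempts.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

On failure the delay doubles each attempt (0.5 s, 1 s, 2 s, ...), which spreads retries out instead of hammering a struggling server.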

📋 Requirements

  • Python 3.7 or higher
  • Internet connection

🔧 Installation

  1. Clone this repository:
git clone https://github.com/yourusername/knowledge-base-scraper.git
cd knowledge-base-scraper
  2. Install dependencies:
pip install -r requirements.txt

💻 Usage

Quick Start

  1. Configure your target site by editing scraper.py:
# Open scraper.py and change the base_url in the main() function:
def main():
    scraper = KnowledgeBaseScraper(base_url="https://your-docs-site.com")
    scraper.scrape_all()
    scraper.export_all()
  2. Run the scraper:
python scraper.py

The scraper will:

  1. πŸ” Discover all categories from the main docs page
  2. πŸ“„ Extract article URLs from each category
  3. πŸ’Ύ Scrape content from each article
  4. πŸ“¦ Export data in multiple formats to the output/ folder
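The per-article loop with rate limiting can be sketched like this (illustrative only; the function name and delay value are assumptions mirroring the options shown later in Configuration):

```python
import time

def scrape_articles(urls, fetch, delay=0.5):
    """Fetch each article URL in order, pausing between requests."""
    pages = []
    for i, url in enumerate(urls):
        pages.append(fetch(url))
        if i < len(urls) - 1:
            time.sleep(delay)  # respectful pause between requests
    return pages
```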

Example Output

🚀 Starting scrape of https://your-docs-site.com/docs/

🔍 Discovering categories...
✅ Found 13 categories

📚 Processing categories: 100%|████████████| 13/13
  📄 Category Name: 100%|████████████| 6/6

✅ Scraping complete!
   Categories: 13
   Total Articles: 39

📦 Exporting data...

💾 JSON exported to: output\knowledge_base.json
💾 Markdown files exported to: output\markdown
💾 CSV exported to: output\knowledge_base.csv

✅ All exports complete!

🎉 Done! Check the 'output' folder for results.

πŸ“ Output Formats

JSON (output/knowledge_base.json)

Structured JSON containing all categories and articles (URLs excluded by default):

{
  "categories": [
    {
      "name": "Category Name",
      "articles": [
        {
          "title": "Article Title",
          "parent_category": "Category Name",
          "content": "Full article content..."
        }
      ]
    }
  ]
}

Note: URLs are excluded by default. To include them, use:

scraper.export_all(include_urls=True)
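Once exported, the JSON file can be consumed with the standard library alone. A small sketch, assuming the structure shown above:

```python
import json

def count_articles(path="output/knowledge_base.json"):
    """Return a mapping of category name -> number of articles."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {cat["name"]: len(cat["articles"]) for cat in data["categories"]}
```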

Markdown (output/markdown/)

Individual markdown files organized by category:

output/markdown/
├── policy/
│   ├── student-ambassador.md
│   └── device-policy.md
├── extensions/
│   └── ...
└── ...

Each markdown file contains:

  • Article title
  • Parent category name
  • Full article content

Note: URLs are excluded by default. To include them in markdown files, use include_urls=True.

CSV (output/knowledge_base.csv)

Tabular format with columns (URLs excluded by default):

  • Parent Category
  • Title
  • Content

With URLs enabled (include_urls=True):

  • Parent Category
  • Title
  • URL
  • Content

Perfect for importing into spreadsheets or databases.
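As a sketch, the CSV export can be read back into Python dictionaries with the standard library (assuming the default columns without URLs):

```python
import csv

def load_rows(path="output/knowledge_base.csv"):
    """Read the CSV export into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```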

βš™οΈ Configuration

You can customize the scraper by editing scraper.py:

# Change the base URL
scraper = KnowledgeBaseScraper(base_url="https://your-site.com")

# Adjust rate limiting (in seconds)
time.sleep(1)  # Between categories
time.sleep(0.5)  # Between articles

# Change output directory
scraper.export_all(output_dir="custom_output")

# Include URLs in exports (excluded by default)
scraper.export_all(include_urls=True)

# Modify retry attempts
html = self.get_page(url, retries=5)

πŸ—οΈ Project Structure

knowledge-base-scraper/
β”œβ”€β”€ scraper.py          # Main scraper implementation
β”œβ”€β”€ requirements.txt    # Python dependencies
β”œβ”€β”€ README.md          # This file
└── output/            # Generated output (created on first run)
    β”œβ”€β”€ knowledge_base.json
    β”œβ”€β”€ knowledge_base.csv
    └── markdown/

πŸ” How It Works

  1. Category Discovery: Scrapes the main docs page to find all category links
  2. Article Extraction: For each category, extracts all article URLs
  3. Content Scraping: Visits each article page and extracts:
    • Title
    • Full text content
    • Parent category information
    • URL (optional)
  4. Export: Saves data in JSON, Markdown, and CSV formats (URLs excluded by default)
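Step 1 can be sketched with the standard library's html.parser (the actual scraper presumably uses BeautifulSoup instead, and the '/docs/' href pattern here is an assumption to adapt for your site):

```python
from html.parser import HTMLParser

class CategoryLinkParser(HTMLParser):
    """Collect hrefs of <a> tags whose URL contains the docs path."""

    def __init__(self, pattern="/docs/"):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href and self.pattern in href:
                self.links.append(href)

def discover_categories(html, pattern="/docs/"):
    parser = CategoryLinkParser(pattern)
    parser.feed(html)
    return parser.links
```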

🔧 Using with Your BetterDocs Site

Step 1: Update the Base URL

Open scraper.py and modify the main() function:

def main():
    # Change this to your documentation site
    scraper = KnowledgeBaseScraper(base_url="https://your-docs-site.com")
    scraper.scrape_all()
    scraper.export_all()

Step 2: Test the Scraper

Run the scraper to see if it works with your site:

python scraper.py

Step 3: Adjust if Needed

If the scraper doesn't work perfectly, you may need to adjust:

URL Patterns - If your site uses custom URLs:

# In extract_categories() method, change:
category_links = soup.find_all('a', href=lambda x: x and '/your-custom-path/' in x)

CSS Selectors - If your site uses custom HTML classes:

# In extract_article_content() method, add your custom class:
content_div = (
    soup.find('div', class_='your-custom-content-class') or
    soup.find('div', class_='betterdocs-content') or
    soup.find('article')
)

Docs Path - If your docs aren't at /docs/:

# In __init__ method, change:
self.docs_url = f"{base_url}/documentation/"  # or your custom path

Need More Help?

See the ADAPTATION_GUIDE.md for detailed instructions on:

  • Testing compatibility with your site
  • Common issues and solutions
  • Advanced customization options
  • Step-by-step debugging process

📊 Performance

  • Scraping Speed: Varies by site size (e.g., ~1-2 minutes for 39 articles)
  • Memory Usage: Minimal - processes articles sequentially
  • Output Size: Depends on content volume
    • JSON: Compact without URLs
    • CSV: Lightweight tabular format
    • Markdown: Individual files per article

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This scraper is intended for personal use and educational purposes. Always respect the website's robots.txt and terms of service. The scraper includes rate limiting to be respectful to the server.

☕ Support

If you find this tool useful, consider buying me a coffee!

Buy Me A Coffee

Your support helps me create more open-source tools! 🙏

πŸ™ Acknowledgments

📧 Contact

For questions or suggestions, please open an issue on GitHub.


Made with ❀️ for documentation enthusiasts
