🕸️ raw-html-scraper

raw-html-scraper is a universal Playwright-based web rendering utility that captures the fully rendered HTML of any webpage — including JavaScript-hydrated, React/Next.js, or dynamically loaded content.
It was designed to work as both a general-purpose HTML collector and a foundational tool for web scraping, data extraction, and archival.

This script loads a webpage just like a real browser, scrolls through content, expands visible sections, and saves the final DOM snapshot — ensuring you capture what the user actually sees rather than the incomplete “View Page Source” version.


⚙️ Features

  • Universal HTML Capture — Works with static and dynamic (JS-heavy) websites
  • Auto-scroll Engine — Triggers lazy-loading and infinite-scroll sections (see the sketch after this list)
  • Expand Button Logic — Automatically clicks common “Read more” or “Show more” elements
  • Boilerplate Browser Profile — Persistent, sandboxed Chromium profile for authentication and anti-bot bypass
  • Timestamped Outputs — Each run creates unique files named by date (MM-DD-YYYY)
  • Iframe Capture — Optionally saves content from all embedded iframes (see below)
  • Screenshot Capture — Automatically saves a full-page PNG image of the rendered page
  • Analytics Blocking — Skips analytics and telemetry requests to improve speed
  • Cross-platform — macOS, Linux, and Windows compatible
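
For reference, here is a minimal sketch of how a scroll-and-expand pass can work with Playwright's sync API. The selector list, round count, and settle time are illustrative assumptions, not the script's actual values.

# Hypothetical sketch of the auto-scroll + expand loop (sync Playwright).
# EXPAND_SELECTORS is an assumed name; adjust it to the buttons your target uses.
EXPAND_SELECTORS = ["text=Read more", "text=Show more"]

def scroll_and_expand(page, max_rounds=10, settle_ms=1000):
    """Scroll to the bottom repeatedly, clicking expand buttons along the way."""
    last_height = 0
    for _ in range(max_rounds):
        # Click any visible expand buttons before scrolling further.
        for selector in EXPAND_SELECTORS:
            for button in page.locator(selector).all():
                try:
                    if button.is_visible():
                        button.click(timeout=1000)
                except Exception:
                    pass  # buttons can detach mid-scroll; skip and move on
        # Jump to the bottom, then give lazy-loaded content time to settle.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(settle_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # height stopped growing; nothing more to load
        last_height = new_height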

🧰 Requirements

  • Python 3.9+
  • Playwright (installed with browsers):
    pip install playwright
    playwright install chromium

🚀 Usage

From the project directory (raw-html-scraper/):

python script.py "<URL>" [prefix]

Example

python script.py "https://www.fiverr.com/mdugan8186/build-a-custom-python-web-scraper" fiverr

This creates:

fiverr_10-30-2025.html
fiverr_10-30-2025.png
fiverr_10-30-2025_frame1.html
fiverr_10-30-2025_frame2.html
...

Each HTML file contains the fully rendered DOM, not the raw server source.
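
Internally, the rendered HTML comes from serializing the live DOM rather than re-fetching the URL. A minimal sketch of that step, reusing the helper and variable names shown in the screenshot section below (which may differ slightly from the actual script):

html_name = get_unique_filename(f"{safe_prefix}_{timestamp}", "html")
with open(html_name, "w", encoding="utf-8") as f:
    f.write(page.content())  # serializes the live DOM after scrolling and JS
print(f"[+] Saved rendered HTML -> {html_name}")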


🖼️ Screenshot Capture

A full-page PNG screenshot is automatically saved during each run.
The screenshot shows the page exactly as it was rendered in the browser after scrolling and JavaScript execution.

Screenshot Code Block

# ===== Screenshot section =====
# To enable screenshots, leave this code as-is.
# To disable screenshots, simply comment out the lines below.

png_name = get_unique_filename(f"{safe_prefix}_{timestamp}", "png")
page.screenshot(path=png_name, full_page=True)
print(f"[+] Saved screenshot -> {png_name}")

Notes

  • full_page=True captures the full scrollable height.
  • To capture only the visible viewport, remove full_page=True.
  • Files are timestamped for each run and saved in the same directory as the HTML.

🗂️ Output Structure

raw-html-scraper/
├── script.py
├── boilerplate_profile/      # Isolated browser data (cookies, cache, sessions)
├── fiverr_10-30-2025.html    # Main rendered HTML
├── fiverr_10-30-2025.png     # Screenshot (full page)
├── fiverr_10-30-2025_frame1.html
├── fiverr_10-30-2025_frame2.html
└── ...

🧱 Boilerplate Profile

A persistent browser profile is automatically created the first time you run the script.

Location (default):

raw-html-scraper/boilerplate_profile/

Purpose:

  • Stores cookies, cache, and local storage between runs.
  • Keeps you logged into sites without re-entering credentials.
  • Helps bypass bot detection by maintaining a consistent browser identity.
  • Fully isolated from your personal Chrome profile.

Reset anytime (on Windows, simply delete the folder):

rm -rf boilerplate_profile

The folder will be recreated automatically on the next run.
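
Under the hood this is Playwright's persistent-context mode. A minimal sketch, assuming the sync API; the profile directory matches the default above, and the URL is a placeholder:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        "boilerplate_profile",   # cookies, cache, and sessions live here
        headless=False,          # a visible window makes manual logins easy
    )
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://example.com")  # placeholder URL
    # ... scroll, capture HTML, take screenshot ...
    context.close()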


🪟 Iframe Capture (Advanced)

Modern sites often embed sub-pages inside iframes — mini web documents loaded separately from the main page.
These can contain analytics trackers, widgets, or valuable data (videos, reviews, product info, etc.).

The script can handle iframes in two ways.


🧩 A. Capture All Iframes (Full Mode)

This version saves every iframe regardless of size or content.
It’s ideal for full site snapshots, auditing, or forensic capture.

def save_iframes(context, prefix, timestamp):
    """Save HTML content of all child frames (iframes), skipping the main frame."""
    count = 0
    for page in context.pages:
        frames = page.frames
        for i, frame in enumerate(frames[1:], start=1):
            try:
                html = frame.content()
                if html and len(html.strip()) > 0:
                    frame_file = get_unique_filename(f"{prefix}_{timestamp}_frame{i}", "html")
                    with open(frame_file, "w", encoding="utf-8") as f:
                        f.write(html)
                    print(f"[+] Saved iframe -> {frame_file}")
                    count += 1
            except Exception:
                continue
    return count

Pros

  • Captures all embedded frames (complete record).
  • Useful for debugging or investigating hidden elements.

Cons

  • May include analytics and ad iframes (clutter).
  • Slightly slower.

⚙️ B. Capture Only Meaningful Iframes (Filtered Mode)

This mode skips tiny or irrelevant frames (trackers, analytics beacons) by checking the length of each frame's rendered HTML.

def save_iframes(context, prefix, timestamp, min_length=5000):
    """Save only meaningful iframe HTMLs. Skips small analytics/tracking frames (< ~5,000 characters)."""
    count = 0
    for page in context.pages:
        frames = page.frames
        for i, frame in enumerate(frames[1:], start=1):
            try:
                html = frame.content()
                if not html or len(html.strip()) < min_length:
                    continue
                frame_file = get_unique_filename(f"{prefix}_{timestamp}_frame{i}", "html")
                with open(frame_file, "w", encoding="utf-8") as f:
                    f.write(html)
                print(f"[+] Saved iframe -> {frame_file} ({len(html)} chars)")
                count += 1
            except Exception:
                continue
    if count == 0:
        print("[+] No meaningful iframes captured (all skipped or empty).")
    return count

Pros

  • Keeps useful frames while filtering out noise.
  • Produces smaller, cleaner output.

Cons

  • Might miss legitimate small embeds (<5,000 chars).

🧠 Tips for Best Results

  1. Scroll Duration — Increase POST_SCROLL_SETTLE_MS (e.g., 2500–3500 ms) for long or AJAX-heavy pages.
  2. Expand Buttons — Add selectors for “Show more” or “Load all” buttons.
  3. Wait Selectors — Use READY_SELECTORS for complex JS pages (these settings are sketched after this list).
  4. Disable Analytics Noise — The script blocks tracking calls automatically.
  5. Screenshot Control — Comment out the screenshot lines to disable it.
  6. Profile Isolation — Keep the boilerplate_profile separate from your personal data.
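
A hypothetical configuration block tying these tips together. POST_SCROLL_SETTLE_MS and READY_SELECTORS are the names mentioned above; EXPAND_SELECTORS and all values here are illustrative, not the script's defaults:

POST_SCROLL_SETTLE_MS = 3000      # tip 1: longer settle for AJAX-heavy pages
EXPAND_SELECTORS = [              # tip 2: expand-button selectors (assumed name)
    "text=Show more",
    "text=Load all",
]
READY_SELECTORS = [               # tip 3: wait for these before capturing
    "main",
    "[data-testid=reviews]",
]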

📜 License

MIT License © 2025
Developed and maintained by Mike Dugan


👤 About

Mike Dugan — Python Web Scraper & Automation Developer
