🕸️ raw-html-scraper

raw-html-scraper is a universal Playwright-based web rendering utility that captures the fully rendered HTML of any webpage — including JavaScript-hydrated, React/Next.js, or dynamically loaded content.
It was designed to work as both a general-purpose HTML collector and a foundational tool for web scraping, data extraction, and archival.

This script loads a webpage just like a real browser, scrolls through content, expands visible sections, and saves the final DOM snapshot — ensuring you capture what the user actually sees rather than the incomplete “View Page Source” version.


⚙️ Features

  • Universal HTML Capture — Works with static and dynamic (JS-heavy) websites
  • Auto-scroll Engine — Triggers lazy-loading and infinite-scroll sections (see the sketch after this list)
  • Expand Button Logic — Automatically clicks common “Read more” or “Show more” elements
  • Boilerplate Browser Profile — Persistent, sandboxed Chromium profile for authentication and anti-bot bypass
  • Timestamped Outputs — Each run creates unique files named by date (MM-DD-YYYY)
  • Iframe Capture — Optionally saves content from all embedded iframes (see below)
  • Screenshot Capture — Automatically saves a full-page PNG image of the rendered page
  • Analytics Blocking — Skips analytics and telemetry requests to improve speed
  • Cross-platform — macOS, Linux, and Windows compatible
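
For reference, here is a minimal sketch of how a scroll-and-expand pass can work with Playwright's sync API. The selector list, round count, and settle time are illustrative assumptions, not the script's actual values.

# Hypothetical sketch of the auto-scroll + expand loop (sync Playwright).
# EXPAND_SELECTORS is an assumed name; adjust it to the buttons your target uses.
EXPAND_SELECTORS = ["text=Read more", "text=Show more"]

def scroll_and_expand(page, max_rounds=10, settle_ms=1000):
    """Scroll to the bottom repeatedly, clicking expand buttons along the way."""
    last_height = 0
    for _ in range(max_rounds):
        # Click any visible expand buttons before scrolling further.
        for selector in EXPAND_SELECTORS:
            for button in page.locator(selector).all():
                try:
                    if button.is_visible():
                        button.click(timeout=1000)
                except Exception:
                    pass  # buttons can detach mid-scroll; skip and move on
        # Jump to the bottom, then give lazy-loaded content time to settle.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(settle_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # height stopped growing; nothing more to load
        last_height = new_height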

🧰 Requirements

  • Python 3.9+
  • Playwright (installed with browsers):
    pip install playwright
    playwright install chromium

🚀 Usage

From the project directory (raw-html-scraper/):

python script.py "<URL>" [prefix]

Example

python script.py "https://www.fiverr.com/mdugan8186/build-a-custom-python-web-scraper" fiverr

This creates:

fiverr_10-30-2025.html
fiverr_10-30-2025.png
fiverr_10-30-2025_frame1.html
fiverr_10-30-2025_frame2.html
...

Each HTML file contains the fully rendered DOM, not the raw server source.
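
Internally, the rendered HTML comes from serializing the live DOM rather than re-fetching the URL. A minimal sketch of that step, reusing the helper and variable names shown in the screenshot section below (which may differ slightly from the actual script):

html_name = get_unique_filename(f"{safe_prefix}_{timestamp}", "html")
with open(html_name, "w", encoding="utf-8") as f:
    f.write(page.content())  # serializes the live DOM after scrolling and JS
print(f"[+] Saved rendered HTML -> {html_name}")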


🖼️ Screenshot Capture

A full-page PNG screenshot is automatically saved during each run.
The screenshot shows the page exactly as it was rendered in the browser after scrolling and JavaScript execution.

Screenshot Code Block

# ===== Screenshot section =====
# To enable screenshots, leave this code as-is.
# To disable screenshots, simply comment out the lines below.

png_name = get_unique_filename(f"{safe_prefix}_{timestamp}", "png")
page.screenshot(path=png_name, full_page=True)
print(f"[+] Saved screenshot -> {png_name}")

Notes

  • full_page=True captures the full scrollable height.
  • To capture only the visible viewport, remove full_page=True.
  • Files are timestamped for each run and saved in the same directory as the HTML.

🗂️ Output Structure

raw-html-scraper/
├── script.py
├── boilerplate_profile/      # Isolated browser data (cookies, cache, sessions)
├── fiverr_10-30-2025.html    # Main rendered HTML
├── fiverr_10-30-2025.png     # Screenshot (full page)
├── fiverr_10-30-2025_frame1.html
├── fiverr_10-30-2025_frame2.html
└── ...

🧱 Boilerplate Profile

A persistent browser profile is automatically created the first time you run the script.

Location (default):

raw-html-scraper/boilerplate_profile/

Purpose:

  • Stores cookies, cache, and local storage between runs.
  • Keeps you logged into sites without re-entering credentials.
  • Helps bypass bot detection by maintaining a consistent browser identity.
  • Fully isolated from your personal Chrome profile.

Reset anytime (on Windows, simply delete the folder):

rm -rf boilerplate_profile

The folder will be recreated automatically on the next run.
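
Under the hood this is Playwright's persistent-context mode. A minimal sketch, assuming the sync API; the profile directory matches the default above, and the URL is a placeholder:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        "boilerplate_profile",   # cookies, cache, and sessions live here
        headless=False,          # a visible window makes manual logins easy
    )
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://example.com")  # placeholder URL
    # ... scroll, capture HTML, take screenshot ...
    context.close()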


🪟 Iframe Capture (Advanced)

Modern sites often embed sub-pages inside iframes — mini web documents loaded separately from the main page.
These can contain analytics trackers, widgets, or valuable data (videos, reviews, product info, etc.).

The script can handle iframes in two ways.


🧩 A. Capture All Iframes (Full Mode)

This version saves every iframe regardless of size or content.
It’s ideal for full site snapshots, auditing, or forensic capture.

def save_iframes(context, prefix, timestamp):
    """Save HTML content of all child frames (iframes), skipping the main frame."""
    count = 0
    for page in context.pages:
        frames = page.frames
        for i, frame in enumerate(frames[1:], start=1):
            try:
                html = frame.content()
                if html and len(html.strip()) > 0:
                    frame_file = get_unique_filename(f"{prefix}_{timestamp}_frame{i}", "html")
                    with open(frame_file, "w", encoding="utf-8") as f:
                        f.write(html)
                    print(f"[+] Saved iframe -> {frame_file}")
                    count += 1
            except Exception:
                continue
    return count

Pros

  • Captures all embedded frames (complete record).
  • Useful for debugging or investigating hidden elements.

Cons

  • May include analytics and ad iframes (clutter).
  • Slightly slower.

⚙️ B. Capture Only Meaningful Iframes (Filtered Mode)

This mode skips tiny or irrelevant frames (trackers, analytics beacons) by checking the length of each frame's rendered HTML.

def save_iframes(context, prefix, timestamp, min_length=5000):
    """Save only meaningful iframe HTMLs. Skips small analytics/tracking frames (< ~5,000 characters)."""
    count = 0
    for page in context.pages:
        frames = page.frames
        for i, frame in enumerate(frames[1:], start=1):
            try:
                html = frame.content()
                if not html or len(html.strip()) < min_length:
                    continue
                frame_file = get_unique_filename(f"{prefix}_{timestamp}_frame{i}", "html")
                with open(frame_file, "w", encoding="utf-8") as f:
                    f.write(html)
                print(f"[+] Saved iframe -> {frame_file} ({len(html)} chars)")
                count += 1
            except Exception:
                continue
    if count == 0:
        print("[+] No meaningful iframes captured (all skipped or empty).")
    return count

Pros

  • Keeps useful frames while filtering out noise.
  • Produces smaller, cleaner output.

Cons

  • Might miss legitimate small embeds (<5,000 chars).

🧠 Tips for Best Results

  1. Scroll Duration — Increase POST_SCROLL_SETTLE_MS (e.g., 2500–3500 ms) for long or AJAX-heavy pages.
  2. Expand Buttons — Add selectors for “Show more” or “Load all” buttons.
  3. Wait Selectors — Use READY_SELECTORS for complex JS pages (these settings are sketched after this list).
  4. Disable Analytics Noise — The script blocks tracking calls automatically.
  5. Screenshot Control — Comment out the screenshot lines to disable it.
  6. Profile Isolation — Keep the boilerplate_profile separate from your personal data.
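
A hypothetical configuration block tying these tips together. POST_SCROLL_SETTLE_MS and READY_SELECTORS are the names mentioned above; EXPAND_SELECTORS and all values here are illustrative, not the script's defaults:

POST_SCROLL_SETTLE_MS = 3000      # tip 1: longer settle for AJAX-heavy pages
EXPAND_SELECTORS = [              # tip 2: expand-button selectors (assumed name)
    "text=Show more",
    "text=Load all",
]
READY_SELECTORS = [               # tip 3: wait for these before capturing
    "main",
    "[data-testid=reviews]",
]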

📜 License

MIT License © 2025
Developed and maintained by Mike Dugan


👤 About

Mike Dugan — Python Web Scraper & Automation Developer
