raw-html-scraper is a universal Playwright-based web rendering utility that captures the fully rendered HTML of any webpage — including JavaScript-hydrated, React/Next.js, or dynamically loaded content.
It was designed to work as both a general-purpose HTML collector and a foundational tool for web scraping, data extraction, and archival.
This script loads a webpage just like a real browser, scrolls through content, expands visible sections, and saves the final DOM snapshot — ensuring you capture what the user actually sees rather than the incomplete “View Page Source” version.
- Universal HTML Capture — Works with static and dynamic (JS-heavy) websites
- Auto-scroll Engine — Triggers lazy-loading and infinite-scroll sections
- Expand Button Logic — Clicks common “Read more” or “Show more” elements automatically
- Boilerplate Browser Profile — Persistent, sandboxed Chrome profile for authentication and anti-bot bypass
- Timestamped Outputs — Each run creates unique files named by date (MM-DD-YYYY)
- Iframe Capture — Optionally saves content from all embedded iframes (see below)
- Screenshot Capture — Automatically saves a full-page PNG image of the rendered page
- Analytics Blocking — Skips analytics and telemetry requests to improve speed
- Cross-platform — macOS, Linux, and Windows compatible
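A minimal sketch of how the auto-scroll, expand-button, and capture steps fit together (the script's real internals aren't reproduced here; the URL, selectors, and scroll counts below are illustrative):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="domcontentloaded")

    # Auto-scroll: step down the page to trigger lazy-loaded content.
    for _ in range(20):
        page.mouse.wheel(0, 1000)
        page.wait_for_timeout(250)

    # Expand buttons: click common "Read more" / "Show more" toggles.
    for selector in ("text=Read more", "text=Show more"):
        for button in page.locator(selector).all():
            try:
                button.click(timeout=1000)
            except Exception:
                pass  # hidden or detached buttons are safe to skip

    # Save the fully rendered DOM rather than the raw server response.
    with open("snapshot.html", "w", encoding="utf-8") as f:
        f.write(page.content())

    browser.close()
```

The actual script layers tunables such as `POST_SCROLL_SETTLE_MS` and `READY_SELECTORS` on top of this flow (see the tips below).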
- Python 3.9+
- Playwright (installed with browsers):
```bash
pip install playwright
playwright install chromium
```
From the project directory (raw-html-scraper/):
```bash
python script.py "<URL>" [prefix]
```

Example:

```bash
python script.py "https://www.fiverr.com/mdugan8186/build-a-custom-python-web-scraper" fiverr
```

This creates:

```
fiverr_10-30-2025.html
fiverr_10-30-2025.png
fiverr_10-30-2025_frame1.html
fiverr_10-30-2025_frame2.html
...
```
Each file is fully rendered HTML, not the raw server source.
A full-page PNG screenshot is automatically saved during each run.
The screenshot shows the page exactly as it was rendered in the browser after scrolling and JavaScript execution.
```python
# ===== Screenshot section =====
# To enable screenshots, leave this code as-is.
# To disable screenshots, simply comment out the lines below.
png_name = get_unique_filename(f"{safe_prefix}_{timestamp}", "png")
page.screenshot(path=png_name, full_page=True)
print(f"[+] Saved screenshot -> {png_name}")
```

- `full_page=True` captures the full scrollable height.
- To capture only the visible viewport, remove `full_page=True`.
- Files are timestamped for each run and saved in the same directory as the HTML.
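For example, a viewport-only capture would be:

```python
# Without full_page=True, only the visible viewport is captured.
page.screenshot(path=png_name)
```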
```
raw-html-scraper/
├── script.py
├── boilerplate_profile/            # Isolated browser data (cookies, cache, sessions)
├── fiverr_10-30-2025.html          # Main rendered HTML
├── fiverr_10-30-2025.png           # Screenshot (full page)
├── fiverr_10-30-2025_frame1.html
├── fiverr_10-30-2025_frame2.html
└── ...
```
A persistent browser profile is automatically created the first time you run the script.
Location (default): `raw-html-scraper/boilerplate_profile/`
Purpose:
- Stores cookies, cache, and local storage between runs.
- Keeps you logged into sites without re-entering credentials.
- Helps bypass bot detection by maintaining a consistent browser identity.
- Fully isolated from your personal Chrome profile.
Reset anytime:
```bash
rm -rf boilerplate_profile
```

The folder will be recreated automatically on the next run.
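For context, Playwright supports this via a persistent browser context. A minimal sketch, assuming the default profile directory (the options shown are illustrative, not the script's exact settings):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch_persistent_context stores cookies, cache, and local storage
    # in the given directory instead of a throwaway temporary profile.
    context = p.chromium.launch_persistent_context(
        "boilerplate_profile",
        headless=False,  # a visible window helps with manual logins
    )
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://example.com")
    context.close()
```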
Modern sites often embed sub-pages inside iframes — mini web documents loaded separately from the main page.
These can contain analytics trackers, widgets, or valuable data (videos, reviews, product info, etc.).
The scraper can handle iframes in two different ways.
The first version saves every iframe, regardless of size or content.
It’s ideal for full site snapshots, auditing, or forensic capture.
```python
def save_iframes(context, prefix, timestamp):
    """Save HTML content of all child frames (iframes), skipping the main frame."""
    count = 0
    for page in context.pages:
        frames = page.frames
        for i, frame in enumerate(frames[1:], start=1):
            try:
                html = frame.content()
                if html and len(html.strip()) > 0:
                    frame_file = get_unique_filename(f"{prefix}_{timestamp}_frame{i}", "html")
                    with open(frame_file, "w", encoding="utf-8") as f:
                        f.write(html)
                    print(f"[+] Saved iframe -> {frame_file}")
                    count += 1
            except Exception:
                continue
    return count
```

**Pros**
- Captures all embedded frames (complete record).
- Useful for debugging or investigating hidden elements.
**Cons**
- May include analytics and ad iframes (clutter).
- Slightly slower.
The second version skips tiny or irrelevant frames (such as trackers or analytics) by checking the length of each frame's HTML.
```python
def save_iframes(context, prefix, timestamp, min_length=5000):
    """Save only meaningful iframe HTMLs. Skips small analytics/tracking frames (< ~5,000 characters)."""
    count = 0
    for page in context.pages:
        frames = page.frames
        for i, frame in enumerate(frames[1:], start=1):
            try:
                html = frame.content()
                if not html or len(html.strip()) < min_length:
                    continue
                frame_file = get_unique_filename(f"{prefix}_{timestamp}_frame{i}", "html")
                with open(frame_file, "w", encoding="utf-8") as f:
                    f.write(html)
                print(f"[+] Saved iframe -> {frame_file} ({len(html)} chars)")
                count += 1
            except Exception:
                continue
    if count == 0:
        print("[+] No meaningful iframes captured (all skipped or empty).")
    return count
```

**Pros**
- Keeps useful frames while filtering out noise.
- Produces smaller, cleaner output.
**Cons**
- Might miss legitimate small embeds (<5,000 chars).
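If a site uses small but meaningful embeds, pass a lower threshold when calling the function, e.g.:

```python
# Keep smaller embeds (e.g., compact review or video widgets).
save_iframes(context, prefix, timestamp, min_length=1000)
```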
- Scroll Duration — Increase `POST_SCROLL_SETTLE_MS` (e.g., 2500–3500) for long or AJAX-heavy pages (see the sketch after this list).
- Expand Buttons — Add selectors for “Show more” or “Load all” buttons.
- Wait Selectors — Use `READY_SELECTORS` for complex JS pages.
- Disable Analytics Noise — The script blocks tracking calls automatically.
- Screenshot Control — Comment out the screenshot lines to disable it.
- Profile Isolation — Keep the `boilerplate_profile` separate from your personal data.
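For reference, these tunables might sit near the top of script.py. A sketch with illustrative values (`EXPAND_SELECTORS` is a hypothetical name, since the script's actual variable for expand buttons isn't shown here):

```python
# Milliseconds to wait after scrolling so lazy-loaded content can settle.
POST_SCROLL_SETTLE_MS = 2500

# Selectors that must appear before capture on complex JS pages.
READY_SELECTORS = ["main", "#__next"]

# Hypothetical: selectors for "Show more" / "Load all" buttons to click.
EXPAND_SELECTORS = ["text=Show more", "text=Load all"]
```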
MIT License © 2025
Developed and maintained by Mike Dugan
Mike Dugan — Python Web Scraper & Automation Developer
- GitHub: @mdugan8186
- Portfolio Website: scraping-portfolio
- Fiverr: Hire me for web scraping and custom scrapers
- Email: mdugan8186.work@gmail.com