wp-static-scraper

A Go application that scrapes websites and creates fully self-contained static HTML copies with all assets (CSS, JavaScript, images, fonts) downloaded and localized for completely offline viewing.

Features

  • High-performance concurrent downloads: Optimized worker pool with HTTP connection pooling for 50%+ faster scraping
  • Complete website capture: Downloads and saves web pages as fully self-contained static HTML
  • CSS & JavaScript: Localizes all external stylesheets and scripts (including preload links)
  • Image processing: Downloads all images including responsive srcset variants, background images, and meta tag images
  • Advanced font discovery: Automatically downloads font files from CSS, including:
    • Protocol-relative URLs (//domain.com/font.woff2)
    • Inline CSS within <style> tags
    • FontAwesome and Google Fonts
    • All formats: TTF, WOFF, WOFF2, EOT, SVG
  • Smart asset detection: Finds both absolute URLs and relative paths in CSS files
  • True parallelism: All asset types (CSS, JS, images, fonts) download simultaneously
  • Error prevention:
    • Removes source map references to prevent browser errors
    • Suppresses development server connection errors
    • Handles security errors from service worker origin mismatches
  • Built-in HTTP server: Serves scraped content locally with proper routing
  • Auto-cleanup: Automatically removes old files before each scrape
  • Organized structure: Creates clean directory structure with separate folders for different asset types

Installation

go build -ldflags "-s -w" -o wp-static-scraper main.go

Usage

The application supports two main commands: scrape for downloading websites and serve for serving the scraped content.

Scraping Websites

# Basic scraping
./wp-static-scraper scrape -url "https://example.com"

# Specify output file
./wp-static-scraper scrape -url "https://example.com" -out "my-page.html"

# High-performance scraping with custom concurrency
./wp-static-scraper scrape -url "https://example.com" -concurrency 50

Serving Scraped Content

# Start server on default port (8080)
./wp-static-scraper serve

# Start server on custom port
./wp-static-scraper serve -port 3000
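
The built-in server is conceptually a static file server rooted at the output/ directory. A minimal Go sketch of the idea (illustrative only, not the project's actual code; the real serve command adds its own routing on top):

package main

import (
	"log"
	"net/http"
)

func main() {
	const addr = ":8080" // matches the serve command's default port
	// Serve everything under output/, including the assets/ subfolders.
	http.Handle("/", http.FileServer(http.Dir("output")))
	log.Printf("serving output/ on http://localhost%s", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}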

Command Line Options

Scrape command:

  • -url: (Required) The URL of the website to scrape
  • -out: (Optional) Output HTML file path (default: "index.html")
  • -concurrency: (Optional) Number of concurrent download workers, 1-100 (default: 100)

Serve command:

  • -port: (Optional) Port for HTTP server (default: 8080)

Output Structure

When you run the scraper, it creates an organized output/ directory:

  • output/index.html (or custom filename) with all references updated to local assets
  • output/assets/ directory containing downloaded CSS and JavaScript files
  • output/assets/fonts/ directory containing all downloaded font files (TTF, WOFF, WOFF2, EOT, SVG formats)
  • output/assets/images/ directory containing all downloaded images (PNG, JPG, GIF, WebP, SVG formats)

Example Workflow

# 1. Scrape a website
./wp-static-scraper scrape -url "https://example.com"

# 2. Start local server to view the scraped content
./wp-static-scraper serve

# 3. Open http://localhost:8080 in your browser

This creates the following structure:

output/
├── index.html
└── assets/
    ├── all.min.css
    ├── style.css
    ├── main.js
    ├── other-scripts.js
    ├── fonts/
    │   ├── fa-brands-400.woff2
    │   ├── fa-regular-400.woff2
    │   ├── fa-solid-900.woff2
    │   ├── Montserrat-Regular.woff2
    │   └── other-fonts...
    └── images/
        ├── logo.png
        ├── hero-image.jpg
        ├── banner-mobile.webp
        ├── icon.svg
        └── other-images...

Key Features

Comprehensive Asset Support

The scraper intelligently discovers and downloads all assets referenced in web pages:

Fonts:

  • FontAwesome: Automatically detects and downloads all FontAwesome font variants
  • Google Fonts: Downloads custom fonts from Google Fonts and other CDNs
  • Protocol-relative URLs: Handles //domain.com/font.woff2 references (normalization is sketched after this list)
  • Inline CSS: Processes fonts in <style> tags
  • Multiple formats: Supports TTF, WOFF, WOFF2, EOT, and SVG font formats
  • Relative paths: Handles both absolute URLs and relative paths (e.g., ../webfonts/)
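
Protocol-relative references inherit the scheme of the page they appear on, so they are normalized before download. A minimal sketch of that step (normalizeProtocolRelative is a hypothetical helper, not the project's actual function):

package main

import (
	"fmt"
	"strings"
)

// normalizeProtocolRelative prefixes a scheme-less reference such as
// "//cdn.example.com/font.woff2" with the scraped page's scheme so it
// can be fetched directly.
func normalizeProtocolRelative(ref, pageScheme string) string {
	if strings.HasPrefix(ref, "//") {
		return pageScheme + ":" + ref
	}
	return ref
}

func main() {
	fmt.Println(normalizeProtocolRelative("//cdn.example.com/fa-solid-900.woff2", "https"))
	// Output: https://cdn.example.com/fa-solid-900.woff2
}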

Images:

  • Responsive images: Processes srcset attributes with size descriptors (parsing is sketched after this list)
  • Background images: Extracts images from inline style attributes
  • Meta images: Downloads og:image, twitter:image, and other meta tag images
  • Lazy loading: Handles data-src attributes for deferred loading
  • All formats: PNG, JPG, GIF, WebP, SVG, and more
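
A srcset value is a comma-separated list of "url descriptor" pairs, and only the URLs need to be downloaded. A minimal sketch of extracting them (srcsetURLs is a hypothetical helper; it ignores edge cases such as commas inside URLs):

package main

import (
	"fmt"
	"strings"
)

// srcsetURLs extracts the URL from each comma-separated
// "url [descriptor]" entry in a srcset attribute value,
// dropping the width/density descriptors.
func srcsetURLs(srcset string) []string {
	var urls []string
	for _, entry := range strings.Split(srcset, ",") {
		fields := strings.Fields(entry)
		if len(fields) > 0 {
			urls = append(urls, fields[0])
		}
	}
	return urls
}

func main() {
	fmt.Println(srcsetURLs("hero-small.jpg 480w, hero-large.jpg 1200w"))
	// Output: [hero-small.jpg hero-large.jpg]
}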

Scripts & Styles:

  • Preload links: Properly handles <link rel="preload"> tags
  • Source maps: Removes sourceMappingURL references to prevent errors (see the sketch after this list)
  • Error suppression: Injects scripts to handle development server errors
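
Source map references are one-line comments at the end of JS and CSS files; removing them keeps the browser from requesting .map files that were never downloaded. A minimal sketch of that step (illustrative, not the project's actual code):

package main

import (
	"fmt"
	"regexp"
)

// Matches both JS ("//# sourceMappingURL=...") and CSS
// ("/*# sourceMappingURL=... */") source map comments.
var sourceMapRe = regexp.MustCompile(`(?m)^\s*(//|/\*)#\s*sourceMappingURL=.*$`)

// stripSourceMaps removes source map references from downloaded
// JS and CSS before the files are written to output/assets/.
func stripSourceMaps(src string) string {
	return sourceMapRe.ReplaceAllString(src, "")
}

func main() {
	fmt.Println(stripSourceMaps("console.log('hi');\n//# sourceMappingURL=main.js.map"))
}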

Smart Asset Detection

  • Parses HTML, CSS, and JavaScript to find all asset references
  • Resolves relative paths against the original website's base URL (see the sketch after this list)
  • Updates all references to use local paths for offline viewing
  • Handles complex CSS with nested imports and font-face declarations
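
Resolution works the same way a browser resolves a stylesheet's url(...) references: each reference is parsed and resolved against the URL of the file it appears in. A minimal sketch using the standard net/url package (resolveCSSRefs and the regular expression are illustrative, not the project's actual code):

package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// Matches url(...) references in CSS, with optional quotes.
var cssURLRe = regexp.MustCompile(`url\(\s*['"]?([^'")]+)['"]?\s*\)`)

// resolveCSSRefs finds url(...) references in a CSS snippet and
// resolves each against the stylesheet's own URL.
func resolveCSSRefs(css string, base *url.URL) []string {
	var out []string
	for _, m := range cssURLRe.FindAllStringSubmatch(css, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		out = append(out, base.ResolveReference(ref).String())
	}
	return out
}

func main() {
	base, _ := url.Parse("https://example.com/assets/all.min.css")
	css := `@font-face { src: url("../webfonts/fa-solid-900.woff2"); }`
	fmt.Println(resolveCSSRefs(css, base))
	// Output: [https://example.com/webfonts/fa-solid-900.woff2]
}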

Clean Workflow

  • Each scrape automatically removes previous assets and HTML files (see the sketch after this list)
  • Prevents mixing assets from different websites
  • Ensures fresh, clean results every time
  • Creates organized directory structure for easy navigation
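
A minimal sketch of the cleanup step, assuming the whole output/ tree is removed and the expected asset folders are recreated (cleanOutput is a hypothetical helper; the real tool may target individual files instead):

package main

import (
	"log"
	"os"
	"path/filepath"
)

// cleanOutput deletes the previous scrape's files so assets from
// different websites never mix, then recreates the directory layout.
func cleanOutput(dir string) error {
	if err := os.RemoveAll(dir); err != nil {
		return err
	}
	for _, sub := range []string{"fonts", "images"} {
		if err := os.MkdirAll(filepath.Join(dir, "assets", sub), 0o755); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := cleanOutput("output"); err != nil {
		log.Fatal(err)
	}
}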

Performance

The scraper uses an optimized concurrent download system for maximum performance:

  • True parallelism: All asset types download simultaneously (not in sequential phases)
  • HTTP connection pooling: Reuses connections for better network efficiency
  • Optimized worker pool: Simple job queue with atomic counters eliminates bottlenecks (see the sketch after this list)
  • Non-blocking retries: Failed downloads retry asynchronously without blocking workers
  • Upfront asset discovery: Finds all assets including fonts from inline CSS immediately
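
A minimal sketch of the pattern, assuming a single job channel shared by all asset types, a pooled http.Client, and an atomic progress counter (names and pool settings are illustrative, not the project's actual code):

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// A pooled client reuses TCP connections across downloads.
var client = &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	},
}

// downloadAll drains one job queue with n workers; CSS, JS, images,
// and fonts share the queue, so they download in parallel rather
// than in sequential phases.
func downloadAll(urls []string, n int) int64 {
	jobs := make(chan string)
	var wg sync.WaitGroup
	var done int64 // atomic counter, no shared lock

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				resp, err := client.Get(u)
				if err != nil {
					continue // the real tool retries asynchronously
				}
				io.Copy(io.Discard, resp.Body) // the real tool writes to disk
				resp.Body.Close()
				atomic.AddInt64(&done, 1)
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return atomic.LoadInt64(&done)
}

func main() {
	n := downloadAll([]string{"https://example.com/"}, 4)
	fmt.Printf("downloaded %d assets\n", n)
}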

Benchmark: 53% performance improvement (10s → 4.7s) on complex websites with 50 concurrent workers.

Architecture

The application is built with a modular package structure for maintainability and clarity:

  • main.go: Entry point with command routing
  • commands/: Command handlers for scrape, serve, and usage
  • assets/: High-performance asset downloading with concurrent worker pool
  • html/: HTML processing and error suppression utilities
  • utils/: Shared utilities for cleanup and URL resolution
  • output/: Generated directory containing scraped content

Requirements

  • Go 1.24.0 or later
  • Internet connection for downloading assets

License

This project is licensed under the MIT License - see the LICENSE file for details.
