A Go application that scrapes websites and creates fully self-contained static HTML copies with all assets (CSS, JavaScript, images, fonts) downloaded and localized for completely offline viewing.
- High-performance concurrent downloads: Optimized worker pool with HTTP connection pooling for 50%+ faster scraping
- Complete website capture: Downloads and saves web pages as fully self-contained static HTML
- CSS & JavaScript: Localizes all external stylesheets and scripts (including preload links)
- Image processing: Downloads all images including responsive
srcsetvariants, background images, and meta tag images - Advanced font discovery: Automatically downloads font files from CSS, including:
- Protocol-relative URLs (
//domain.com/font.woff2) - Inline CSS within
<style>tags - FontAwesome and Google Fonts
- All formats: TTF, WOFF, WOFF2, EOT, SVG
- Protocol-relative URLs (
- Smart asset detection: Finds both absolute URLs and relative paths in CSS files
- True parallelism: All asset types (CSS, JS, images, fonts) download simultaneously
- Error prevention:
- Removes source map references to prevent browser errors
- Suppresses development server connection errors
- Handles security errors from service worker origin mismatches
- Built-in HTTP server: Serves scraped content locally with proper routing
- Auto-cleanup: Automatically removes old files before each scrape
- Organized structure: Creates clean directory structure with separate folders for different asset types
go build -ldflags "-s -w" -o wp-static-scraper main.goThe application supports two main commands: scrape for downloading websites and serve for serving the scraped content.
# Basic scraping
./wp-static-scraper scrape -url "https://example.com"
# Specify output file
./wp-static-scraper scrape -url "https://example.com" -out "my-page.html"
# High-performance scraping with custom concurrency
./wp-static-scraper scrape -url "https://example.com" -concurrency 50# Start server on default port (8080)
./wp-static-scraper serve
# Start server on custom port
./wp-static-scraper serve -port 3000Scrape command:
-url: (Required) The URL of the website to scrape-out: (Optional) Output HTML file path (default: "index.html")-concurrency: (Optional) Number of concurrent download workers, 1-100 (default: 100)
Serve command:
-port: (Optional) Port for HTTP server (default: 8080)
When you run the scraper, it creates an organized output/ directory:
output/index.html(or custom filename) with all references updated to local assetsoutput/assets/directory containing downloaded CSS and JavaScript filesoutput/assets/fonts/directory containing all downloaded font files (TTF, WOFF, WOFF2, EOT, SVG formats)output/assets/images/directory containing all downloaded images (PNG, JPG, GIF, WebP, SVG formats)
# 1. Scrape a website
./wp-static-scraper scrape -url "https://example.com"
# 2. Start local server to view the scraped content
./wp-static-scraper serve
# 3. Open http://localhost:8080 in your browserThis creates the following structure:
output/
├── index.html
└── assets/
├── all.min.css
├── style.css
├── main.js
├── other-scripts.js
├── fonts/
│ ├── fa-brands-400.woff2
│ ├── fa-regular-400.woff2
│ ├── fa-solid-900.woff2
│ ├── Montserrat-Regular.woff2
│ └── other-fonts...
└── images/
├── logo.png
├── hero-image.jpg
├── banner-mobile.webp
├── icon.svg
└── other-images...
The scraper intelligently discovers and downloads all assets referenced in web pages:
Fonts:
- FontAwesome: Automatically detects and downloads all FontAwesome font variants
- Google Fonts: Downloads custom fonts from Google Fonts and other CDNs
- Protocol-relative URLs: Handles
//domain.com/font.woff2references - Inline CSS: Processes fonts in
<style>tags - Multiple formats: Supports TTF, WOFF, WOFF2, EOT, and SVG font formats
- Relative paths: Handles both absolute URLs and relative paths (e.g.,
../webfonts/)
Images:
- Responsive images: Processes
srcsetattributes with size descriptors - Background images: Extracts images from inline
styleattributes - Meta images: Downloads og:image, twitter:image, and other meta tag images
- Lazy loading: Handles
data-srcattributes for deferred loading - All formats: PNG, JPG, GIF, WebP, SVG, and more
Scripts & Styles:
- Preload links: Properly handles
<link rel="preload">tags - Source maps: Removes
sourceMappingURLreferences to prevent errors - Error suppression: Injects scripts to handle development server errors
- Parses HTML, CSS, and JavaScript to find all asset references
- Resolves relative paths against the original website's base URL
- Updates all references to use local paths for offline viewing
- Handles complex CSS with nested imports and font-face declarations
- Each scrape automatically removes previous assets and HTML files
- Prevents mixing assets from different websites
- Ensures fresh, clean results every time
- Creates organized directory structure for easy navigation
The scraper uses an optimized concurrent download system for maximum performance:
- True parallelism: All asset types download simultaneously (not in sequential phases)
- HTTP connection pooling: Reuses connections for better network efficiency
- Optimized worker pool: Simple job queue with atomic counters eliminates bottlenecks
- Non-blocking retries: Failed downloads retry asynchronously without blocking workers
- Upfront asset discovery: Finds all assets including fonts from inline CSS immediately
Benchmark: 53% performance improvement (10s → 4.7s) on complex websites with 50 concurrent workers.
The application is built with a modular package structure for maintainability and clarity:
main.go: Entry point with command routingcommands/: Command handlers for scrape, serve, and usageassets/: High-performance asset downloading with concurrent worker poolhtml/: HTML processing and error suppression utilitiesutils/: Shared utilities for cleanup and URL resolutionoutput/: Generated directory containing scraped content
- Go 1.24.0 or later
- Internet connection for downloading assets
This project is licensed under the MIT License - see the LICENSE file for details.