# wcopy

LLM-Focused Web Crawler with Markdown Output, from the PithomLabs/wcopy repository.

A lightweight Go-based web crawler designed specifically for LLM ingestion and RAG pipelines. Extracts clean, structured Markdown from websites while filtering out boilerplate, scripts, and styling.

## Features

- **Markdown Output** - Converts HTML to clean Markdown with headers, lists, links, bold, italic, and blockquotes
- **JavaScript Rendering** - Uses headless Chrome to capture dynamically loaded content
- **Smart Boilerplate Removal** - Filters navigation, footers, cookie banners, ads, and other noise
- **No CSS/JS Leakage** - Cleans extracted text of any residual code patterns
- **Content Deduplication** - SHA-256 hashing prevents duplicate page storage
- **Same-Domain Safety** - Only crawls within the starting hostname
- **LLM-Ready Output** - Optimized for downstream AI processing

## Installation

### Prerequisites

- Go 1.21+
- Google Chrome or Chromium

### Build

```bash
git clone https://github.com/PithomLabs/wcopy.git
cd wcopy
go build -o wcopy main.go
```

## Usage

```bash
# Basic usage
./wcopy -url https://example.com

# Custom output directory
./wcopy -url https://docs.example.com -out ./crawled_docs
```

### Command Line Flags

| Flag   | Required | Default  | Description           |
|--------|----------|----------|-----------------------|
| `-url` | Yes      | (none)   | Starting URL to crawl |
| `-out` | No       | `output` | Output directory      |
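The flag surface above can be sketched in Go's standard `flag` package. This is an illustrative sketch, not wcopy's actual source: the function name `parseFlags` is assumed, and only the two documented flags are modeled.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags mirrors the documented CLI: -url is required and -out
// defaults to "output". The function name is illustrative; the real
// parsing lives in main.go.
func parseFlags(args []string) (startURL, outDir string, err error) {
	fs := flag.NewFlagSet("wcopy", flag.ContinueOnError)
	fs.StringVar(&startURL, "url", "", "starting URL to crawl (required)")
	fs.StringVar(&outDir, "out", "output", "output directory")
	if err = fs.Parse(args); err != nil {
		return
	}
	if startURL == "" {
		err = fmt.Errorf("-url is required")
	}
	return
}

func main() {
	u, out, err := parseFlags([]string{"-url", "https://example.com"})
	fmt.Println(u, out, err) // https://example.com output <nil>
}
```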

## Output Structure

```
output/
├── html/                           # Raw rendered HTML (post-JavaScript)
│   ├── https_example.com_.html
│   └── https_example.com_docs_.html
└── markdown/                       # Clean Markdown files
    ├── https_example.com_.md
    └── https_example.com_docs_.md
```
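The filenames above suggest a simple URL-to-filename mapping: `://` and `/` become `_`. A minimal Go sketch of that apparent scheme (the function name `pageFilename` is assumed, and the real mapping in main.go may handle more characters):

```go
package main

import (
	"fmt"
	"strings"
)

// pageFilename maps a crawled URL to an on-disk name like the tree
// above, e.g. "https://example.com/docs/" -> "https_example.com_docs_.md".
// Sketch of the apparent scheme only.
func pageFilename(rawURL, ext string) string {
	r := strings.NewReplacer("://", "_", "/", "_")
	return r.Replace(rawURL) + "." + ext
}

func main() {
	fmt.Println(pageFilename("https://example.com/", "html"))    // https_example.com_.html
	fmt.Println(pageFilename("https://example.com/docs/", "md")) // https_example.com_docs_.md
}
```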

## Markdown Conversion

wcopy converts HTML elements to proper Markdown formatting:

| HTML Element       | Markdown Output            |
|--------------------|----------------------------|
| `<h1>`             | `# Heading`                |
| `<h2>`             | `## Heading`               |
| `<h3>`             | `### Heading`              |
| `<p>`              | Paragraph with blank lines |
| `<ul><li>`         | `- List item`              |
| `<ol><li>`         | `1. List item`             |
| `<a href="url">`   | `[text](url)`              |
| `<strong>`, `<b>`  | `**bold**`                 |
| `<em>`, `<i>`      | `*italic*`                 |
| `<code>`           | `` `code` ``               |
| `<pre>`            | Fenced code block          |
| `<blockquote>`     | `> quoted text`            |
| `<img>`            | `![alt](src)`              |
| `<table>`          | Markdown pipe table        |
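To illustrate a few of the mappings above, here is a naive regexp-based sketch. This is not wcopy's converter (which presumably walks the rendered DOM); regexes on HTML are fragile and the function name `toMarkdown` is assumed:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// toMarkdown applies a handful of the element mappings from the table
// above with naive regular expressions. Illustrative only.
func toMarkdown(html string) string {
	rules := []struct {
		re   *regexp.Regexp
		repl string
	}{
		{regexp.MustCompile(`(?s)<h2[^>]*>(.*?)</h2>`), "\n## $1\n"},
		{regexp.MustCompile(`(?s)<(?:strong|b)>(.*?)</(?:strong|b)>`), "**$1**"},
		{regexp.MustCompile(`(?s)<(?:em|i)>(.*?)</(?:em|i)>`), "*$1*"},
		{regexp.MustCompile(`(?s)<a\s+href="([^"]*)"[^>]*>(.*?)</a>`), "[$2]($1)"},
		{regexp.MustCompile(`(?s)<blockquote>(.*?)</blockquote>`), "\n> $1\n"},
		{regexp.MustCompile(`(?s)<p[^>]*>(.*?)</p>`), "\n$1\n"},
	}
	for _, r := range rules {
		html = r.re.ReplaceAllString(html, r.repl)
	}
	return strings.TrimSpace(html)
}

func main() {
	fmt.Println(toMarkdown(`<h2>About Us</h2><p>See <a href="/history">history</a>.</p>`))
}
```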

### Example Output

Input HTML:

```html
<h2>About Us</h2>
<p>The <em>Iglesia Ni Cristo</em> is a Christian religion.</p>
<blockquote>"So always be ready..."</blockquote>
<a href="/history">Learn more</a>
```

Output Markdown:

```markdown
## About Us

The *Iglesia Ni Cristo* is a Christian religion.

> "So always be ready..."

[Learn more](/history)
```

## Content Extraction

### What Gets Extracted

- Main article content
- Headings (`h1`-`h6`)
- Paragraphs
- Lists (ordered and unordered)
- Tables (as Markdown)
- Blockquotes
- Links with URLs
- Images with alt text
- Bold and italic text
- Code blocks

### What Gets Filtered

- `<script>` tags and inline JavaScript
- `<style>` tags and CSS rules
- Navigation bars (`<nav>`, `.nav`, `.menu`)
- Footers (`<footer>`, `.footer`)
- Cookie consent banners
- Advertisement blocks
- Social media widgets
- Forms and inputs
- SVG icons and paths
- Elements with `aria-hidden="true"`

## Boilerplate Detection

The crawler uses multiple heuristics to identify and remove boilerplate:

1. **Tag Blacklist** - `script`, `style`, `nav`, `footer`, `aside`, `form`, `noscript`, `iframe`, `svg`

2. **Class/ID Patterns** - Elements whose class or id contains: `nav`, `menu`, `footer`, `cookie`, `consent`, `banner`, `modal`, `ads`, `promo`, `social`, `share`, `popup`, `overlay`

3. **Cookie Consent Text** - Phrases like "We use cookies", "privacy policy", "accept all cookies"

4. **Code Pattern Detection** - Regex patterns catch residual CSS (`background:`, `.class{}`) and JavaScript (`var x =`, `$(function)`)

5. **Text Density Scoring** - During text extraction, blocks with a low text-to-link ratio are filtered out
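Heuristics 2 and 3 can be sketched as plain substring checks. The pattern lists below are taken verbatim from this README; the function name `isBoilerplate` and its signature are assumptions, not wcopy's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Class/ID patterns and cookie-consent phrases, as listed above.
var boilerplateAttr = []string{
	"nav", "menu", "footer", "cookie", "consent", "banner",
	"modal", "ads", "promo", "social", "share", "popup", "overlay",
}

var cookiePhrases = []string{
	"we use cookies", "privacy policy", "accept all cookies",
}

// isBoilerplate reports whether an element looks like boilerplate,
// based on its combined class/id string and its text content.
func isBoilerplate(classAndID, text string) bool {
	attr := strings.ToLower(classAndID)
	for _, p := range boilerplateAttr {
		if strings.Contains(attr, p) {
			return true
		}
	}
	t := strings.ToLower(text)
	for _, p := range cookiePhrases {
		if strings.Contains(t, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBoilerplate("cookie-banner", ""))          // true
	fmt.Println(isBoilerplate("article-body", "Plain text")) // false
}
```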

## Crawling Behavior

- **Recursive** - Follows all links within the same domain
- **BFS Order** - Breadth-first traversal of pages
- **Deduplication** - Content hashing prevents saving identical pages
- **URL Normalization** - Removes fragments (`#section`), normalizes paths
- **Single Visit** - Each URL is processed exactly once
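The loop described above (BFS queue, single visit, SHA-256 content dedup) can be sketched as follows. The `crawl` function and the injected `fetch` callback are illustrative stand-ins for the real renderer and saver in main.go:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// crawl performs a breadth-first traversal from start, visiting each
// URL once and skipping pages whose SHA-256 content hash was already
// seen. fetch stands in for the real page renderer.
func crawl(start string, fetch func(string) (content string, links []string)) []string {
	visited := map[string]bool{start: true}
	seenHash := map[string]bool{}
	queue := []string{start}
	var saved []string
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:] // BFS: pop from the front
		content, links := fetch(u)
		h := sha256.Sum256([]byte(content))
		key := hex.EncodeToString(h[:])
		if !seenHash[key] { // content dedup
			seenHash[key] = true
			saved = append(saved, u)
		}
		for _, l := range links {
			if !visited[l] { // single visit per URL
				visited[l] = true
				queue = append(queue, l)
			}
		}
	}
	return saved
}

func main() {
	// Tiny in-memory "site": /b duplicates the content of /.
	pages := map[string]struct {
		content string
		links   []string
	}{
		"/":  {"home", []string{"/a", "/b"}},
		"/a": {"page a", []string{"/"}},
		"/b": {"home", nil},
	}
	fetch := func(u string) (string, []string) {
		p := pages[u]
		return p.content, p.links
	}
	fmt.Println(crawl("/", fetch)) // [/ /a], /b deduplicated
}
```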

### URL Filtering

Automatically skips:

- External domains
- `mailto:` links
- `javascript:` pseudo-protocol links
- URL fragments (removed before comparison)
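The filtering rules above map naturally onto Go's `net/url`. A minimal sketch, assuming a helper shaped like `normalizeLink` (the name and signature are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeLink applies the rules above: drop mailto: and javascript:
// links, reject external hosts, and strip #fragments before comparison.
// Returns the normalized URL and whether it should be crawled.
func normalizeLink(baseHost, link string) (string, bool) {
	if strings.HasPrefix(link, "mailto:") || strings.HasPrefix(link, "javascript:") {
		return "", false
	}
	u, err := url.Parse(link)
	if err != nil {
		return "", false
	}
	if u.Host != "" && u.Host != baseHost {
		return "", false // same-domain safety
	}
	u.Fragment = "" // fragments removed before comparison
	return u.String(), true
}

func main() {
	fmt.Println(normalizeLink("example.com", "https://example.com/docs#intro"))
	fmt.Println(normalizeLink("example.com", "https://other.com/"))
	fmt.Println(normalizeLink("example.com", "mailto:hi@example.com"))
}
```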

## Performance Characteristics

| Aspect               | Description                      |
|----------------------|----------------------------------|
| Extraction Speed     | ~1-3ms per page (heuristic-based)|
| Memory Usage         | Low (streaming processing)       |
| JavaScript Rendering | Enabled via headless Chrome      |
| NLP/ML               | None (fast heuristics only)      |
| Parallelism          | Single-threaded queue            |

## Known Limitations

- No `robots.txt` handling
- No rate limiting or backoff
- No `sitemap.xml` parsing
- No retry logic for failed requests
- Single-threaded crawling
- Requires Chrome/Chromium to be installed
- Not suitable for very large sites (10,000+ pages) without modification

## Design Philosophy

> "Extract less, but extract better."

wcopy prioritizes signal over completeness. Every design decision favors:

1. **LLM usefulness** over raw data volume
2. **Clean output** over comprehensive capture
3. **Speed** over perfect accuracy
4. **Simplicity** over configurability

The tool accepts 80-90% extraction accuracy because LLMs tolerate minor noise but suffer from missing critical content.

## Expected Accuracy

| Site Type       | Extraction Accuracy |
|-----------------|---------------------|
| Documentation   | 90-95%              |
| Blogs           | 90-95%              |
| Marketing Sites | 80-90%              |
| JS-Heavy SPAs   | 75-85%              |
| E-commerce      | 70-80%              |

## Use Cases

- Preparing datasets for LLM fine-tuning
- Building RAG (Retrieval-Augmented Generation) corpora
- Website documentation ingestion
- Knowledge base archiving
- Content migration projects
- Research data collection

## Dependencies

## License

MIT License

## About

Like wget and HTTrack, but not complicated; built for LLM use.
