| title | description |
|---|---|
wcopy |
LLM-Focused Web Crawler with Markdown Output |
A lightweight Go-based web crawler designed specifically for LLM ingestion and RAG pipelines. Extracts clean, structured Markdown from websites while filtering out boilerplate, scripts, and styling.
- Markdown Output - Converts HTML to clean Markdown with headers, lists, links, bold, italic, blockquotes
- JavaScript Rendering - Uses headless Chrome to capture dynamically loaded content
- Smart Boilerplate Removal - Filters navigation, footers, cookie banners, ads, and other noise
- No CSS/JS Leakage - Cleans extracted text of any residual code patterns
- Content Deduplication - SHA-256 hashing prevents duplicate page storage
- Same-Domain Safety - Only crawls within the starting hostname
- LLM-Ready Output - Optimized for downstream AI processing
- Go 1.21+
- Google Chrome or Chromium
git clone https://github.com/user/wcopy.git
cd wcopy
go build -o wcopy main.go# Basic usage
./wcopy -url https://example.com
# Custom output directory
./wcopy -url https://docs.example.com -out ./crawled_docs| Flag | Required | Default | Description |
|---|---|---|---|
-url |
Yes | - | Starting URL to crawl |
-out |
No | output |
Output directory |
output/
├── html/ # Raw rendered HTML (post-JavaScript)
│ ├── https_example.com_.html
│ └── https_example.com_docs_.html
└── markdown/ # Clean Markdown files
├── https_example.com_.md
└── https_example.com_docs_.md
wcopy converts HTML elements to proper Markdown formatting:
| HTML Element | Markdown Output |
|---|---|
<h1> |
# Heading |
<h2> |
## Heading |
<h3> |
### Heading |
<p> |
Paragraph with blank lines |
<ul><li> |
- List item |
<ol><li> |
1. List item |
<a href="url"> |
[text](url) |
<strong>, <b> |
**bold** |
<em>, <i> |
*italic* |
<code> |
`code` |
<pre> |
```code block``` |
<blockquote> |
> quoted text |
<img> |
 |
<table> |
Markdown table with | |
Input HTML:
<h2>About Us</h2>
<p>The <em>Iglesia Ni Cristo</em> is a Christian religion.</p>
<blockquote>"So always be ready..."</blockquote>
<a href="/history">Learn more</a>Output Markdown:
## About Us
The *Iglesia Ni Cristo* is a Christian religion.
> "So always be ready..."
[Learn more](/history)- Main article content
- Headings (h1-h6)
- Paragraphs
- Lists (ordered and unordered)
- Tables (as Markdown)
- Blockquotes
- Links with URLs
- Images with alt text
- Bold and italic text
- Code blocks
<script>tags and inline JavaScript<style>tags and CSS rules- Navigation bars (
<nav>,.nav,.menu) - Footers (
<footer>,.footer) - Cookie consent banners
- Advertisement blocks
- Social media widgets
- Forms and inputs
- SVG icons and paths
- Elements with
aria-hidden="true"
The crawler uses multiple heuristics to identify and remove boilerplate:
-
Tag Blacklist -
script,style,nav,footer,aside,form,noscript,iframe,svg -
Class/ID Patterns - Elements containing:
nav,menu,footer,cookie,consent,banner,modal,ads,promo,social,share,popup,overlay -
Cookie Consent Text - Phrases like "We use cookies", "privacy policy", "accept all cookies"
-
Code Pattern Detection - Regex patterns catch residual CSS (
background:,.class{}) and JavaScript (var x =,$(function)) -
Text Density Scoring - For original text extraction, blocks with low text-to-link ratios are filtered
- Recursive - Follows all links within the same domain
- BFS Order - Breadth-first traversal of pages
- Deduplication - Content hashing prevents saving identical pages
- URL Normalization - Removes fragments (
#section), normalizes paths - Single Visit - Each URL processed exactly once
Automatically skips:
- External domains
mailto:linksjavascript:pseudo-protocol- URL fragments (removed before comparison)
| Aspect | Description |
|---|---|
| Extraction Speed | ~1-3ms per page (heuristic-based) |
| Memory Usage | Low (streaming processing) |
| JavaScript Rendering | Enabled via headless Chrome |
| NLP/ML | None (fast heuristics only) |
| Parallelism | Single-threaded queue |
- No
robots.txthandling - No rate limiting or backoff
- No sitemap.xml parsing
- No retry logic for failed requests
- Single-threaded crawling
- Requires Chrome/Chromium installed
- Not suitable for very large sites (10,000+ pages) without modification
"Extract less, but extract better."
wcopy prioritizes signal over completeness. Every design decision favors:
- LLM usefulness over raw data volume
- Clean output over comprehensive capture
- Speed over perfect accuracy
- Simplicity over configurability
The tool accepts 80-90% extraction accuracy because LLMs tolerate minor noise but suffer from missing critical content.
| Site Type | Extraction Accuracy |
|---|---|
| Documentation | 90-95% |
| Blogs | 90-95% |
| Marketing Sites | 80-90% |
| JS-Heavy SPAs | 75-85% |
| E-commerce | 70-80% |
- Preparing datasets for LLM fine-tuning
- Building RAG (Retrieval-Augmented Generation) corpora
- Website documentation ingestion
- Knowledge base archiving
- Content migration projects
- Research data collection
- chromedp - Headless Chrome automation
- golang.org/x/net/html - HTML parsing
MIT License