# wcopy

LLM-Focused Web Crawler with Markdown Output, from the PithomLabs/wcopy repository.

A lightweight Go-based web crawler designed specifically for LLM ingestion and RAG pipelines. Extracts clean, structured Markdown from websites while filtering out boilerplate, scripts, and styling.

## Features

- **Markdown Output** - Converts HTML to clean Markdown with headers, lists, links, bold, italic, and blockquotes
- **JavaScript Rendering** - Uses headless Chrome to capture dynamically loaded content
- **Smart Boilerplate Removal** - Filters navigation, footers, cookie banners, ads, and other noise
- **No CSS/JS Leakage** - Cleans extracted text of any residual code patterns
- **Content Deduplication** - SHA-256 hashing prevents duplicate page storage
- **Same-Domain Safety** - Only crawls within the starting hostname
- **LLM-Ready Output** - Optimized for downstream AI processing

## Installation

### Prerequisites

- Go 1.21+
- Google Chrome or Chromium

### Build

```bash
git clone https://github.com/PithomLabs/wcopy.git
cd wcopy
go build -o wcopy main.go
```

## Usage

```bash
# Basic usage
./wcopy -url https://example.com

# Custom output directory
./wcopy -url https://docs.example.com -out ./crawled_docs
```

### Command Line Flags

| Flag   | Required | Default  | Description           |
|--------|----------|----------|-----------------------|
| `-url` | Yes      | (none)   | Starting URL to crawl |
| `-out` | No       | `output` | Output directory      |
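The flag surface above can be sketched in Go's standard `flag` package. This is an illustrative sketch, not wcopy's actual source: the function name `parseFlags` is assumed, and only the two documented flags are modeled.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags mirrors the documented CLI: -url is required and -out
// defaults to "output". The function name is illustrative; the real
// parsing lives in main.go.
func parseFlags(args []string) (startURL, outDir string, err error) {
	fs := flag.NewFlagSet("wcopy", flag.ContinueOnError)
	fs.StringVar(&startURL, "url", "", "starting URL to crawl (required)")
	fs.StringVar(&outDir, "out", "output", "output directory")
	if err = fs.Parse(args); err != nil {
		return
	}
	if startURL == "" {
		err = fmt.Errorf("-url is required")
	}
	return
}

func main() {
	u, out, err := parseFlags([]string{"-url", "https://example.com"})
	fmt.Println(u, out, err) // https://example.com output <nil>
}
```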

## Output Structure

```
output/
├── html/                           # Raw rendered HTML (post-JavaScript)
│   ├── https_example.com_.html
│   └── https_example.com_docs_.html
└── markdown/                       # Clean Markdown files
    ├── https_example.com_.md
    └── https_example.com_docs_.md
```
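The filenames above suggest a simple URL-to-filename mapping: `://` and `/` become `_`. A minimal Go sketch of that apparent scheme (the function name `pageFilename` is assumed, and the real mapping in main.go may handle more characters):

```go
package main

import (
	"fmt"
	"strings"
)

// pageFilename maps a crawled URL to an on-disk name like the tree
// above, e.g. "https://example.com/docs/" -> "https_example.com_docs_.md".
// Sketch of the apparent scheme only.
func pageFilename(rawURL, ext string) string {
	r := strings.NewReplacer("://", "_", "/", "_")
	return r.Replace(rawURL) + "." + ext
}

func main() {
	fmt.Println(pageFilename("https://example.com/", "html"))    // https_example.com_.html
	fmt.Println(pageFilename("https://example.com/docs/", "md")) // https_example.com_docs_.md
}
```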

## Markdown Conversion

wcopy converts HTML elements to proper Markdown formatting:

| HTML Element       | Markdown Output            |
|--------------------|----------------------------|
| `<h1>`             | `# Heading`                |
| `<h2>`             | `## Heading`               |
| `<h3>`             | `### Heading`              |
| `<p>`              | Paragraph with blank lines |
| `<ul><li>`         | `- List item`              |
| `<ol><li>`         | `1. List item`             |
| `<a href="url">`   | `[text](url)`              |
| `<strong>`, `<b>`  | `**bold**`                 |
| `<em>`, `<i>`      | `*italic*`                 |
| `<code>`           | `` `code` ``               |
| `<pre>`            | Fenced code block          |
| `<blockquote>`     | `> quoted text`            |
| `<img>`            | `![alt](src)`              |
| `<table>`          | Markdown pipe table        |
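To illustrate a few of the mappings above, here is a naive regexp-based sketch. This is not wcopy's converter (which presumably walks the rendered DOM); regexes on HTML are fragile and the function name `toMarkdown` is assumed:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// toMarkdown applies a handful of the element mappings from the table
// above with naive regular expressions. Illustrative only.
func toMarkdown(html string) string {
	rules := []struct {
		re   *regexp.Regexp
		repl string
	}{
		{regexp.MustCompile(`(?s)<h2[^>]*>(.*?)</h2>`), "\n## $1\n"},
		{regexp.MustCompile(`(?s)<(?:strong|b)>(.*?)</(?:strong|b)>`), "**$1**"},
		{regexp.MustCompile(`(?s)<(?:em|i)>(.*?)</(?:em|i)>`), "*$1*"},
		{regexp.MustCompile(`(?s)<a\s+href="([^"]*)"[^>]*>(.*?)</a>`), "[$2]($1)"},
		{regexp.MustCompile(`(?s)<blockquote>(.*?)</blockquote>`), "\n> $1\n"},
		{regexp.MustCompile(`(?s)<p[^>]*>(.*?)</p>`), "\n$1\n"},
	}
	for _, r := range rules {
		html = r.re.ReplaceAllString(html, r.repl)
	}
	return strings.TrimSpace(html)
}

func main() {
	fmt.Println(toMarkdown(`<h2>About Us</h2><p>See <a href="/history">history</a>.</p>`))
}
```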

### Example Output

Input HTML:

```html
<h2>About Us</h2>
<p>The <em>Iglesia Ni Cristo</em> is a Christian religion.</p>
<blockquote>"So always be ready..."</blockquote>
<a href="/history">Learn more</a>
```

Output Markdown:

```markdown
## About Us

The *Iglesia Ni Cristo* is a Christian religion.

> "So always be ready..."

[Learn more](/history)
```

## Content Extraction

### What Gets Extracted

- Main article content
- Headings (`h1`-`h6`)
- Paragraphs
- Lists (ordered and unordered)
- Tables (as Markdown)
- Blockquotes
- Links with URLs
- Images with alt text
- Bold and italic text
- Code blocks

### What Gets Filtered

- `<script>` tags and inline JavaScript
- `<style>` tags and CSS rules
- Navigation bars (`<nav>`, `.nav`, `.menu`)
- Footers (`<footer>`, `.footer`)
- Cookie consent banners
- Advertisement blocks
- Social media widgets
- Forms and inputs
- SVG icons and paths
- Elements with `aria-hidden="true"`

## Boilerplate Detection

The crawler uses multiple heuristics to identify and remove boilerplate:

1. **Tag Blacklist** - `script`, `style`, `nav`, `footer`, `aside`, `form`, `noscript`, `iframe`, `svg`

2. **Class/ID Patterns** - Elements whose class or id contains: `nav`, `menu`, `footer`, `cookie`, `consent`, `banner`, `modal`, `ads`, `promo`, `social`, `share`, `popup`, `overlay`

3. **Cookie Consent Text** - Phrases like "We use cookies", "privacy policy", "accept all cookies"

4. **Code Pattern Detection** - Regex patterns catch residual CSS (`background:`, `.class{}`) and JavaScript (`var x =`, `$(function)`)

5. **Text Density Scoring** - During text extraction, blocks with a low text-to-link ratio are filtered out
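Heuristics 2 and 3 can be sketched as plain substring checks. The pattern lists below are taken verbatim from this README; the function name `isBoilerplate` and its signature are assumptions, not wcopy's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Class/ID patterns and cookie-consent phrases, as listed above.
var boilerplateAttr = []string{
	"nav", "menu", "footer", "cookie", "consent", "banner",
	"modal", "ads", "promo", "social", "share", "popup", "overlay",
}

var cookiePhrases = []string{
	"we use cookies", "privacy policy", "accept all cookies",
}

// isBoilerplate reports whether an element looks like boilerplate,
// based on its combined class/id string and its text content.
func isBoilerplate(classAndID, text string) bool {
	attr := strings.ToLower(classAndID)
	for _, p := range boilerplateAttr {
		if strings.Contains(attr, p) {
			return true
		}
	}
	t := strings.ToLower(text)
	for _, p := range cookiePhrases {
		if strings.Contains(t, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBoilerplate("cookie-banner", ""))          // true
	fmt.Println(isBoilerplate("article-body", "Plain text")) // false
}
```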

## Crawling Behavior

- **Recursive** - Follows all links within the same domain
- **BFS Order** - Breadth-first traversal of pages
- **Deduplication** - Content hashing prevents saving identical pages
- **URL Normalization** - Removes fragments (`#section`), normalizes paths
- **Single Visit** - Each URL is processed exactly once
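The loop described above (BFS queue, single visit, SHA-256 content dedup) can be sketched as follows. The `crawl` function and the injected `fetch` callback are illustrative stand-ins for the real renderer and saver in main.go:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// crawl performs a breadth-first traversal from start, visiting each
// URL once and skipping pages whose SHA-256 content hash was already
// seen. fetch stands in for the real page renderer.
func crawl(start string, fetch func(string) (content string, links []string)) []string {
	visited := map[string]bool{start: true}
	seenHash := map[string]bool{}
	queue := []string{start}
	var saved []string
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:] // BFS: pop from the front
		content, links := fetch(u)
		h := sha256.Sum256([]byte(content))
		key := hex.EncodeToString(h[:])
		if !seenHash[key] { // content dedup
			seenHash[key] = true
			saved = append(saved, u)
		}
		for _, l := range links {
			if !visited[l] { // single visit per URL
				visited[l] = true
				queue = append(queue, l)
			}
		}
	}
	return saved
}

func main() {
	// Tiny in-memory "site": /b duplicates the content of /.
	pages := map[string]struct {
		content string
		links   []string
	}{
		"/":  {"home", []string{"/a", "/b"}},
		"/a": {"page a", []string{"/"}},
		"/b": {"home", nil},
	}
	fetch := func(u string) (string, []string) {
		p := pages[u]
		return p.content, p.links
	}
	fmt.Println(crawl("/", fetch)) // [/ /a], /b deduplicated
}
```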

### URL Filtering

Automatically skips:

- External domains
- `mailto:` links
- `javascript:` pseudo-protocol links
- URL fragments (removed before comparison)
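The filtering rules above map naturally onto Go's `net/url`. A minimal sketch, assuming a helper shaped like `normalizeLink` (the name and signature are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeLink applies the rules above: drop mailto: and javascript:
// links, reject external hosts, and strip #fragments before comparison.
// Returns the normalized URL and whether it should be crawled.
func normalizeLink(baseHost, link string) (string, bool) {
	if strings.HasPrefix(link, "mailto:") || strings.HasPrefix(link, "javascript:") {
		return "", false
	}
	u, err := url.Parse(link)
	if err != nil {
		return "", false
	}
	if u.Host != "" && u.Host != baseHost {
		return "", false // same-domain safety
	}
	u.Fragment = "" // fragments removed before comparison
	return u.String(), true
}

func main() {
	fmt.Println(normalizeLink("example.com", "https://example.com/docs#intro"))
	fmt.Println(normalizeLink("example.com", "https://other.com/"))
	fmt.Println(normalizeLink("example.com", "mailto:hi@example.com"))
}
```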

## Performance Characteristics

| Aspect               | Description                      |
|----------------------|----------------------------------|
| Extraction Speed     | ~1-3ms per page (heuristic-based)|
| Memory Usage         | Low (streaming processing)       |
| JavaScript Rendering | Enabled via headless Chrome      |
| NLP/ML               | None (fast heuristics only)      |
| Parallelism          | Single-threaded queue            |

## Known Limitations

- No `robots.txt` handling
- No rate limiting or backoff
- No `sitemap.xml` parsing
- No retry logic for failed requests
- Single-threaded crawling
- Requires Chrome/Chromium to be installed
- Not suitable for very large sites (10,000+ pages) without modification

## Design Philosophy

> "Extract less, but extract better."

wcopy prioritizes signal over completeness. Every design decision favors:

1. **LLM usefulness** over raw data volume
2. **Clean output** over comprehensive capture
3. **Speed** over perfect accuracy
4. **Simplicity** over configurability

The tool accepts 80-90% extraction accuracy because LLMs tolerate minor noise but suffer from missing critical content.

## Expected Accuracy

| Site Type       | Extraction Accuracy |
|-----------------|---------------------|
| Documentation   | 90-95%              |
| Blogs           | 90-95%              |
| Marketing Sites | 80-90%              |
| JS-Heavy SPAs   | 75-85%              |
| E-commerce      | 70-80%              |

## Use Cases

- Preparing datasets for LLM fine-tuning
- Building RAG (Retrieval-Augmented Generation) corpora
- Website documentation ingestion
- Knowledge base archiving
- Content migration projects
- Research data collection

## Dependencies

## License

MIT License

## About

Like wget and HTTrack, but not complicated; built for LLM use.
