Web Loader Engine

High-performance web content extraction engine built in Rust. Primary purpose is serving as an external web loader for OpenWebUI, but it's flexible enough for any use case that needs clean content extraction from web pages - RAG pipelines, content indexing, web scraping, archiving, and more.

Features

OpenWebUI Compatible - Native API support, drop-in replacement
Multiple Output Formats - Markdown, HTML, plain text, screenshots
Readability Extraction - Mozilla Readability algorithm for clean article content
JavaScript Rendering - Chromium-based rendering for JS-heavy sites
Smart Caching - Built-in response caching with configurable TTL
Rate Limiting - Per-domain rate limiting and circuit breakers
Batch Processing - Process multiple URLs concurrently
Security - SSRF protection, blocked internal IPs, optional API key auth

Quick Start

Docker (Recommended)

docker build -t web-loader-engine .
docker run -d -p 14786:14786 --name web-loader web-loader-engine

Docker Compose

services:
  web-loader:
    build: .
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
      # - API_KEY=your-secret-key
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:

docker-compose up -d

Docker Hub (Pre-built Image)

services:
  web-loader:
    image: edgaras0x4e/web-loader-engine:latest
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
      # - API_KEY=your-secret-key
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:

docker-compose up -d

Then set OpenWebUI's web loader URL to http://web-loader:14786

From Source

Requires Rust 1.70+ and Chrome/Chromium installed.

cp .env.example .env  # Configure settings
cargo build --release
./target/release/web-loader-engine

Configuration

Copy the example environment file and adjust as needed:

cp .env.example .env

Environment variables:

Variable	Default	Description
`API_PORT`	`14786`	Server port
`API_KEY`	-	Optional API key for authentication
`CHROME_PATH`	`/usr/bin/chromium`	Path to Chrome/Chromium binary
`BROWSER_POOL_SIZE`	`10`	Concurrent browser pages
`REQUEST_TIMEOUT`	`30`	Default timeout in seconds
`CACHE_TTL`	`3600`	Cache lifetime in seconds
`SCREENSHOT_DIR`	`/app/screenshots`	Screenshot storage path

API

OpenWebUI Endpoint

POST /

{"urls": ["https://example.com/article"]}

Returns:

[
  {
    "page_content": "# Article Title\n\nContent...",
    "metadata": {
      "source": "https://example.com/article",
      "title": "Article Title"
    }
  }
]

Single URL

POST /load

{"url": "https://example.com"}

Response:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "# Example Domain\n\nThis domain is for examples...",
  "metadata": {
    "processing_time_ms": 1234,
    "cached": false
  }
}

Batch

POST /load/batch

{"urls": ["https://example.com/1", "https://example.com/2"]}

Response:

{
  "results": [
    {
      "url": "https://example.com/1",
      "response": {
        "url": "https://example.com/1",
        "title": "Page Title",
        "content": "...",
        "metadata": {"processing_time_ms": 500, "cached": false}
      }
    }
  ],
  "total_processing_time_ms": 1234
}

Health Check

GET /health

Request Headers

Header	Values	Description
`x-respond-with`	`markdown`, `html`, `text`, `screenshot`, `pageshot`	Output format
`x-wait-for-selector`	CSS selector	Wait for element before extraction
`x-target-selector`	CSS selector	Extract only matching content
`x-remove-selector`	CSS selector	Remove elements before extraction
`x-timeout`	seconds	Request timeout
`x-set-cookie`	`name=value`	Set cookies
`x-no-cache`	`true`	Bypass cache
`x-with-images-summary`	`true`	Include images list
`x-with-links-summary`	`true`	Include links list
`Authorization`	`Bearer <key>`	API key (if configured)

Request Body Options (all optional)

{
  "url": "https://example.com",
  "options": {
    "wait_for_selector": "#content",
    "target_selector": "article",
    "remove_selector": ".ads",
    "timeout": 60
  }
}

Other Use Cases

While built for OpenWebUI, this works for:

RAG Pipelines - Clean content for embeddings and retrieval
Content Archiving - Save readable versions of web pages
Web Scraping - Extract data from JavaScript-rendered pages
Screenshot Services - Programmatic page captures
Search Indexing - Extract text content for indexing

Changelog

v0.1.1

Browser Pool Resilience - Fixed critical issue where dead browser connections would cause requests to hang indefinitely.

Added automatic browser health detection with 5-second timeout on page creation
Implemented connection error detection for Ws(AlreadyClosed) and related WebSocket errors
Auto-recovery: dead browsers are now automatically recreated on connection failure
Request-level retry logic (up to 3 retries) for transient connection errors
Health endpoint now exposes healthy status and recreation_count for monitoring

Health response now includes:

{
  "status": "ok",
  "version": "0.1.1",
  "browser_pool": {
    "available": 10,
    "total": 10,
    "healthy": true,
    "recreation_count": 1
  }
}

Monitor recreation_count increasing to track browser recovery events.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Web Loader Engine

Features

Quick Start

Docker (Recommended)

Docker Compose

Docker Hub (Pre-built Image)

From Source

Configuration

API

OpenWebUI Endpoint

Single URL

Batch

Health Check

Request Headers

Request Body Options (all optional)

Other Use Cases

Changelog

v0.1.1

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Web Loader Engine

Features

Quick Start

Docker (Recommended)

Docker Compose

Docker Hub (Pre-built Image)

From Source

Configuration

API

OpenWebUI Endpoint

Single URL

Batch

Health Check

Request Headers

Request Body Options (all optional)

Other Use Cases

Changelog

v0.1.1

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages