Local “search → fetch → extract” plumbing, exposed as an MCP stdio server.
webpipe is designed for bounded, deterministic evidence gathering:
- cache-first fetch by default
- explicit limits on bytes/chars/results
- stable, schema-versioned JSON outputs
If you run multiple cargo commands concurrently, you can hit:
```
Blocking waiting for file lock on build directory
```
For webpipe work, you can isolate build output by setting a per-repo target dir:
```bash
CARGO_TARGET_DIR=target-webpipe cargo test -p webpipe-mcp --tests -- --skip live_
```

Crate layout:
- `crates/webpipe`: public facade crate (re-exports `webpipe-core`)
- `crates/webpipe-core`: backend-agnostic types + traits
- `crates/webpipe-local`: local implementations (reqwest + filesystem cache + extractors)
- `crates/webpipe-mcp`: MCP stdio server + CLI harness (`webpipe`)
```bash
# From the repo root:
cargo install --path crates/webpipe-mcp --bin webpipe --features stdio --force

# Run as an MCP stdio server:
webpipe mcp-stdio
```

Use `mcp.example.json` as a starting point (it intentionally contains no API keys).
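For reference, here is a minimal sketch of what an `mcp.example.json` entry might look like, assuming the common `mcpServers` layout used by Cursor-style MCP clients (check the actual file in the repo; the server name `webpipe` is just a label):

```json
{
  "mcpServers": {
    "webpipe": {
      "command": "webpipe",
      "args": ["mcp-stdio"]
    }
  }
}
```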
Keyless workflow tip:
- Use the MCP tool `web_seed_urls` to get curated “awesome list” seed URLs, then call `web_search_extract` with `urls=[...]` and `url_selection_mode=query_rank`.
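For example, once `web_seed_urls` has returned some seed URLs, a follow-up `web_search_extract` call might look like this (the URL below is a placeholder for whatever the seed step actually returned; all fields shown are documented later in this README):

```json
{
  "query": "rust async runtimes",
  "urls": ["https://example.com/awesome-rust"],
  "url_selection_mode": "query_rank",
  "max_urls": 3,
  "top_chunks": 5
}
```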
There are two “modes” in which webpipe is used:
- Cursor user (MCP tools): you (or Cursor agents) call tools like `web_search_extract` directly from chat. The tool response is a single JSON payload that Cursor saves/displays as text.
- Agent inside one tool call: a higher-level agent/tool (outside of webpipe, or inside your own app) calls webpipe tools as steps in a workflow. The same JSON payloads are used, but you typically branch on fields like `warning_codes`, `error.code`, and `top_chunks`.
In both cases, the “structured output” is the JSON object:
- `ok: true|false`
- `schema_version`, `kind`, `elapsed_ms`
- optional `warnings` + `warning_codes` + `warning_hints`
- on error: `error: { code, message, hint, retryable }`
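Purely as a structural sketch (string values are elided because the exact codes and messages are not documented here; the `schema_version` and `elapsed_ms` values are invented), an error payload would look roughly like:

```json
{
  "ok": false,
  "schema_version": 1,
  "kind": "...",
  "elapsed_ms": 412,
  "error": {
    "code": "...",
    "message": "...",
    "hint": "...",
    "retryable": true
  }
}
```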
These examples are written as tool arguments (what you paste into the tool call UI).
Outputs are JSON-as-text; look at `ok`, `warning_codes`, and the tool-specific fields described below.
```json
{}
```

Look for:
- `configured.*`: which providers/backends are available
- `supported.*`: allowed enum values
- `defaults.*`: what you get if you omit fields
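A hedged sketch of the response shape (only the `configured.*` / `supported.*` / `defaults.*` groupings are documented above; the specific keys and values here are illustrative assumptions, drawn from providers and backends mentioned elsewhere in this README):

```json
{
  "ok": true,
  "configured": { "search_providers": ["brave", "tavily"] },
  "supported": { "fetch_backend": ["local", "firecrawl"] },
  "defaults": { "max_results": 5 }
}
```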
Online query-mode:
```json
{
"query": "latest arxiv papers on tool-using LLM agents",
"provider": "auto",
"auto_mode": "fallback",
"fetch_backend": "local",
"agentic": true,
"max_results": 5,
"max_urls": 3,
"top_chunks": 5,
"max_chunk_chars": 500
}
```

Offline-ish urls-mode (no search; optionally rank chunks by query):

```json
{
"query": "installation steps",
"urls": ["https://example.com/docs/install", "https://example.com/docs/getting-started"],
"url_selection_mode": "query_rank",
"fetch_backend": "local",
"no_network": false,
"max_urls": 2,
"top_chunks": 4
}
```

Look for:
- `top_chunks[]`: the best evidence snippets (bounded)
- `results[]`: per-URL details
- `warning_hints`: concrete next moves when extraction is empty/low-signal/blocked
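A hedged sketch of a successful response (only `top_chunks[]`, `results[]`, and `warning_hints` are documented above; the per-chunk fields `url`, `text`, and `score` are assumptions for illustration):

```json
{
  "ok": true,
  "top_chunks": [
    { "url": "https://example.com/docs/install", "text": "...", "score": 0.82 }
  ],
  "results": [
    { "url": "https://example.com/docs/install" }
  ],
  "warning_codes": [],
  "warning_hints": []
}
```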
```json
{
"url": "https://example.com/docs/install",
"fetch_backend": "local",
"include_text": true,
"max_chars": 20000,
"include_links": false
}
```

Use this when you already know the URL and want extraction, not discovery.
```json
{
"url": "https://example.com/",
"fetch_backend": "local",
"include_text": false,
"include_headers": false,
"max_bytes": 5000000
}
```

Use this when you need status/content-type/bytes and only sometimes need bounded text.
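For orientation only, a response might surface that status/content-type/bytes information roughly like this (the exact field names are assumptions based on the description above, not documented):

```json
{
  "ok": true,
  "status": 200,
  "content_type": "text/html; charset=utf-8",
  "bytes": 51234
}
```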
```json
{
"query": "rmcp mcp server examples",
"provider": "auto",
"auto_mode": "fallback",
"max_results": 10
}
```

```json
{
"query": "route handlers",
"max_docs": 50,
"top_chunks": 5
}
```

Use this for deterministic “did we already fetch something relevant?” checks.
Evidence-only mode (most inspectable):
```json
{
"query": "What are the main recent approaches to tool-using LLM agents?",
"synthesize": false,
"include_evidence": true,
"provider": "auto",
"fetch_backend": "local",
"max_urls": 3,
"top_chunks": 5
}
```

```json
{
"query": "tool-using agents",
"categories": ["cs.AI", "cs.CL"],
"per_page": 10
}
```

```json
{
"id_or_url": "https://arxiv.org/abs/2401.12345",
"include_discussions": true,
"max_discussion_urls": 2
}
```

If you’re writing an agent that uses webpipe as a subsystem, the most reliable pattern is:
- Start bounded: small `max_results`, `max_urls`, `top_chunks`, `max_chars`.
- Branch on warnings (see the sketches after this list):
  - `blocked_by_js_challenge` → try `fetch_backend="firecrawl"` (if configured) or choose a different URL
  - `empty_extraction` / `main_content_low_signal` → increase `max_bytes`/`max_chars`, or switch backend
- Prefer urls-mode for determinism: once you have candidate URLs, pass `urls=[...]` and use `url_selection_mode=query_rank`.
- Use cache to get repeatable behavior:
  - warm cache with `no_network=false, cache_write=true`
  - then evaluate with `no_network=true` to eliminate network flakiness
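For instance, a warm-then-evaluate pair might look like this (same arguments both times, only the network flags change; the URL is a placeholder). First, warm the cache:

```json
{
  "query": "installation steps",
  "urls": ["https://example.com/docs/install"],
  "url_selection_mode": "query_rank",
  "no_network": false,
  "cache_write": true,
  "max_urls": 2,
  "top_chunks": 4
}
```

Then evaluate against the cache only:

```json
{
  "query": "installation steps",
  "urls": ["https://example.com/docs/install"],
  "url_selection_mode": "query_rank",
  "no_network": true,
  "max_urls": 2,
  "top_chunks": 4
}
```

And when `warning_codes` contains `blocked_by_js_challenge`, a retry with a different backend (if configured) is just the earlier fetch-extract arguments with the backend swapped:

```json
{
  "url": "https://example.com/docs/install",
  "fetch_backend": "firecrawl",
  "include_text": true,
  "max_chars": 20000
}
```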
You can treat `ok=false` as “tool call failed” and use `error.hint` as the “next move”.
Environment variables (optional, provider-dependent):
- Brave search: `WEBPIPE_BRAVE_API_KEY` (or `BRAVE_SEARCH_API_KEY`)
- Tavily search: `WEBPIPE_TAVILY_API_KEY` (or `TAVILY_API_KEY`)
- SearXNG search (self-hosted): `WEBPIPE_SEARXNG_ENDPOINT` (or `WEBPIPE_SEARXNG_ENDPOINTS`)
- Firecrawl fetch: `WEBPIPE_FIRECRAWL_API_KEY` (or `FIRECRAWL_API_KEY`)
- Perplexity deep research: `WEBPIPE_PERPLEXITY_API_KEY` (or `PERPLEXITY_API_KEY`)
- ArXiv endpoint override (debug): `WEBPIPE_ARXIV_ENDPOINT` (default `https://export.arxiv.org/api/query`)
webpipe drops request-secret headers by default:
- `Authorization`
- `Cookie`
- `Proxy-Authorization`
If a caller supplies them, responses include a warning:
- `warning_codes` contains `unsafe_request_headers_dropped`
- `request.dropped_request_headers` lists header names only
To opt in (only for trusted endpoints), set `WEBPIPE_ALLOW_UNSAFE_HEADERS=true`.
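A hedged sketch of how this surfaces in a response (only `warning_codes` and `request.dropped_request_headers` are documented above; the surrounding shape is assumed):

```json
{
  "ok": true,
  "warning_codes": ["unsafe_request_headers_dropped"],
  "request": {
    "dropped_request_headers": ["authorization", "cookie"]
  }
}
```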
```bash
cargo test -p webpipe-mcp --features stdio
```