vinay9986/DeepFetch
DeepFetch

DeepFetch is a semantic web search MCP server for Claude Desktop, Gemini CLI, Codex CLI, and other Model Context Protocol clients. It combines Kagi discovery, Scrapfly extraction, local ONNX reranking, and PDF-aware retrieval so agents get evidence-rich snippets instead of raw link lists.

Search terms: MCP search server, semantic web search, Model Context Protocol, Claude Desktop search tool, Gemini CLI MCP server, Codex CLI search, PDF search, Kagi, Scrapfly, ONNX reranking.

Why DeepFetch

  • Return evidence, not just URLs. internet_search reranks extracted page content and returns the strongest snippets from unique domains.
  • Stay compatible with the MCP clients people already use. The default path is local stdio, which works well for Docker-based local servers.
  • Handle PDFs as first-class sources. Search results that resolve to PDFs are routed through the PDF pipeline automatically, and pdf_extract_text is available when you already know the document URL.
  • Keep the deployment path simple. End users only need Docker plus KAGI_API_KEY and SCRAPFLY_API_KEY.
  • Preserve a clean upgrade path. The same FastMCP server can also run over streamable-http for managed deployments.
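The "strongest snippets from unique domains" behavior can be sketched in a few lines. This is an illustrative sketch, not DeepFetch's actual implementation: `top_snippets_by_domain` and the tuple shape are invented here, and the reranking scores are assumed to come from an upstream scoring model such as the local ONNX reranker.

```python
from urllib.parse import urlparse

def top_snippets_by_domain(scored, k=5):
    """Keep the highest-scoring snippet per host, then return the
    top-k across hosts. `scored` is a list of (url, snippet, score)
    tuples; scoring is assumed to happen upstream."""
    best = {}  # host -> (url, snippet, score)
    for url, snippet, score in scored:
        host = urlparse(url).netloc.lower()
        if host not in best or score > best[host][2]:
            best[host] = (url, snippet, score)
    # Rank the per-domain winners by score, highest first.
    ranked = sorted(best.values(), key=lambda t: t[2], reverse=True)
    return ranked[:k]

results = top_snippets_by_domain([
    ("https://a.example/page1", "weak match", 0.31),
    ("https://a.example/page2", "strong match", 0.87),
    ("https://b.example/doc", "medium match", 0.55),
])
```

The effect is that two strong hits from the same site never crowd out a distinct source, which is what makes the output useful as evidence rather than a link list.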

Quick Start

Run the server locally in Docker:

docker run --rm -i \
  -e KAGI_API_KEY=your_kagi_key \
  -e SCRAPFLY_API_KEY=your_scrapfly_key \
  ghcr.io/vinay9986/deepfetch:latest

Then point your MCP client at the containerized server using one of the config examples in examples/clients.
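For orientation, a typical stdio client entry looks like the sketch below (shaped like Claude Desktop's `claude_desktop_config.json`). Treat this as illustrative: key names beyond the common `command`/`args`/`env` pattern vary by client, so the files in examples/clients are the authoritative versions.

```json
{
  "mcpServers": {
    "deepfetch": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "-e", "KAGI_API_KEY",
        "-e", "SCRAPFLY_API_KEY",
        "ghcr.io/vinay9986/deepfetch:latest"
      ],
      "env": {
        "KAGI_API_KEY": "your_kagi_key",
        "SCRAPFLY_API_KEY": "your_scrapfly_key"
      }
    }
  }
}
```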

For maintainers: .github/workflows/publish-image.yml builds and publishes the multi-arch image to ghcr.io/vinay9986/deepfetch via GitHub Actions on pushes to the default branch and on release tags.

If you want to test the repo before publishing an image, build it locally and use the direct MCP smoke client from docs/getting-started.md.

Tool Surface

Tool             | Purpose                                                  | Best fit
internet_search  | Discover, fetch, and rerank current public-web content.  | Time-sensitive facts, current events, source-backed lookup, and public web research.
pdf_extract_text | Extract text and page-numbered matches from a known PDF. | Reports, filings, papers, manuals, and PDF verification workflows.
Architecture Snapshot

flowchart LR
    Client[Claude Desktop / Gemini CLI / Codex CLI / other MCP client]
    Client --> Transport[FastMCP transport<br/>stdio or streamable-http]
    Transport --> Search[internet_search]
    Transport --> PDF[pdf_extract_text]
    Search --> Kagi[Kagi search API]
    Search --> Scrapfly[Scrapfly extraction]
    Search --> ONNX[Local ONNX embedder]
    Search --> PDF
    PDF --> PyPDF[pypdf page extraction]
    PDF --> ONNX

At runtime, DeepFetch deduplicates hosts before scraping, uses bounded parallel fetches, keeps short-lived in-process caches for Kagi and Scrapfly responses, and falls back to keyword snippets when semantic assets are unavailable.
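The short-lived in-process cache mentioned above can be pictured as a minimal TTL map. This is a sketch of the general technique only: the class name, TTL value, and fake-clock demo are invented here, and DeepFetch's real cache for Kagi and Scrapfly responses may differ in shape and lifetime.

```python
import time

class TTLCache:
    """Minimal short-lived in-process cache keyed by request.
    Entries expire after `ttl_seconds`; expired entries read as misses."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

# Deterministic demo using a fake clock instead of real elapsed time.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("kagi:some-query", {"results": ["..."]})
hit = cache.get("kagi:some-query")   # fresh entry -> returned
now[0] = 61.0                        # advance past the TTL
miss = cache.get("kagi:some-query")  # expired -> None
```

A cache like this keeps repeated lookups within one session from burning Kagi or Scrapfly quota, while the short TTL keeps time-sensitive answers fresh.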

Docs Map

  • Getting Started: Docker, local smoke testing, source installs, and test commands.
  • Architecture: transport model, request flow, module layout, and deployment choices.
  • Configuration: environment variables, transport knobs, semantic asset paths, and client configs.
  • Tool Examples: concrete call_tool payloads for both exposed tools.
  • ADR 0001: local-first, multi-transport rationale.

Status

DeepFetch currently exposes two MCP tools from src/deepfetch/server.py:

  • internet_search
  • pdf_extract_text

The primary distribution model is a Docker image with ONNX assets baked in. Local stdio is the default mode, and DEEPFETCH_TRANSPORT=streamable-http enables the managed deployment path.
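Switching to the managed path is a matter of setting that variable when the container starts. The port mapping below is illustrative only; the actual listen address and port knobs are covered in the Configuration doc.

```shell
docker run --rm \
  -e KAGI_API_KEY=your_kagi_key \
  -e SCRAPFLY_API_KEY=your_scrapfly_key \
  -e DEEPFETCH_TRANSPORT=streamable-http \
  -p 8000:8000 \
  ghcr.io/vinay9986/deepfetch:latest
```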
