
DeepFetch Architecture

DeepFetch is a local-first MCP server that turns public-web search into evidence-rich snippets for agent runtimes.

Design Goals

  • Work cleanly with local MCP clients such as Claude Desktop, Gemini CLI, and Codex CLI.
  • Require only user-supplied KAGI_API_KEY and SCRAPFLY_API_KEY.
  • Prefer deterministic retrieval and local semantic reranking over an extra paid LLM pass.
  • Keep the default deployment model simple enough for local Docker, while still supporting a managed HTTP transport.
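
The second goal implies a startup check that the two user-supplied keys are present. The helper below is a hypothetical sketch, not code from the repository; the environment variable names come from this document, everything else is illustrative.

```python
import os

# The two secrets the server requires, per the design goals above.
REQUIRED_KEYS = ("KAGI_API_KEY", "SCRAPFLY_API_KEY")

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

A launcher could call `missing_keys()` once at startup and fail fast with a clear message instead of surfacing a cryptic upstream API error later.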

Runtime Topology

flowchart LR
    Client[Client]
    Client --> Transport[FastMCP transport]
    Transport --> Server[src/deepfetch/server.py]
    Server --> Search[internet_search]
    Server --> PDF[pdf_extract_text]

    Search --> Kagi[Kagi discovery]
    Search --> Fetch[Scrapfly extraction fan-out]
    Search --> Rank[ONNX snippet ranking]
    Search --> Keyword[Keyword fallback]
    Search --> PDF

    PDF --> URL[Public HTTPS validation and download]
    PDF --> Pages[pypdf page extraction]
    PDF --> PDFRank[Semantic or keyword page matching]

src/deepfetch/server.py is intentionally thin. It registers the two tools and hands transport startup to FastMCP. The real retrieval logic lives under src/deepfetch/search/.

Request Flow

internet_search

  1. Query Kagi for candidate URLs.
  2. Normalize hosts and keep the first candidate per host.
  3. Fetch candidate content in parallel with a bounded ThreadPoolExecutor.
  4. Prefer Scrapfly AI extraction when a supported extraction_model is supplied.
  5. Detect PDFs by URL or content type and route those candidates through pdf_extract_text.
  6. Rank snippets semantically with the shared ONNX embedder when assets are present.
  7. Fall back to keyword-centered snippets when semantic assets or semantic matches are unavailable.
  8. If the first pass does not yield enough unique hosts, issue a second Kagi query that excludes already-attempted hosts.
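
Step 2 above (normalize hosts, keep the first candidate per host) can be sketched as follows. This is an illustrative reimplementation, not the code in src/deepfetch/search/internet_search.py; the normalization rules (lowercasing, stripping a leading `www.`) are assumptions.

```python
from urllib.parse import urlparse

def dedupe_by_host(urls):
    """Keep only the first candidate URL per normalized host."""
    seen = set()
    kept = []
    for url in urls:
        # Normalize: lowercase the hostname and drop a leading "www."
        host = (urlparse(url).hostname or "").lower().removeprefix("www.")
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept
```

Deduplicating before the fetch fan-out keeps the parallel extraction budget spent on distinct sources, which also makes the step-8 retry (excluding already-attempted hosts) cheap to compute.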

pdf_extract_text

  1. Accept exactly one source: url or pdf_base64.
  2. Validate public HTTPS URLs before downloading.
  3. Read the PDF with pypdf.
  4. Extract the requested page range.
  5. Run semantic page matching when the embedder is available, otherwise use keyword matching.
  6. Return page-numbered snippets plus search-mode metadata.
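
Step 2's public-HTTPS validation might look like the sketch below. It is an assumption-laden illustration, not the helper in src/deepfetch/search/http_utils.py: it only rejects literal non-global IP addresses, and DNS resolution checks are out of scope here.

```python
import ipaddress
from urllib.parse import urlparse

def is_public_https(url):
    """Accept only https URLs whose host is not a private/loopback IP literal."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    try:
        addr = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # A DNS name rather than an IP literal; resolution-time checks
        # would be needed to fully rule out private targets.
        return True
    return addr.is_global
```

Validating before download matters because the tool also accepts `pdf_base64`; the URL path is the only one that can be steered at fetch time.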

Module Layout

  • src/deepfetch/server.py: FastMCP server creation, tool registration, transport startup.
  • src/deepfetch/search/internet_search.py: Kagi discovery, Scrapfly extraction, host dedupe, reranking, PDF routing, and response shaping.
  • src/deepfetch/search/pdf_utils.py: PDF download/decoding, page extraction, semantic and keyword matching.
  • src/deepfetch/search/http_utils.py: Safe HTTP helpers and public-URL validation.
  • src/deepfetch/search/text_utils.py: Snippet anchoring and text slicing helpers.
  • src/deepfetch/search/embedder.py: ONNX embedder loading and vector generation.

Caching and Concurrency

  • Kagi responses are cached in-process for a short TTL.
  • Scrapfly text and AI extraction responses are cached in-process for a short TTL.
  • Search extraction uses bounded parallelism so one request does not fan out without limit.
  • No external cache or database sits on the core local stdio path.
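
A short-TTL in-process cache of the kind described above can be as small as this sketch (illustrative only; the real cache's keying and TTL values are not specified in this document):

```python
import time

class TTLCache:
    """Minimal in-process cache with per-entry expiry, keyed by request parameters."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Lazily evict on read; no background sweeper needed in-process.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Because the cache lives in the server process, it disappears with the process, which is exactly the behavior the local stdio deployment wants: no external cache or database on the core path.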

Deployment Modes

Local-first

  • Transport: stdio
  • Packaging: Docker image
  • Secret model: the user injects KAGI_API_KEY and SCRAPFLY_API_KEY

This is the default because it matches how most local MCP clients launch servers today.
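
A local-first launch might look like the following. The image tag `deepfetch` is hypothetical (this document does not name a published image); the environment variable names come from the secret model above.

```shell
# Build the image locally, then run the server over stdio,
# forwarding the two required API keys from the host environment.
docker build -t deepfetch .
docker run -i --rm \
  -e KAGI_API_KEY \
  -e SCRAPFLY_API_KEY \
  deepfetch
```

The `-i` flag keeps stdin open, which the stdio transport requires when an MCP client launches the container as a subprocess.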

Managed

  • Transport: streamable-http
  • Packaging: long-running container service
  • Typical target: ECS Fargate or a similar always-on container platform

This path exists for remote-capable clients and hosted agent runtimes, but it is secondary to the local Docker workflow.

Non-Goals in the Current Codebase

  • No database-backed core retrieval path.
  • No multi-tenant credential brokering layer.
  • No Lambda-first deployment strategy.

The architectural rationale for those choices lives in ADR 0001.