OpenScout is a developer-focused Retrieval-Augmented Generation (RAG) application with a Perplexity-style architecture, MCP integration, and a Neo4j adapter.
It answers natural-language questions by searching the web, fetching and extracting page content, chunking and embedding passages, storing vectors in FAISS, retrieving the most relevant passages, reranking them, and synthesizing a concise, cited answer using a pluggable LLM adapter.
High-level components and runtime flow:
- UI (Streamlit) — accepts user queries, handles API keys (BYOK) in the sidebar, and displays chat-style answers and sources.
- MCP layer (`core/mcp`) — optional: when configured, the app calls a central MCP server for search/extract/cypher; otherwise it falls back to local SDKs (Tavily/Neo4j).
- Graph pipeline (`core/graph.py`) — orchestrates the nodes: search → fetch → index → retrieve. It returns retrieved hits for synthesis.
- Fetcher (`core/fetch.py`) — downloads pages and extracts text (httpx + trafilatura).
- Chunking + Embeddings (`core/chunk.py`, `core/embed.py`) — split text into passages and compute embeddings (OpenAI by default).
- Vector store (`core/faiss_store.py`) — FAISS index (IndexIDMap + IndexFlatIP) plus SQLite metadata for passages.
- Reranker (`core/rerank.py`) — optional cross-encoder for higher precision.
- Synthesizer (`core/synthesize.py`) — builds the prompt from the top passages and calls the selected LLM adapter (supports streaming when available).
Runtime flow: user query → search (MCP or Tavily) → fetch pages → chunk & embed → index/store → retrieve top passages → (rerank) → LLM synthesize → UI.
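The runtime flow above can be sketched end-to-end in plain Python. Every name here is a hypothetical stand-in for the real nodes in `core/graph.py`; search, fetch, retrieval, and synthesis are replaced with in-memory stubs (no Tavily, FAISS, or LLM calls), so only the shape of the data flow is illustrated:

```python
def search(query):
    # Stand-in for the Tavily/MCP search node: query -> candidate URLs.
    return ["https://example.com/a", "https://example.com/b"]

def fetch(urls):
    # Stand-in for core/fetch.py (httpx + trafilatura): url -> extracted text.
    return {u: f"Extracted text about {u}." for u in urls}

def chunk(pages, size=200):
    # Stand-in for core/chunk.py: split each page into passage-sized chunks.
    return [(url, text[i:i + size])
            for url, text in pages.items()
            for i in range(0, len(text), size)]

def retrieve(query, passages, k=3):
    # Stand-in for embed + FAISS search: crude word-overlap score
    # instead of inner-product similarity over embeddings.
    q = set(query.lower().split())
    scored = sorted(passages,
                    key=lambda p: -len(q & set(p[1].lower().split())))
    return scored[:k]

def synthesize(query, hits):
    # Stand-in for core/synthesize.py: list cited sources instead of
    # calling an LLM adapter.
    sources = sorted({url for url, _ in hits})
    return f"Answer to '{query}' based on: " + ", ".join(sources)

def run(query):
    # The pipeline wiring: search -> fetch -> chunk -> retrieve -> synthesize.
    pages = fetch(search(query))
    passages = chunk(pages)
    return synthesize(query, retrieve(query, passages))
```

The real pipeline adds embedding, FAISS indexing with SQLite metadata, and optional reranking between `retrieve` and `synthesize`, but the data handed from node to node follows this shape.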
Key files and their purposes:
- `app.py` — Streamlit entrypoint; UI, BYOK handling, per-provider key tests, chat history, and orchestration of the graph.
- `core/graph.py` — LangGraph state graph wiring the main pipeline nodes (search, fetch, index, retrieve) and binding the synthesizer.
- `core/mcp/adapters.py` — MCPTools adapter: calls a remote MCP server when configured (`MCP_URL`) or falls back to local SDKs (Tavily/Neo4j).
- `core/search.py` — local Tavily search wrapper with clearer error messages.
- `core/fetch.py` — async fetcher using httpx and content extraction via trafilatura.
- `core/chunk.py` — text chunking logic for splitting pages into passage-sized chunks.
- `core/embed.py` — OpenAI embedding wrapper (accepts an explicit key or falls back to the `OPENAI_API_KEY` environment variable).
- `core/faiss_store.py` — FAISS index management (creates/wraps IndexIDMap) and SQLite metadata storage for chunks.
- `core/rerank.py` — cross-encoder-based reranker using sentence-transformers (optional).
- `core/llm/` — LLM adapters and registry (`openai_llm.py`, `anthropic_llm.py`, `gemini_llm.py`, `groq_llm.py`, `registry.py`).
- `core/synthesize.py` — builds the prompt from retrieved passages and performs synthesis via the LLM adapter.
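As one concrete example of the pieces above, a character-window splitter of the kind `core/chunk.py` might implement can be sketched as follows. The function name and parameters are illustrative, not the module's actual API:

```python
def chunk_text(text, max_chars=800, overlap=100):
    """Split text into fixed-size windows with overlap, so adjacent
    passages share context across the boundary (a common RAG pattern)."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be greater than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # step forward, keeping the overlap
    return chunks
```

Each chunk then gets embedded and stored in FAISS alongside its SQLite metadata row, so retrieval can map a vector hit back to its source page.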
Quick steps to run OpenScout locally (PowerShell commands). These instructions assume you have Python 3.10+ installed.
Tip: For interactive local use, paste your LLM and provider API keys into the Streamlit sidebar when the app runs — this keeps secrets out of your repo. Creating a .env is optional and only recommended for persistent local defaults or CI.
- Clone or download the repository:

```powershell
git clone https://github.com/Ujjwal-Bajpayee/OpenScope.git
cd OpenScout
```

- Create and activate a virtual environment, then install dependencies. If you already have a `requirements.txt`, use it; otherwise create one from your environment.

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
Note: the first query may download a ~90 MB model.

- Obtain credentials:
  - Neo4j: URI, username, and password (if you plan to use Neo4j for graph/cypher operations).
  - Tavily: `TAVILY_API_KEY` for web search/enrichment (or configure an MCP server instead).
  - LLM provider key: one of `OPENAI_API_KEY`, `GROQ_API_KEY`, `ANTHROPIC_API_KEY`, or `GOOGLE_API_KEY` (Gemini).
- Create a `.env` file in the project root with the credentials (example below) — OR, for interactive use, paste keys into the Streamlit sidebar fields at runtime (preferred for local testing). The sidebar stores keys in-session; `.env` is optional and mainly useful for CI or when you want persistent local defaults.
```text
TAVILY_API_KEY=tvly...
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=yourpassword
OPENAI_API_KEY=sk-...  # or GROQ_API_KEY=grq-..., ANTHROPIC_API_KEY=..., GOOGLE_API_KEY=...
# Optional: MCP_URL and MCP_API_KEY if using a remote MCP server
# MCP_URL=https://your-mcp.server
# MCP_API_KEY=...
```
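The BYOK pattern described above (sidebar key first, `.env`/environment fallback second) can be sketched as a small helper. The function name is hypothetical, not OpenScout's actual API, but it mirrors the fallback behavior noted for `core/embed.py`:

```python
import os

def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY"):
    """Prefer a key supplied explicitly (e.g. pasted in the sidebar),
    fall back to the environment (populated from .env), and fail
    loudly if neither source provides one."""
    key = explicit_key or os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {env_var}"
        )
    return key
```

Keeping the explicit key first means a sidebar entry always overrides whatever happens to be in the environment.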
- Start the Streamlit app:

```powershell
streamlit run .\app.py
```

- In the app sidebar:
- Paste any missing keys into the corresponding fields (they will be stored in-session).
- Optionally click the per-provider "Test key" buttons to validate connectivity.
- Ask a question in the chat input. The app will search, fetch pages, index passages, and synthesize a cited answer.
Notes and troubleshooting:
- If you see import errors for packages like `faiss`, `torch`, or `tavily`, install the correct OS-specific wheels, or use `faiss-cpu` for most local dev setups.
- If the FAISS index fails to load due to an index-type mismatch, delete the existing `faiss_index.bin` to allow a rebuild, or run a migration script (not included).
- Keep secrets out of commits: ensure `.env` is listed in `.gitignore`.
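For the FAISS rebuild step, a tiny helper along these lines (hypothetical, not part of the repo) deletes the stale index so the next query recreates it:

```python
from pathlib import Path

def reset_index(index_path="faiss_index.bin"):
    """Delete the on-disk FAISS index if present, forcing the app to
    rebuild it from scratch on the next query. Returns True if a file
    was actually removed."""
    p = Path(index_path)
    if p.exists():
        p.unlink()   # remove the stale/incompatible index
        return True
    return False     # nothing to delete
```

If the SQLite metadata and the FAISS index are kept in separate files, check whether both need clearing so passage IDs stay in sync after the rebuild.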



