
Nexus

RAG-based code documentation assistant with grounded answers and citations.

Overview

Nexus ingests a software repository and answers developer questions about the codebase. Every answer is grounded in retrieved source evidence with verifiable citations ([file:line-range]).

Key Features:

  • Ingest any local code repository into a searchable index
  • Ask natural language questions about the codebase
  • Get answers with precise file and line citations woven inline
  • Post-hoc citation validation rejects hallucinated answers
  • Structured logging across the full pipeline
  • Refuses to answer when evidence is insufficient

Quickstart

Prerequisites

  • Docker and Docker Compose
  • (Optional) NVIDIA GPU for faster local inference

Local Development

# Clone the repository
git clone <repo-url>
cd nexus

# Copy environment config
cp .env.example .env

# Start services (Ollama + ChromaDB + Nexus)
docker compose up -d

# Pull required models (first time only)
docker compose exec ollama ollama pull gpt-oss:20b
docker compose exec ollama ollama pull nomic-embed-text

# Index a repository
docker compose exec nexus nexus ingest /path/to/repo

# Ask a question
docker compose exec nexus nexus ask "Where is authentication implemented?"

Production

# Set OpenAI API key (via CI/CD or secrets manager)
export OPENAI_API_KEY=sk-...

# Start production services
docker compose -f docker-compose.prod.yml up -d

Usage

Ingest a Repository

nexus ingest /path/to/repository --collection my-project

Ask Questions

nexus ask "Where are API endpoints defined?"
nexus ask "How does the authentication flow work?"
nexus ask "What database is used and how is it configured?"

Example Output

Searching collection: my-project...
Found 6 relevant chunks
Generating answer...

The authentication flow is handled across two modules. The login function
validates credentials and issues JWT tokens [src/auth/login.py:45-92],
while session management handles token refresh and expiration
[src/auth/session.py:12-48]. Password hashing uses bcrypt
[src/auth/crypto.py:8-25].

Citations appear inline as [path/to/file.ext:start_line-end_line], directly next to the claims they support.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Docker Compose                         │
├─────────────┬─────────────┬─────────────┬──────────────────┤
│   Ollama    │  ChromaDB   │    Nexus    │   (Prod only)    │
│  LLM+Embed  │ Vector Store│     CLI     │   OpenAI API     │
└─────────────┴─────────────┴─────────────┴──────────────────┘

Ingestion Pipeline

  1. Walker (ingest/walker.py) — Traverse repository, apply .nexusignore rules
  2. Chunker (ingest/chunker.py) — Split files into chunks with line metadata
  3. Embedder (ingest/embedder.py) — Generate embeddings via Nomic Embed Text V2
  4. Index (ingest/index.py) — Store in ChromaDB with metadata
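
The four stages compose into a single pass from repository files to indexed chunks. A minimal sketch of that wiring; the function names imported here are illustrative, not the modules' actual APIs:

# Hypothetical wiring of the four ingestion stages; real signatures may differ.
from ingest.walker import walk_repository      # yields files, honouring .nexusignore
from ingest.chunker import chunk_file          # splits a file into chunks with line metadata
from ingest.embedder import embed_documents    # embeds chunk text via Nomic Embed Text V2
from ingest.index import store_chunks          # writes chunks and metadata to ChromaDB

def ingest(repo_path: str, collection: str) -> None:
    for file_path in walk_repository(repo_path):
        chunks = chunk_file(file_path)                      # each chunk keeps start_line/end_line
        embeddings = embed_documents([c.text for c in chunks])
        store_chunks(collection, chunks, embeddings)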

Query Pipeline

  1. Search (retrieval/search.py) — Embed question, retrieve top-k chunks from ChromaDB
  2. Context (retrieval/context_builder.py) — Build evidence block with citation headings
  3. Answer (llm/answer.py) — Generate grounded response via LLM
  4. Validate (llm/citation_validator.py) — Reject answer if any citation is hallucinated
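
The query side mirrors this as four calls in sequence. Again a minimal sketch with illustrative function names:

# Hypothetical wiring of the query stages; real signatures may differ.
from retrieval.search import search
from retrieval.context_builder import build_context
from llm.answer import generate_answer
from llm.citation_validator import validate_citations

def ask(question: str, collection: str) -> str:
    chunks = search(question, collection, top_k=8)       # embed question, query ChromaDB
    context = build_context(chunks)                       # evidence block with citation headings
    answer = generate_answer(question, context)
    if not validate_citations(answer, chunks):            # reject if any citation is hallucinated
        return "Insufficient evidence to answer with verified citations."
    return answer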

Technical Decisions

Detailed rationale is documented in docs/DECISIONS.md (ADR-001 through ADR-007).

Chunking Strategy

  • Code files: 300 lines with 50-line overlap (line-based, preserves structure)
  • Text/Markdown: 1000 characters with 100-character overlap
  • All chunks store start_line and end_line for precise citations
  • Trade-off: does not respect semantic boundaries (functions, classes). AST-aware chunking deferred — see Future Improvements
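
For code files, the split is a plain sliding window over lines. A minimal sketch of the approach; parameter and field names are illustrative:

def chunk_code(lines: list[str], size: int = 300, overlap: int = 50) -> list[dict]:
    """Split a file into overlapping line-based chunks that carry citation metadata."""
    chunks = []
    step = size - overlap                          # 250-line stride produces a 50-line overlap
    for start in range(0, len(lines), step):
        end = min(start + size, len(lines))
        chunks.append({
            "text": "".join(lines[start:end]),
            "start_line": start + 1,               # 1-based line numbers for [file:start-end] citations
            "end_line": end,
        })
        if end == len(lines):
            break
    return chunks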

Embedding Model

  • Nomic Embed Text V2 via Ollama — MoE architecture (475M params, 305M active), 768 dimensions
  • Requires instruction prefixes: search_document: for indexing, search_query: for retrieval
  • Omitting these prefixes significantly degrades retrieval quality
  • Local deployment, Apache 2.0 license, no external API dependency
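
A minimal sketch of how such prefixes can be applied when calling Ollama's embeddings endpoint; the endpoint is Ollama's standard API, while the wrapper itself is illustrative:

import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"        # assumed local Ollama address

def embed(text: str, *, for_query: bool = False) -> list[float]:
    # Nomic Embed Text V2 needs a task prefix; omitting it degrades retrieval quality.
    prefix = "search_query: " if for_query else "search_document: "
    response = requests.post(
        OLLAMA_URL,
        json={"model": "nomic-embed-text", "prompt": prefix + text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]                      # 768-dimensional vector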

Vector Store

  • ChromaDB in Docker — simple Python API, built-in metadata filtering
  • Persistent storage via Docker volumes
  • Trade-off: single-node limits scaling, but appropriate for current scale (thousands of chunks)
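
A minimal sketch of storing and querying chunks with the ChromaDB Python client; the host name and metadata fields are assumptions, not the project's exact schema:

import chromadb

client = chromadb.HttpClient(host="chromadb", port=8000)     # assumed service name from compose
collection = client.get_or_create_collection("my-project")

chunk_text = "def login(credentials): ..."                   # placeholder chunk text
chunk_embedding = [0.0] * 768                                 # placeholder embedding vector

# Store each chunk with the metadata needed to reconstruct citations.
collection.add(
    ids=["src/auth/login.py:45-92"],
    embeddings=[chunk_embedding],
    documents=[chunk_text],
    metadatas=[{"path": "src/auth/login.py", "start_line": 45, "end_line": 92}],
)

# Retrieve nearest chunks for an embedded question (over-fetched; see Retrieval below).
results = collection.query(query_embeddings=[chunk_embedding], n_results=16)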

LLM

  • Dev: gpt-oss:20b via Ollama (local, fits in 16GB, Apache 2.0)
  • Prod: GPT-5.2 via OpenAI API (best-in-class grounding)
  • Single codebase — switches via OPENAI_BASE_URL environment variable
  • Temperature fixed at 0.1 for focused, deterministic answers
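
A minimal sketch of that switch, assuming the standard OpenAI Python client and Ollama's OpenAI-compatible /v1 endpoint; the NEXUS_MODEL variable is illustrative:

import os
from openai import OpenAI

# Dev: OPENAI_BASE_URL=http://ollama:11434/v1 points the client at local Ollama.
# Prod: leave OPENAI_BASE_URL unset so the client targets the OpenAI API.
client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL"),
    api_key=os.environ.get("OPENAI_API_KEY", "ollama"),      # Ollama ignores the key
)

response = client.chat.completions.create(
    model=os.environ.get("NEXUS_MODEL", "gpt-oss:20b"),
    temperature=0.1,                                          # fixed low temperature for focused answers
    messages=[{"role": "user", "content": "Where is authentication implemented?"}],
)
print(response.choices[0].message.content)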

Retrieval

  • Vector-only similarity search (hybrid BM25+vectors deferred — see ADR-006)
  • Top-k=8 with max 2 chunks per file (diversity constraint)
  • L2 distance threshold of 1.5 filters irrelevant results
  • Over-fetches at 2x top_k to ensure enough results survive filtering
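
Applied to the over-fetched hits, the filtering step could look like this sketch; field names are illustrative:

def filter_results(hits: list[dict], top_k: int = 8,
                   max_per_file: int = 2, max_distance: float = 1.5) -> list[dict]:
    """Apply the distance threshold and per-file diversity cap, keeping at most top_k hits."""
    kept: list[dict] = []
    per_file: dict[str, int] = {}
    for hit in hits:                                   # hits sorted by ascending L2 distance
        if hit["distance"] > max_distance:
            continue                                   # drop irrelevant results
        path = hit["metadata"]["path"]
        if per_file.get(path, 0) >= max_per_file:
            continue                                   # at most 2 chunks per file
        per_file[path] = per_file.get(path, 0) + 1
        kept.append(hit)
        if len(kept) == top_k:
            break
    return kept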

Citation Validation (Guardrails)

  • LLM instructed to use bracketed inline citations: [path/to/file.ext:start-end]
  • Post-hoc regex parsing validates every cited file path against retrieved context
  • Full rejection policy: if any citation references a file not in context, the entire answer is rejected — because a hallucinated citation implies the associated claim is also hallucinated (ADR-007)
  • File-path-only validation (no line range checking) avoids false rejections
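
A minimal sketch of the post-hoc check; the regex and function names illustrate the approach rather than reproduce the implementation:

import re

CITATION_RE = re.compile(r"\[([^\[\]:]+):(\d+)-(\d+)\]")     # matches [path/to/file.ext:start-end]

def validate_citations(answer: str, retrieved_paths: set[str]) -> bool:
    """Return True only if every cited file path appears in the retrieved context."""
    cited_paths = {match.group(1) for match in CITATION_RE.finditer(answer)}
    # Full rejection: a single hallucinated path invalidates the whole answer (ADR-007).
    return cited_paths <= retrieved_paths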

Structured Logging

  • structlog with JSON and console output modes
  • Three logging points: search (query, results, latency), LLM call (model, latency, tokens), citation validation (pass/fail, invalid paths)
  • Configured via LOG_LEVEL and LOG_FORMAT environment variables
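
A minimal sketch of that configuration; the exact processor chain and event names may differ from the real setup:

import logging
import os
import structlog

log_format = os.environ.get("LOG_FORMAT", "console")
renderer = (structlog.processors.JSONRenderer() if log_format == "json"
            else structlog.dev.ConsoleRenderer())

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        renderer,
    ],
    wrapper_class=structlog.make_filtering_bound_logger(
        getattr(logging, os.environ.get("LOG_LEVEL", "INFO"))
    ),
)

log = structlog.get_logger()
log.info("search_completed", query="auth flow", results=6, latency_ms=142)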

Trade-offs

Decision          | Chosen                 | Alternative                 | Why
------------------|------------------------|-----------------------------|--------------------------------------------------------------------------------
Chunking          | Line-based (300 lines) | AST-aware                   | Simpler, language-agnostic. May split functions awkwardly.
Retrieval         | Vector-only            | Hybrid (BM25 + vectors)     | Simpler to implement. Keyword queries (exact function names) may underperform.
Citation handling | Full rejection         | Strip invalid, keep answer  | Safer — hallucinated citation implies hallucinated claim. May reject otherwise useful answers.
Embedding         | Local (Ollama)         | Cloud API                   | No external dependency, consistent dev/prod. Slightly lower quality than top cloud models.
Vector store      | ChromaDB               | FAISS / Milvus              | Simpler API, Docker-native. Fewer tuning options, single-node only.
LLM abstraction   | OpenAI-compatible API  | Framework (LangChain, etc.) | Minimal dependency, direct control. No built-in chains or agents.

Productionisation

To deploy Nexus on a cloud platform:

Infrastructure

  • Compute: Container orchestration (ECS/Fargate, Cloud Run, or Kubernetes) for the Nexus CLI/API
  • Vector store: Managed ChromaDB (Chroma Cloud) or migrate to a managed alternative (Pinecone, Weaviate) for durability and scaling
  • LLM: OpenAI API via API gateway with rate limiting and key rotation
  • Embeddings: Continue using Nomic Embed Text V2, either self-hosted or via cloud endpoint

Operations

  • CI/CD: GitHub Actions pipeline — lint, test, build Docker image, push to registry, deploy
  • Monitoring: Structured JSON logs piped to a log aggregator (Datadog, CloudWatch). Alert on citation_validation_failed events, high LLM latency, or elevated error rates
  • Scaling: Nexus is stateless (state lives in ChromaDB) — horizontal scaling is straightforward. ChromaDB is the bottleneck; a managed vector store solves this
  • Secrets: API keys injected via secrets manager (AWS Secrets Manager, GCP Secret Manager), never committed

Reliability

  • Health checks: /status endpoint (or extend existing nexus status command) for liveness/readiness probes
  • Retry logic: Already built into the embedder (exponential backoff). Add similar retry logic for LLM calls in production
  • Cost control: Token usage logging (already in place) enables cost attribution and budget alerting
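
The LLM retry suggested above could be as simple as this sketch; it is illustrative, not the existing embedder code:

import time

def call_with_retry(fn, *, attempts: int = 3, base_delay: float = 1.0):
    """Retry a callable with exponential backoff, re-raising after the final attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:                              # in practice, catch only transient client errors
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))    # 1s, 2s, 4s, ...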

Evaluation

An evaluation harness measures retrieval quality against the Nexus codebase:

# Run eval (requires Ollama + ChromaDB running with indexed collection)
python -m eval.run_eval --collection ai-assessment

This runs 20 questions through the full pipeline and reports evidence hit-rate: the percentage of questions where the answer cited at least one expected source file.
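
The metric itself is straightforward to compute. A sketch, assuming each eval result carries the generated answer and the list of expected evidence files (the schema here is an assumption):

import re

CITATION_RE = re.compile(r"\[([^\[\]:]+):\d+-\d+\]")

def evidence_hit_rate(results: list[dict]) -> float:
    """Fraction of questions whose answer cites at least one expected source file."""
    hits = 0
    for result in results:                             # each result: {"answer": ..., "expected_files": [...]}
        cited = {m.group(1) for m in CITATION_RE.finditer(result["answer"])}
        if cited & set(result["expected_files"]):
            hits += 1
    return hits / len(results)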

Results

Model                | Hit-rate    | Notes
---------------------|-------------|------------------------------------------------------------------------------
GPT-5.2 (OpenAI API) | 90% (18/20) | Production target. Strong citation compliance.
llama3.1:8b (Ollama) | 45% (9/20)  | Inconsistent citation formatting; needs a smaller context window (top_k=4).

Failure analysis (GPT-5.2):

  • 1 retrieval miss — app/cli.py ranks low for "CLI entrypoint" because documentation files that describe the CLI are semantically closer to the question than the code itself. This is the known hybrid retrieval gap (vector-only search misses keyword matches).
  • 1 citation validation rejection — the LLM cited an example file path (src/auth.py) found in a docstring within the retrieved context. The guardrail correctly rejected it.

See eval/questions.json for the question set and expected evidence files.

AI Tooling

This project was built with the assistance of Claude Code (Anthropic's CLI coding agent).

How AI tools were used

  • Design: Collaborative brainstorming sessions to explore approaches and trade-offs before implementation. Design documents written iteratively with human review at each section.
  • Implementation: Subagent-driven development — each task dispatched to a fresh agent with full context, followed by two-stage review (spec compliance, then code quality).
  • Testing: TDD throughout — tests written before implementation, with the agent verifying failures before writing production code.
  • Code review: Automated spec compliance and code quality review after each task.

Quality controls

  • All code reviewed by a human before committing
  • Every function has full type hints and docstrings
  • 167 automated tests covering all modules
  • ruff linting and formatting enforced
  • Manual verification against live services at each milestone

Future Improvements

With more time, the following would improve Nexus:

  • AST-aware chunking — Respect semantic boundaries (functions, classes) for popular languages. Would reduce split-function artifacts and improve retrieval precision.
  • Hybrid retrieval (BM25 + vectors) — Combine keyword search with semantic search. Would improve recall for exact-match queries like "where is authenticate called?".
  • Streaming answers — Stream LLM responses to the terminal as they're generated, rather than waiting for the complete response. Better UX for longer answers.
  • Web UI — FastAPI backend + simple frontend for browser-based Q&A. The pipeline is already structured for this (search → context → answer → validate).
  • Caching — Cache embeddings and frequent query results to reduce latency and API costs.
  • Multi-repo support — Index multiple repositories into separate collections with a unified query interface.
  • Confidence scoring — Surface the retrieval distance scores to help users gauge answer reliability.

Development

See AGENTS.md for development standards.

# Run tests
pytest

# Run linting
ruff check .

# Run evaluation
python -m eval.run_eval --collection ai-assessment

# Start services
docker compose up -d

License

MIT
