Use RAG to provide best practices context from official docs #73

@felipefernandes

Description

🎯 Goal

Provide the LLM with context from official documentation via RAG (retrieval-augmented generation) to improve review accuracy and reduce false positives.

📊 Complexity

Long Term (3-5 days)

🔍 Problem

The LLM doesn't have access to:

  • GitHub Actions best practices documentation
  • Python security guidelines (OWASP)
  • Language-specific conventions
  • Framework-specific patterns

This leads to incorrect assumptions and false positives.

✅ Solution

Architecture

┌─────────────────┐
│  Review Request │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────┐
│ Query Analyzer  │────▶│ RAG Document DB  │
│ (extract topic) │     │ (LanceDB)        │
└────────┬────────┘     └──────────────────┘
         │                       │
         │              ┌────────▼─────────┐
         │              │ Relevant Docs:   │
         │              │ - GH Actions     │
         │              │ - OWASP Python   │
         └──────────────▶ - Error Handling │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │  LLM Review      │
                        │  + Context       │
                        └──────────────────┘

Implementation

1. Document Indexing

# iara/memory/docs_indexer.py

OFFICIAL_DOCS = {
    "github_actions": "https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions",
    "python_security": "https://owasp.org/www-project-top-ten/",
    "python_error_handling": "https://docs.python.org/3/tutorial/errors.html",
    # ... more sources
}

def index_official_docs():
    """Download and index official documentation."""
    for name, url in OFFICIAL_DOCS.items():
        content = fetch_and_parse(url)
        chunks = chunk_document(content)
        store_in_lancedb(name, chunks)
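
The parsing and chunking helpers above are left undefined. A minimal sketch of what they might look like follows, assuming fixed-size character windows with overlap and a standard-library HTML-to-text pass. The helper name `html_to_text` and the `max_chars` and `overlap` parameters are assumptions for illustration, not part of the existing codebase.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Hypothetical helper: reduce an HTML page to newline-separated text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def chunk_document(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of max_chars that overlap by `overlap` characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    text = text.strip()
    step = max_chars - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. Splitting on headings or paragraphs may work better for structured docs.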

2. Context Retrieval

# iara/reviewer.py

def review_code(diff, api_key, config):
    # Extract topics from diff
    topics = extract_topics(diff)  # e.g., ["github_actions", "security"]
    
    # Retrieve relevant docs
    context_docs = retrieve_docs(topics, top_k=3)
    
    # Add to system prompt
    system_prompt = generate_system_prompt(config, context_docs)
    
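    # Review with enhanced context.
    # Assumption: config is a mapping that carries the model and provider names.
    model = config["model"]
    provider = config["provider"]
    return review_code_with_model(diff, api_key, model, system_prompt, provider)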
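
To make the retrieval step concrete, here is a rough in-memory stand-in for `retrieve_docs`. It filters chunks by topic and ranks them by word overlap with a query. This is only a sketch: the real implementation would presumably run a vector search against LanceDB, and the `DocChunk` type plus the extra `index` and `query` parameters are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocChunk:
    source: str  # assumed to be a key of OFFICIAL_DOCS, e.g. "github_actions"
    text: str


def retrieve_docs(topics, index, query="", top_k=3):
    """Return up to top_k chunks from matching sources, best word overlap first."""
    wanted = set(topics)
    query_words = set(query.lower().split())
    candidates = [chunk for chunk in index if chunk.source in wanted]

    def score(chunk: DocChunk) -> int:
        # Count the distinct query words that also appear in the chunk
        return len(query_words & set(chunk.text.lower().split()))

    # sorted() is stable, so chunks with equal scores keep their index order
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Filtering by source first keeps unrelated documentation out of the prompt even when it shares vocabulary with the diff.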

3. Topic Extraction

def extract_topics(diff):
    """Extract topics from diff for targeted doc retrieval."""
    topics = set()
    
    if ".github/workflows" in diff:
        topics.add("github_actions")
    if "os.chmod" in diff or "secrets" in diff:
        topics.add("security")
    if "try:" in diff and "except" in diff:
        topics.add("error_handling")
    
    return list(topics)
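
Substring checks over the whole diff also match removed lines and unrelated text. A possible refinement, sketched below, looks only at file headers and added lines, then maps topics to `OFFICIAL_DOCS` keys. The function names `extract_topics_from_diff` and `doc_keys_for` and the `TOPIC_TO_DOCS` mapping are hypothetical, and the sketch assumes unified diff input.

```python
import re

# Hypothetical mapping from detected topics to OFFICIAL_DOCS keys
TOPIC_TO_DOCS = {
    "github_actions": ["github_actions"],
    "security": ["python_security"],
    "error_handling": ["python_error_handling"],
}

_EXCEPT_RE = re.compile(r"\s*except\b")


def extract_topics_from_diff(diff: str) -> list[str]:
    """Detect topics from '+++' file headers and added ('+') lines only."""
    topics = set()
    for line in diff.splitlines():
        if line.startswith("+++ "):
            # File header of the new version, e.g. "+++ b/.github/workflows/ci.yml"
            if ".github/workflows/" in line:
                topics.add("github_actions")
        elif line.startswith("+"):
            added = line[1:]
            if "os.chmod" in added or "secrets" in added:
                topics.add("security")
            if _EXCEPT_RE.match(added):
                topics.add("error_handling")
    return sorted(topics)


def doc_keys_for(topics: list[str]) -> list[str]:
    """Translate topics into documentation keys, preserving order without duplicates."""
    keys: list[str] = []
    for topic in topics:
        for key in TOPIC_TO_DOCS.get(topic, []):
            if key not in keys:
                keys.append(key)
    return keys
```

Returning a sorted list makes the output deterministic, which helps when comparing runs during testing.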

📝 Implementation Steps

  1. Create iara/memory/docs_indexer.py
  2. Define curated list of official documentation sources
  3. Implement document fetching and chunking
  4. Index docs into LanceDB (reuse existing RAG infrastructure)
  5. Add topic extraction from diffs
  6. Implement context retrieval in reviewer.py
  7. Update system prompt to include doc context
  8. Add caching to avoid re-indexing
  9. Test with known false positive cases
  10. Measure improvement in accuracy
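
Step 8 could be approached with a content-hash manifest, so a source is re-indexed only when its fetched text changes. The sketch below assumes a JSON manifest file; all function names and the manifest layout are illustrative assumptions rather than existing project code.

```python
import hashlib
import json
from pathlib import Path


def content_hash(content: str) -> str:
    """Stable fingerprint of a fetched document."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def load_manifest(path: Path) -> dict[str, str]:
    """Read the name -> hash manifest, or return an empty one if it does not exist."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def needs_reindex(name: str, content: str, manifest: dict[str, str]) -> bool:
    """True when the source is new or its content differs from the indexed version."""
    return manifest.get(name) != content_hash(content)


def record_indexed(name: str, content: str, manifest: dict[str, str], path: Path) -> None:
    """Store the hash of the indexed content and persist the manifest."""
    manifest[name] = content_hash(content)
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
```

This still fetches every page on each run. It only skips the chunking and embedding work, which is likely the more expensive part.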

🎁 Expected Impact

  • Target: 60-80% reduction in false positives (an estimate, to be confirmed by the measurement in step 10)
  • LLM decisions backed by official sources
  • More authoritative and trustworthy reviews
  • Can cite specific documentation
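
One way to check the reduction figure is to label each reviewer finding on a fixed set of diffs as a false positive or not, with and without doc context, and compare the rates. The helper names below are assumptions, and this is a sketch of the arithmetic only.

```python
def false_positive_rate(labels: list[bool]) -> float:
    """labels[i] is True when finding i was judged a false positive."""
    if not labels:
        return 0.0
    return sum(labels) / len(labels)


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which the false positive rate dropped relative to the baseline."""
    if baseline_rate == 0:
        raise ValueError("baseline rate is zero; relative reduction is undefined")
    return (baseline_rate - new_rate) / baseline_rate
```

A result of 0.6 to 0.8 from `relative_reduction` would correspond to the 60-80% figure above. The same diffs should be used for both runs so the rates are comparable.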

⚠️ Challenges

  • Keeping docs up to date
  • Balancing context size vs. relevance
  • Initial indexing time
  • Storage requirements
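
The context size trade-off could be handled with a simple budget: keep the highest-ranked chunks until a character limit is reached. The sketch below assumes the chunks arrive already ranked by relevance. The names `fit_to_budget` and `build_context_block` and the default limit are illustrative assumptions.

```python
def fit_to_budget(ranked_chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order, stopping at the first one that would exceed the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


def build_context_block(ranked_chunks: list[str], max_chars: int = 4000) -> str:
    """Format the chunks that fit as a numbered section for the system prompt."""
    kept = fit_to_budget(ranked_chunks, max_chars)
    if not kept:
        return ""
    body = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(kept, start=1))
    return "Relevant official documentation:\n\n" + body
```

Numbering the chunks gives the model something to cite in its review. A token-based budget would be more precise than characters if a tokenizer is available.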

🔗 Related


Classified as long term because of the infrastructure requirements and the documentation curation effort.
