Sentinel

LLM decision and audit layer for cost optimization

Problem

Companies make thousands of similar LLM API calls without visibility or control, burning money on duplicate work with no way to measure or optimize it.

Solution

Sentinel sits between applications and LLM providers, deciding whether responses can be reused based on semantic similarity. Every decision is logged with full explainability.

Key Features

  • Semantic similarity matching with tunable threshold (default: 0.85)
  • Decision logging and audit trail (see the sketch after this list)
  • Cost tracking and optimization metrics
  • Provider-agnostic (works with any OpenAI-compatible API)
  • Conservative by default (prioritizes correctness over aggressive caching)
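
Each request produces an explainable decision record. The sketch below shows what one such entry might look like in the SQLite decision log; the table and column names are assumptions for illustration, not Sentinel's actual schema.

import sqlite3
import time

# Hypothetical decision-log schema; Sentinel's real column names may differ.
conn = sqlite3.connect("decisions.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp REAL,
    query TEXT,
    decision TEXT,            -- 'cache_hit', 'cache_miss', or 'bypass'
    matched_query TEXT,       -- closest cached query, if any
    similarity REAL,          -- cosine similarity of the best match
    threshold REAL,           -- threshold in effect for this request
    estimated_cost_usd REAL   -- cost avoided (hit) or incurred (miss)
)
""")

# Entry mirroring the cache hit described in the Metrics section below.
conn.execute(
    "INSERT INTO decisions (timestamp, query, decision, matched_query, "
    "similarity, threshold, estimated_cost_usd) VALUES (?, ?, ?, ?, ?, ?, ?)",
    (time.time(), "I can't remember my password", "cache_hit",
     "I forgot my password", 0.852, 0.85, 0.0),
)
conn.commit()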

Quick Start

# Install dependencies
pip install -r requirements.txt

# Start Ollama (or configure your LLM provider)
ollama serve

# Run Sentinel
python -m sentinel
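
Once Sentinel is running, any OpenAI-compatible client can point at it instead of the provider. Below is a minimal sketch using the official openai Python package; the host and port (localhost:8000) are assumptions, so substitute whatever address Sentinel is configured to listen on.

from openai import OpenAI

# Point the client at Sentinel rather than the provider.
# The base_url is an assumption; use Sentinel's actual listen address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local-ollama")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "I forgot my password"}],
)
print(response.choices[0].message.content)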

Model Configuration

Sentinel works with any OpenAI-compatible endpoint.

Local (Ollama):

export LLM_BASE_URL="http://localhost:11434/v1"
export LLM_MODEL="llama3.2:1b"

Production (OpenAI):

export LLM_BASE_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o-mini"
export LLM_API_KEY="sk-..."

The decision logic, caching, and audit layer remain identical regardless of provider.
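
As a rough sketch, the variables above could be consumed like this when building the upstream client; the exact wiring inside Sentinel may differ, and the fallback defaults simply mirror the local Ollama configuration shown above.

import os
from openai import OpenAI

# Read the documented environment variables, defaulting to local Ollama.
base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
model = os.environ.get("LLM_MODEL", "llama3.2:1b")
api_key = os.environ.get("LLM_API_KEY", "ollama")  # Ollama ignores the key

provider = OpenAI(base_url=base_url, api_key=api_key)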

Design Decisions

Threshold: 0.85. Empirically tested across the 0.80-0.95 range: at 0.90 the system missed legitimate duplicates, and at 0.80 the false-positive risk increased. 0.85 balances safety and effectiveness with clear separation from unrelated queries.

TTL: 1 hour. Cache lifetime is treated as a confidence signal and is configurable per deployment based on data freshness requirements.
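
A minimal sketch of the freshness check, assuming the cache stores a cached_at timestamp per entry (the field name is illustrative):

import time

CACHE_TTL_SECONDS = 3600  # 1-hour default; configurable per deployment

def is_fresh(entry: dict, ttl: float = CACHE_TTL_SECONDS) -> bool:
    """Return True if the cached entry is still within its lifetime."""
    return (time.time() - entry["cached_at"]) < ttl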

Never-cache keywords. Time-sensitive queries (containing "current", "now", "today", or "latest") explicitly bypass the cache regardless of similarity.
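
A sketch of the bypass rule; the keyword list comes from the description above, while the function name and matching logic are illustrative:

# Time-sensitive queries always go straight to the provider.
NEVER_CACHE_KEYWORDS = {"current", "now", "today", "latest"}

def must_bypass_cache(query: str) -> bool:
    """Return True if the query is time-sensitive and must skip the cache."""
    words = query.lower().split()
    return any(keyword in words for keyword in NEVER_CACHE_KEYWORDS)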

Metrics

Test results from 27 realistic queries:

  • Cache hit rate: 14.8%
  • Cached latency: 2.7s
  • API latency: 66s
  • Speedup: 24.8x

Example cache hit: "I can't remember my password" matched "I forgot my password" with 0.852 similarity (just above the 0.85 threshold).

Architecture

Application → Sentinel → LLM Provider
                ↓
          Decision Log (SQLite)

Request flow (sketched in code after the list):

  1. Check never-cache rules
  2. Generate embedding, search cache (similarity ≥ threshold)
  3. If hit: return cached + log decision
  4. If miss: call LLM, cache response, log decision
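
Tying the four steps together, here is a hedged sketch of the request path. It reuses must_bypass_cache and is_fresh from the sketches above; embed, call_llm, and log_decision are placeholders passed in purely for illustration, and the in-memory list stands in for Sentinel's real cache.

import time
import numpy as np

SIMILARITY_THRESHOLD = 0.85   # documented default
cache = []                    # in-memory stand-in for the real cache

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def handle_request(query, embed, call_llm, log_decision):
    # 1. Never-cache rules: time-sensitive queries skip the cache entirely.
    if must_bypass_cache(query):
        log_decision(query, decision="bypass")
        return call_llm(query)

    # 2. Embed the query and find the most similar cached entry.
    vector = embed(query)
    best = max(cache, key=lambda e: cosine(vector, e["vector"]), default=None)

    # 3. Hit: similarity clears the threshold and the entry is still fresh.
    if best and cosine(vector, best["vector"]) >= SIMILARITY_THRESHOLD and is_fresh(best):
        log_decision(query, decision="cache_hit")
        return best["response"]

    # 4. Miss: call the provider, cache the response, log the decision.
    response = call_llm(query)
    cache.append({"vector": vector, "response": response, "cached_at": time.time()})
    log_decision(query, decision="cache_miss")
    return response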

Endpoints

POST /v1/chat/completions - Proxy LLM requests with caching
GET /metrics - Cache and cost metrics
GET /health - Health check
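
A quick way to poke the metrics and health endpoints from Python; the base URL is an assumption, and the responses are assumed to be JSON:

import requests

BASE = "http://localhost:8000"  # assumption: adjust to where Sentinel is serving

print(requests.get(f"{BASE}/metrics").json())  # cache hit rate, latency, cost figures
print(requests.get(f"{BASE}/health").json())   # liveness check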

License

MIT
