LLM decision and audit layer for cost optimization
Companies make thousands of similar LLM API calls without visibility or control, burning money on duplicate work with no way to measure or optimize it.
Sentinel sits between applications and LLM providers, deciding whether responses can be reused based on semantic similarity. Every decision is logged with full explainability.
- Semantic similarity matching with tunable threshold (default: 0.85)
- Decision logging and audit trail
- Cost tracking and optimization metrics
- Provider-agnostic (works with any OpenAI-compatible API)
- Conservative by default (prioritizes correctness over aggressive caching)
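Because Sentinel speaks the OpenAI API, applications can point an existing client at it with no code changes beyond the base URL. A minimal sketch using the official `openai` Python package; the Sentinel address and port here are assumptions, not documented defaults:

```python
from openai import OpenAI

# Point the client at Sentinel instead of the provider directly.
# The port and the placeholder API key are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "I forgot my password"}],
)
print(response.choices[0].message.content)
```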
# Install dependencies
pip install -r requirements.txt
# Start Ollama (or configure your LLM provider)
ollama serve
# Run Sentinel
python -m sentinel

Sentinel works with any OpenAI-compatible endpoint.
Local (Ollama):
export LLM_BASE_URL="http://localhost:11434/v1"
export LLM_MODEL="llama3.2:1b"

Production (OpenAI):
export LLM_BASE_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o-mini"
export LLM_API_KEY="sk-..."

The decision logic, caching, and audit layer remain identical.
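On the Sentinel side, the same environment variables can be read once at startup to build the upstream client. A sketch of that wiring; the fallback defaults mirror the local example above and are otherwise assumptions:

```python
import os
from openai import OpenAI

# Environment-driven provider configuration (defaults here are assumptions).
LLM_BASE_URL = os.getenv("LLM_BASE_URL", "http://localhost:11434/v1")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.2:1b")
LLM_API_KEY = os.getenv("LLM_API_KEY", "ollama")  # Ollama ignores the key

# A single upstream client; decision logic, caching, and auditing stay provider-agnostic.
upstream = OpenAI(base_url=LLM_BASE_URL, api_key=LLM_API_KEY)
```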
Threshold: 0.85. Empirically tested across the 0.80-0.95 range: at 0.90 the system missed legitimate duplicates, and at 0.80 the false-positive risk increased. 0.85 balances safety and effectiveness while keeping clear separation from unrelated queries.
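A minimal sketch of the similarity gate, assuming embeddings are compared with cosine similarity (the function names are illustrative):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # tunable; see rationale above

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_semantic_match(query_emb: np.ndarray, cached_emb: np.ndarray) -> bool:
    # Reuse a cached response only when similarity clears the threshold.
    return cosine_similarity(query_emb, cached_emb) >= SIMILARITY_THRESHOLD
```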
TTL: 1 hour. Cache lifetime is treated as a confidence signal and is configurable per deployment based on data-freshness requirements.
Never-cache keywords. Time-sensitive queries ("current", "now", "today", "latest") explicitly bypass the cache regardless of similarity.
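Both freshness rules can be expressed as cheap pre-checks that run before any similarity lookup. A sketch with illustrative names; the keyword list mirrors the examples above and the TTL matches the 1-hour default:

```python
import time

CACHE_TTL_SECONDS = 3600  # 1 hour default, configurable per deployment
NEVER_CACHE_KEYWORDS = {"current", "now", "today", "latest"}

def bypasses_cache(query: str) -> bool:
    """Time-sensitive queries skip the cache regardless of similarity."""
    words = query.lower().split()
    return any(keyword in words for keyword in NEVER_CACHE_KEYWORDS)

def is_fresh(cached_at: float, ttl: float = CACHE_TTL_SECONDS) -> bool:
    """A cached entry is only reusable within its TTL window."""
    return (time.time() - cached_at) < ttl
```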
Test results from 27 realistic queries:
- Cache hit rate: 14.8%
- Cached latency: 2.7s
- API latency: 66s
- Speedup: 24.8x
Example cache hit: "I can't remember my password" matched "I forgot my password" with 0.852 similarity (just above the 0.85 threshold).
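Scores in this range can be reproduced with any sentence-embedding model. A sketch using `sentence-transformers`; the model choice is an assumption, so the number will not be exactly 0.852:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; Sentinel's actual model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("I can't remember my password", convert_to_tensor=True)
b = model.encode("I forgot my password", convert_to_tensor=True)
print(f"similarity: {util.cos_sim(a, b).item():.3f}")
```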
Application → Sentinel → LLM Provider
                  ↓
           Decision Log (SQLite)
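The decision log lives in SQLite. A plausible shape for it is sketched below; the table name, columns, and file name are assumptions, not the actual schema:

```python
import sqlite3

# Hypothetical schema: enough context to explain every decision after the fact.
SCHEMA = """
CREATE TABLE IF NOT EXISTS decisions (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    ts          REAL NOT NULL,   -- unix timestamp of the request
    query       TEXT NOT NULL,   -- incoming prompt
    decision    TEXT NOT NULL,   -- 'cache_hit', 'cache_miss', or 'bypass'
    similarity  REAL,            -- best match score, if a lookup was made
    matched_id  INTEGER,         -- id of the reused cache entry, if any
    latency_ms  REAL             -- end-to-end latency for this request
);
"""

conn = sqlite3.connect("sentinel.db")  # file name is an assumption
conn.executescript(SCHEMA)
conn.close()
```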
Request flow (see the sketch after this list):
- Check never-cache rules
- Generate embedding, search cache (similarity ≥ threshold)
- If hit: return cached + log decision
- If miss: call LLM, cache response, log decision
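Putting these steps together, the proxy logic reduces to a small decision function. This is an illustrative sketch that reuses the helpers from the earlier snippets; `cache`, `log`, `upstream`, and `embed` are hypothetical objects, not Sentinel's actual internals:

```python
def handle_request(query: str, cache, log, upstream, embed) -> str:
    """Sketch of the request flow: bypass check, semantic lookup, LLM fallback."""
    # 1. Never-cache rules win over everything else.
    if bypasses_cache(query):
        response = upstream.complete(query)          # hypothetical provider call
        log.record(query, decision="bypass")
        return response

    # 2. Embed the query and search the cache for a semantic match.
    query_emb = embed(query)
    hit = cache.search(query_emb, threshold=SIMILARITY_THRESHOLD)  # best match or None

    # 3. Cache hit: reuse the stored response and log why.
    if hit is not None and is_fresh(hit.cached_at):
        log.record(query, decision="cache_hit", similarity=hit.similarity)
        return hit.response

    # 4. Cache miss: call the LLM, store the result, log the decision.
    response = upstream.complete(query)
    cache.store(query_emb, response)
    log.record(query, decision="cache_miss")
    return response
```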
- POST /v1/chat/completions - Proxy LLM requests with caching
- GET /metrics - Cache and cost metrics
- GET /health - Health check
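For example, the operational endpoints can be polled like this (the Sentinel address is an assumption, and the metrics response fields are not documented here):

```python
import requests

BASE = "http://localhost:8000"  # assumed Sentinel address

# Cache and cost metrics.
print(requests.get(f"{BASE}/metrics").json())

# Liveness check.
print(requests.get(f"{BASE}/health").status_code)
```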
MIT