Sentinel

LLM decision and audit layer for cost optimization

Problem

Companies make thousands of similar LLM API calls without visibility or control, burning money on duplicate work with no way to measure or optimize it.

Solution

Sentinel sits between applications and LLM providers, deciding whether responses can be reused based on semantic similarity. Every decision is logged with full explainability.

Key Features

  • Semantic similarity matching with tunable threshold (default: 0.85)
  • Decision logging and audit trail (see the sketch after this list)
  • Cost tracking and optimization metrics
  • Provider-agnostic (works with any OpenAI-compatible API)
  • Conservative by default (prioritizes correctness over aggressive caching)
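
Each request produces an explainable decision record. The sketch below shows what one such entry might look like in the SQLite decision log; the table and column names are assumptions for illustration, not Sentinel's actual schema.

import sqlite3
import time

# Hypothetical decision-log schema; Sentinel's real column names may differ.
conn = sqlite3.connect("decisions.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp REAL,
    query TEXT,
    decision TEXT,            -- 'cache_hit', 'cache_miss', or 'bypass'
    matched_query TEXT,       -- closest cached query, if any
    similarity REAL,          -- cosine similarity of the best match
    threshold REAL,           -- threshold in effect for this request
    estimated_cost_usd REAL   -- cost avoided (hit) or incurred (miss)
)
""")

# Entry mirroring the cache hit described in the Metrics section below.
conn.execute(
    "INSERT INTO decisions (timestamp, query, decision, matched_query, "
    "similarity, threshold, estimated_cost_usd) VALUES (?, ?, ?, ?, ?, ?, ?)",
    (time.time(), "I can't remember my password", "cache_hit",
     "I forgot my password", 0.852, 0.85, 0.0),
)
conn.commit()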

Quick Start

# Install dependencies
pip install -r requirements.txt

# Start Ollama (or configure your LLM provider)
ollama serve

# Run Sentinel
python -m sentinel
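
Once Sentinel is running, any OpenAI-compatible client can point at it instead of the provider. Below is a minimal sketch using the official openai Python package; the host and port (localhost:8000) are assumptions, so substitute whatever address Sentinel is configured to listen on.

from openai import OpenAI

# Point the client at Sentinel rather than the provider.
# The base_url is an assumption; use Sentinel's actual listen address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local-ollama")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "I forgot my password"}],
)
print(response.choices[0].message.content)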

Model Configuration

Sentinel works with any OpenAI-compatible endpoint.

Local (Ollama):

export LLM_BASE_URL="http://localhost:11434/v1"
export LLM_MODEL="llama3.2:1b"

Production (OpenAI):

export LLM_BASE_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o-mini"
export LLM_API_KEY="sk-..."

The decision logic, caching, and audit layer remain identical regardless of provider.
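
As a rough sketch, the variables above could be consumed like this when building the upstream client; the exact wiring inside Sentinel may differ, and the fallback defaults simply mirror the local Ollama configuration shown above.

import os
from openai import OpenAI

# Read the documented environment variables, defaulting to local Ollama.
base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
model = os.environ.get("LLM_MODEL", "llama3.2:1b")
api_key = os.environ.get("LLM_API_KEY", "ollama")  # Ollama ignores the key

provider = OpenAI(base_url=base_url, api_key=api_key)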

Design Decisions

Threshold: 0.85. Empirically tested across the 0.80-0.95 range: at 0.90 the system missed legitimate duplicates, and at 0.80 the false-positive risk increased. 0.85 balances safety and effectiveness with clear separation from unrelated queries.

TTL: 1 hour. Cache lifetime is treated as a confidence signal and is configurable per deployment based on data freshness requirements.
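
A minimal sketch of the freshness check, assuming the cache stores a cached_at timestamp per entry (the field name is illustrative):

import time

CACHE_TTL_SECONDS = 3600  # 1-hour default; configurable per deployment

def is_fresh(entry: dict, ttl: float = CACHE_TTL_SECONDS) -> bool:
    """Return True if the cached entry is still within its lifetime."""
    return (time.time() - entry["cached_at"]) < ttl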

Never-cache keywords. Time-sensitive queries (containing "current", "now", "today", or "latest") explicitly bypass the cache regardless of similarity.
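
A sketch of the bypass rule; the keyword list comes from the description above, while the function name and matching logic are illustrative:

# Time-sensitive queries always go straight to the provider.
NEVER_CACHE_KEYWORDS = {"current", "now", "today", "latest"}

def must_bypass_cache(query: str) -> bool:
    """Return True if the query is time-sensitive and must skip the cache."""
    words = query.lower().split()
    return any(keyword in words for keyword in NEVER_CACHE_KEYWORDS)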

Metrics

Test results from 27 realistic queries:

  • Cache hit rate: 14.8%
  • Cached latency: 2.7s
  • API latency: 66s
  • Speedup: 24.8x

Example cache hit: "I can't remember my password" matched "I forgot my password" with 0.852 similarity (just above the 0.85 threshold).

Architecture

Application → Sentinel → LLM Provider
                ↓
          Decision Log (SQLite)

Request flow (sketched in code after the list):

  1. Check never-cache rules
  2. Generate embedding, search cache (similarity ≥ threshold)
  3. If hit: return cached + log decision
  4. If miss: call LLM, cache response, log decision
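
Tying the four steps together, here is a hedged sketch of the request path. It reuses must_bypass_cache and is_fresh from the sketches above; embed, call_llm, and log_decision are placeholders passed in purely for illustration, and the in-memory list stands in for Sentinel's real cache.

import time
import numpy as np

SIMILARITY_THRESHOLD = 0.85   # documented default
cache = []                    # in-memory stand-in for the real cache

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def handle_request(query, embed, call_llm, log_decision):
    # 1. Never-cache rules: time-sensitive queries skip the cache entirely.
    if must_bypass_cache(query):
        log_decision(query, decision="bypass")
        return call_llm(query)

    # 2. Embed the query and find the most similar cached entry.
    vector = embed(query)
    best = max(cache, key=lambda e: cosine(vector, e["vector"]), default=None)

    # 3. Hit: similarity clears the threshold and the entry is still fresh.
    if best and cosine(vector, best["vector"]) >= SIMILARITY_THRESHOLD and is_fresh(best):
        log_decision(query, decision="cache_hit")
        return best["response"]

    # 4. Miss: call the provider, cache the response, log the decision.
    response = call_llm(query)
    cache.append({"vector": vector, "response": response, "cached_at": time.time()})
    log_decision(query, decision="cache_miss")
    return response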

Endpoints

POST /v1/chat/completions - Proxy LLM requests with caching
GET /metrics - Cache and cost metrics
GET /health - Health check
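
A quick way to poke the metrics and health endpoints from Python; the base URL is an assumption, and the responses are assumed to be JSON:

import requests

BASE = "http://localhost:8000"  # assumption: adjust to where Sentinel is serving

print(requests.get(f"{BASE}/metrics").json())  # cache hit rate, latency, cost figures
print(requests.get(f"{BASE}/health").json())   # liveness check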

License

MIT
