# Local Inference

*BRYAN DAVID WHITE edited this page Feb 23, 2026 · 5 revisions*
Run LLM-based knowledge extraction on any OpenAI-compatible local server — llama.cpp, Ollama, vLLM, LocalAI, text-gen-webui, and more. No cloud API keys required.
Source: `src/adapters/local_llm/connector.py`, `src/adapters/local_llm/exhaust.py`
- Airgapped / sovereign — no data leaves your network
- Cost control — zero per-token API costs
- Low latency — GPU on the same machine or LAN
- Dev iteration — iterate on extraction prompts without burning API credits
```bash
pip install -e ".[local]"

# Start any OpenAI-compatible server, e.g.:
./llama-server -m models/llama-3-8b.Q4_K_M.gguf --port 8080

# Configure
export DEEPSIGMA_LLM_BACKEND=local
export DEEPSIGMA_LOCAL_BASE_URL=http://localhost:8080
export EXHAUST_USE_LLM=1
```

| Variable | Default | Description |
|---|---|---|
| `DEEPSIGMA_LLM_BACKEND` | `anthropic` | Set to `local` for local inference |
| `DEEPSIGMA_LOCAL_BASE_URL` | `http://localhost:8080` | Server URL |
| `DEEPSIGMA_LOCAL_API_KEY` | (empty) | Bearer token if the server requires auth |
| `DEEPSIGMA_LOCAL_MODEL` | (empty) | Model name; empty = server default |
| `DEEPSIGMA_LOCAL_TIMEOUT` | `120` | HTTP timeout (seconds) |
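The variables above can be gathered into a single settings dict with the documented defaults. A minimal sketch — the `local_llm_config` helper is illustrative, not part of the adapter's API:

```python
import os

def local_llm_config() -> dict:
    """Collect DeepSigma local-inference settings from the environment.

    Defaults mirror the configuration table above; illustrative only.
    """
    return {
        "backend": os.environ.get("DEEPSIGMA_LLM_BACKEND", "anthropic"),
        "base_url": os.environ.get("DEEPSIGMA_LOCAL_BASE_URL", "http://localhost:8080"),
        "api_key": os.environ.get("DEEPSIGMA_LOCAL_API_KEY", ""),
        "model": os.environ.get("DEEPSIGMA_LOCAL_MODEL", ""),
        "timeout": float(os.environ.get("DEEPSIGMA_LOCAL_TIMEOUT", "120")),
    }
```

Unset variables fall back to the table defaults, so a bare environment still yields a usable configuration pointed at `http://localhost:8080`.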
```python
from adapters.local_llm import LlamaCppConnector

connector = LlamaCppConnector()
print(connector.health())

result = connector.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the decision."},
])
print(result["text"])
```

When `DEEPSIGMA_LLM_BACKEND=local` and `EXHAUST_USE_LLM=1`, the exhaust refiner routes LLM extraction through the local server automatically — no code changes needed.
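The routing rule can be pictured as a small dispatch on the two environment variables. A hedged sketch — `select_backend` and its return values are illustrative, not the refiner's actual code:

```python
import os

def select_backend() -> str:
    """Return which LLM backend the exhaust refiner would use.

    Illustrative only: mirrors the documented behaviour where
    EXHAUST_USE_LLM=1 is the master switch and DEEPSIGMA_LLM_BACKEND
    picks the provider (default: anthropic).
    """
    if os.environ.get("EXHAUST_USE_LLM", "0") != "1":
        return "disabled"  # master switch off: no LLM extraction at all
    return os.environ.get("DEEPSIGMA_LLM_BACKEND", "anthropic")
```

With `EXHAUST_USE_LLM=1` and `DEEPSIGMA_LLM_BACKEND=local`, extraction goes to the local server; with the switch off, no backend is called at all.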
```python
from adapters.local_llm import LlamaCppConnector
from adapters.local_llm.exhaust import LocalLLMExhaustAdapter

connector = LlamaCppConnector()
adapter = LocalLLMExhaustAdapter(connector, project="my-project")
result = adapter.chat_with_exhaust([{"role": "user", "content": "Key risks?"}])
```

| Server | Notes |
|---|---|
| llama.cpp (`llama-server`) | Reference implementation |
| Ollama | `DEEPSIGMA_LOCAL_BASE_URL=http://localhost:11434` |
| vLLM | OpenAI-compatible mode |
| LocalAI | Drop-in OpenAI replacement |
| text-generation-webui | Enable the `--api` flag |
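Switching servers only changes the environment. For example, pointing at a local Ollama instance per the table above (the model name `llama3` is a placeholder, not a required value):

```shell
# Ollama example: same backend switch, different port
export DEEPSIGMA_LLM_BACKEND=local
export DEEPSIGMA_LOCAL_BASE_URL=http://localhost:11434
export DEEPSIGMA_LOCAL_MODEL=llama3   # placeholder; use any model you have pulled
export EXHAUST_USE_LLM=1
```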
- Default backend remains `anthropic` — zero changes to existing deployments
- `EXHAUST_USE_LLM=1` remains the master on/off switch
- `ANTHROPIC_API_KEY` is only required when the backend is `anthropic`
- Exhaust Inbox — Full extraction pipeline docs
- Snowflake — Cortex AI connector (similar pattern)
- AskSage — AskSage connector + exhaust adapter
Full documentation: `docs/30-local-inference.md`
Σ OVERWATCH — Coherence Ops Platform • Current release: v2.1.0 • DeepSigma