
Local Inference

BRYAN DAVID WHITE edited this page Feb 23, 2026 · 5 revisions


Run LLM-based knowledge extraction on any OpenAI-compatible local server — llama.cpp, Ollama, vLLM, LocalAI, text-gen-webui, and more. No cloud API keys required.

Source: src/adapters/local_llm/connector.py, src/adapters/local_llm/exhaust.py


Why Local?

  • Airgapped / sovereign — no data leaves your network
  • Cost control — zero per-token API costs
  • Low latency — GPU on the same machine or LAN
  • Dev iteration — iterate on extraction prompts without burning API credits

Setup

pip install -e ".[local]"

# Start any OpenAI-compatible server, e.g.:
./llama-server -m models/llama-3-8b.Q4_K_M.gguf --port 8080

# Configure
export DEEPSIGMA_LLM_BACKEND=local
export DEEPSIGMA_LOCAL_BASE_URL=http://localhost:8080
export EXHAUST_USE_LLM=1

Environment Variables

Variable                    Default                  Description
DEEPSIGMA_LLM_BACKEND       anthropic                Set to local for local inference
DEEPSIGMA_LOCAL_BASE_URL    http://localhost:8080    Server URL
DEEPSIGMA_LOCAL_API_KEY     (empty)                  Bearer token if the server requires auth
DEEPSIGMA_LOCAL_MODEL       (empty)                  Model name; empty = server default
DEEPSIGMA_LOCAL_TIMEOUT     120                      HTTP timeout in seconds
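As an illustrative sketch (not the project's actual config loader), the variables above can be read with their documented defaults using only the standard library; the variable names and defaults come from the table, the function name is made up:

```python
import os

def load_local_llm_config() -> dict:
    """Read the DEEPSIGMA_* settings with the documented defaults.

    Illustrative only -- the real connector may parse these differently.
    """
    return {
        "backend": os.environ.get("DEEPSIGMA_LLM_BACKEND", "anthropic"),
        "base_url": os.environ.get("DEEPSIGMA_LOCAL_BASE_URL", "http://localhost:8080"),
        "api_key": os.environ.get("DEEPSIGMA_LOCAL_API_KEY", ""),
        "model": os.environ.get("DEEPSIGMA_LOCAL_MODEL", ""),
        "timeout": float(os.environ.get("DEEPSIGMA_LOCAL_TIMEOUT", "120")),
    }
```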

Usage

Direct

from adapters.local_llm import LlamaCppConnector

connector = LlamaCppConnector()  # configured via DEEPSIGMA_LOCAL_* environment variables
print(connector.health())       # confirm the server is reachable before chatting

result = connector.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the decision."},
])
print(result["text"])

Exhaust Pipeline (automatic)

When DEEPSIGMA_LLM_BACKEND=local and EXHAUST_USE_LLM=1, the exhaust refiner routes LLM extraction through the local server automatically — no code changes needed.
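The routing decision can be pictured with a small sketch. This is a guess at the shape of the logic, not the refiner's actual code; the function name is hypothetical, the environment variables and defaults are the documented ones:

```python
import os

def pick_exhaust_backend():
    """Return which backend the exhaust refiner would route LLM extraction to.

    Hypothetical illustration of the behavior described above:
    EXHAUST_USE_LLM gates extraction entirely, and
    DEEPSIGMA_LLM_BACKEND selects local vs. anthropic.
    """
    if os.environ.get("EXHAUST_USE_LLM") != "1":
        return None  # LLM extraction disabled
    backend = os.environ.get("DEEPSIGMA_LLM_BACKEND", "anthropic")
    return "local" if backend == "local" else "anthropic"
```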

Exhaust Adapter (manual)

from adapters.local_llm import LlamaCppConnector
from adapters.local_llm.exhaust import LocalLLMExhaustAdapter

connector = LlamaCppConnector()
adapter = LocalLLMExhaustAdapter(connector, project="my-project")
result = adapter.chat_with_exhaust([{"role": "user", "content": "Key risks?"}])
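The adapter pattern itself is easy to picture: delegate the chat call to the connector and capture each prompt/response pair as "exhaust" for later extraction. A toy sketch of that pattern, assuming only the delegate-and-record behavior (this is not the real LocalLLMExhaustAdapter):

```python
class ExhaustCapture:
    """Toy version of the exhaust-adapter pattern: forward chat to a
    connector and record every exchange for downstream extraction."""

    def __init__(self, connector, project):
        self.connector = connector
        self.project = project
        self.exhaust = []  # captured (messages, result) records

    def chat_with_exhaust(self, messages):
        result = self.connector.chat(messages)
        self.exhaust.append({
            "project": self.project,
            "messages": messages,
            "result": result,
        })
        return result
```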

Tested Servers

Server                      Notes
llama.cpp (llama-server)    Reference implementation
Ollama                      Set DEEPSIGMA_LOCAL_BASE_URL=http://localhost:11434
vLLM                        Run in OpenAI-compatible mode
LocalAI                     Drop-in OpenAI replacement
text-generation-webui       Enable the --api flag
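All of the servers above expose the same OpenAI-style POST /v1/chat/completions endpoint, so a request built once works against any of them. A minimal stdlib-only sketch of building such a request (the payload fields are the standard OpenAI ones; the helper name is made up):

```python
import json
import urllib.request

def build_chat_request(base_url, messages, model="", api_key=""):
    """Build an OpenAI-compatible chat-completions request.

    The same request shape works against llama.cpp, Ollama, vLLM,
    LocalAI, and text-generation-webui in API mode.
    """
    payload = {"messages": messages}
    if model:
        payload["model"] = model  # omit to use the server default
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )
```

Sending it with urllib.request.urlopen(...) returns a JSON body whose reply text sits at choices[0].message.content, per the OpenAI response format.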

Backward Compatibility

  • Default backend remains anthropic — zero changes to existing deployments
  • EXHAUST_USE_LLM=1 remains the master on/off switch
  • ANTHROPIC_API_KEY only required when backend is anthropic

Related Pages

  • Exhaust Inbox — Full extraction pipeline docs
  • Snowflake — Cortex AI connector (similar pattern)
  • AskSage — AskSage connector + exhaust adapter

Full documentation: docs/30-local-inference.md
