A production-style SRE sandbox that serves an LLM endpoint, injects failures, tracks SLOs, emits Prometheus metrics, and auto-heals on incidents.
- Demonstrate SRE fundamentals applied to inference endpoints
- Instrument SLIs: latency, availability, error rate, GPU utilization
- Run chaos tests and automated remediation
- AI-assisted development with human review and safety guardrails
- LLM Service: Ollama running `smollm`
- API Service: FastAPI wrapper with instrumentation (see the sketch below)
- Monitoring: Prometheus (metrics) + Grafana (visualization)
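A minimal sketch of how such an API service can wrap Ollama behind FastAPI, assuming Ollama's default port (11434), the `smollm` model named above, and an illustrative `/generate` route; the repo's actual routes and settings may differ:

```python
# Illustrative only: a FastAPI wrapper that forwards prompts to Ollama and
# exposes a Prometheus scrape endpoint. OLLAMA_URL and the /generate route
# are assumptions, not the repo's actual API surface.
import os

import httpx
from fastapi import FastAPI, HTTPException
from prometheus_client import make_asgi_app

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrapes this path


@app.post("/generate")
async def generate(payload: dict):
    """Proxy a prompt to Ollama's /api/generate endpoint."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            f"{OLLAMA_URL}/api/generate",
            json={"model": "smollm", "prompt": payload.get("prompt", ""), "stream": False},
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="LLM backend error")
    return resp.json()
```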
| SLI | Captured Metric |
|---|---|
| Latency | `llm_request_latency_seconds_bucket` (histogram) |
| Availability | `llm_request_total` + `llm_request_errors_total` |
| Error Rate | `llm_request_errors_total` |
| GPU Utilization | `llm_gpu_utilization_percent` (simulated) |
| Saturation | `llm_inference_in_flight` |
See `docs/SLO.md` for the SLO definitions.
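The metric names in the table map naturally onto `prometheus_client` instrument types. A sketch of how they might be declared and used, with help strings and the simulated inference body as illustrative placeholders rather than the repo's actual code:

```python
# Sketch of metric declarations matching the SLI table above.
import random
import time

from prometheus_client import Counter, Gauge, Histogram

# Exposed as llm_request_latency_seconds_bucket/_sum/_count
REQUEST_LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")
# prometheus_client appends _total: scraped as llm_request_total / llm_request_errors_total
REQUESTS = Counter("llm_request", "All LLM requests")
ERRORS = Counter("llm_request_errors", "Failed LLM requests")
GPU_UTIL = Gauge("llm_gpu_utilization_percent", "Simulated GPU utilization")
IN_FLIGHT = Gauge("llm_inference_in_flight", "Requests currently being served")


@IN_FLIGHT.track_inprogress()   # saturation
@REQUEST_LATENCY.time()         # latency histogram
def run_inference(prompt: str) -> str:
    REQUESTS.inc()
    try:
        GPU_UTIL.set(random.uniform(40.0, 95.0))  # simulated utilization
        time.sleep(0.1)                           # stand-in for the real call to Ollama
        return f"echo: {prompt}"
    except Exception:
        ERRORS.inc()
        raise
```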
Prometheus alerts trigger container restarts via the remediation script, `remediate.sh`.
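The shipped script is a shell script; as a Python illustration of the same loop, one could poll Prometheus's alerts API and restart the offending container. The alert name, container name, and poll interval below are assumptions:

```python
# Illustrative Python equivalent of the remediation loop (the repo ships
# remediate.sh): poll Prometheus for firing alerts and restart the API container.
import time

import docker
import requests

PROM_ALERTS = "http://localhost:9090/api/v1/alerts"
TARGET_ALERT = "HighErrorRate"   # hypothetical alert name
TARGET_CONTAINER = "llm-api"     # hypothetical container name

client = docker.from_env()

while True:
    alerts = requests.get(PROM_ALERTS, timeout=5).json()["data"]["alerts"]
    firing = {a["labels"].get("alertname") for a in alerts if a.get("state") == "firing"}
    if TARGET_ALERT in firing:
        print(f"{TARGET_ALERT} firing - restarting {TARGET_CONTAINER}")
        client.containers.get(TARGET_CONTAINER).restart()
    time.sleep(30)
```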
- Start the stack: `docker compose up --build -d`
- Run chaos experiments (optional): `python chaos.py` (an example experiment is sketched below)
- Run remediation (optional): `./remediate.sh`
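The kind of experiment `chaos.py` is meant for can be illustrated by freezing the LLM container and probing the API while it is down. This is a sketch, not the script's actual contents; the container name, port 8000, and the `/health` route are assumptions:

```python
# Sketch of one chaos experiment: pause the LLM backend, probe the API,
# then unpause and confirm recovery.
import time

import docker
import requests

client = docker.from_env()
ollama = client.containers.get("ollama")


def healthy() -> bool:
    try:
        return requests.get("http://localhost:8000/health", timeout=2).ok
    except requests.RequestException:
        return False


print("baseline healthy:", healthy())
ollama.pause()        # inject the fault: the backend stops responding
time.sleep(10)
print("during fault healthy:", healthy())
ollama.unpause()      # remove the fault
time.sleep(10)
print("after recovery healthy:", healthy())
```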
The project includes a pre-configured monitoring stack:
- Prometheus: http://localhost:9090
  - Scrapes metrics from the API service and itself.
- Grafana: http://localhost:3000
  - Login: `admin` / `admin`
  - Dashboards: the "LLM Reliability (MVP)" dashboard is automatically provisioned and ready to view.
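The same data behind the dashboard can be pulled straight from Prometheus's HTTP API. For example, a 5-minute availability SLI computed from the request counters (the query shape is illustrative):

```python
# Compute a 5-minute availability SLI directly from the Prometheus HTTP API,
# using the counters listed in the SLI table.
import requests

PROM_QUERY = "http://localhost:9090/api/v1/query"
AVAILABILITY = (
    "1 - (sum(rate(llm_request_errors_total[5m])) "
    "/ sum(rate(llm_request_total[5m])))"
)

resp = requests.get(PROM_QUERY, params={"query": AVAILABILITY}, timeout=5)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"5m availability: {float(result[0]['value'][1]):.4%}")
else:
    print("No samples yet - generate some traffic first.")
```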
To run the test suite (including infrastructure tests):
- `pip install -r requirements.txt`
- `PYTHONPATH=. pytest`
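An infrastructure-style test in that suite might look roughly like the following, assuming the API publishes metrics on port 8000 (adjust to the compose file):

```python
# Hypothetical infrastructure-style test: assert the API's /metrics endpoint
# is reachable and exposes the expected series. Port 8000 is an assumption.
import requests

API_METRICS = "http://localhost:8000/metrics"


def test_metrics_endpoint_exposes_slis():
    resp = requests.get(API_METRICS, timeout=5)
    assert resp.status_code == 200
    for series in ("llm_request_total", "llm_request_latency_seconds"):
        assert series in resp.text, f"missing metric: {series}"
```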
- No secrets in plaintext
- Uses environment variables (see `.env.example`)
- Always rotate keys before reusing this setup in production