OnCallAI is an AI-powered incident triage and root cause analysis prototype for DevOps and SRE workflows. It ingests incidents, gathers surrounding context, analyzes logs, records agent activity, and produces a structured incident report through a lightweight Streamlit interface.
This project is designed as an explainable, hackathon-friendly foundation for building an autonomous on-call assistant. The current implementation focuses on local execution, deterministic workflows, and a clear architecture that can be extended with real alert sources, richer retrieval, and production-grade orchestration.
Modern on-call teams lose time switching between alerts, logs, dashboards, and tribal knowledge. OnCallAI aims to shorten that path by centralizing the first-response workflow:
- Accept an incident record.
- Collect relevant operational context.
- Analyze available logs and evidence.
- Generate an RCA-style summary with suggested mitigations.
- Expose the full processing trail in a transparent UI.
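As a rough illustration of what flows through that workflow, an incident record might look like the following sketch (field names and statuses are assumptions for illustration, not the actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical incident record; the real schema lives in app/db/schema.sql.
@dataclass
class Incident:
    id: int
    title: str
    service: str
    severity: str = "SEV2"
    status: str = "OPEN"  # assumed lifecycle: OPEN -> IN_PROGRESS -> DONE
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

inc = Incident(id=1, title="API latency spike", service="checkout")
print(inc.status)  # OPEN
```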
- Incident intake backed by SQLite for simple local development.
- Step-by-step execution tracking for collector, analyst, and supervisor stages.
- Retrieval-assisted log analysis that combines heuristic rules with example-based incident context.
- Downloadable incident reports in both JSON and Markdown formats.
- Streamlit dashboard for filtering incidents, reviewing agent timelines, and inspecting generated reports.
- Environment-based configuration for polling, models, logging mode, and optional cloud integrations.
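A toy version of the retrieval-assisted analysis feature (the rules and example corpus below are invented for illustration, not the project's actual data) could combine pattern rules with a crude similarity lookup:

```python
# Toy heuristic log analysis: match known error patterns, then pull the
# most lexically similar example from a tiny incident corpus.
RULES = {
    "OutOfMemoryError": "Likely memory exhaustion; check heap limits.",
    "connection refused": "Downstream dependency may be unreachable.",
    "disk full": "Host storage is exhausted; rotate or expand volumes.",
}

EXAMPLE_CORPUS = [
    "Payment service crashed with OutOfMemoryError after deploy",
    "Ingress returned 502 because upstream connection refused",
]

def analyze(log_line: str) -> dict:
    findings = [hint for pattern, hint in RULES.items()
                if pattern.lower() in log_line.lower()]
    # Crude similarity: shared-word overlap with each corpus example.
    words = set(log_line.lower().split())
    best = max(EXAMPLE_CORPUS,
               key=lambda ex: len(words & set(ex.lower().split())))
    return {"findings": findings, "similar_incident": best}

result = analyze("checkout pod killed: OutOfMemoryError at heap allocation")
print(result["findings"])
```

A production version would swap the word-overlap scoring for embedding similarity against a vector store, which is what the retrieval stubs in `app/rag/` are positioned for.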
OnCallAI follows a simple agent-inspired pipeline:
- The runner polls for incidents with `OPEN` status.
- The collector stage retrieves context and logs for the incident.
- The analyst stage evaluates evidence and drafts findings.
- The supervisor stage writes the final report and marks the incident complete.
- The UI reads the persisted data and displays incident state, steps, and outputs.
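The stage hand-off above can be sketched roughly as follows (the function names, payload shape, and status values are illustrative, not the actual `app/runner.py` API):

```python
def collect_context(inc):  # collector stage
    inc["logs"] = ["sample log line"]
    return inc

def analyze_evidence(inc):  # analyst stage
    inc["findings"] = ["possible deploy regression"]
    return inc

def write_report(inc):  # supervisor stage
    inc["report"] = {"summary": "...", "mitigations": inc["findings"]}
    return inc

def process_incident(incident: dict) -> dict:
    # Each stage enriches the incident payload in turn.
    for stage in (collect_context, analyze_evidence, write_report):
        incident = stage(incident)
    incident["status"] = "DONE"
    return incident

def run_once(queue):
    # Poll for OPEN incidents and push each through the pipeline.
    return [process_incident(i) for i in queue if i["status"] == "OPEN"]

done = run_once([{"id": 1, "status": "OPEN"}])
print(done[0]["status"])  # DONE
```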
- `app/runner.py`: Main polling loop and incident execution flow.
- `app/db/dal.py`: Database access layer for incidents, steps, and reports.
- `app/agents/collector_agent.py`: Log selection and retrieval logic.
- `app/agents/analyst_agent.py`: Retrieval-assisted analysis and mitigation generation.
- `app/agents/supervisor.py`: Report compilation and workflow completion.
- `ui/streamlit_app.py`: Operator-facing incident dashboard.
- `app/db/schema.sql`: SQLite schema for incidents, agent steps, and reports.
- `tests/`: Lightweight unit and flow tests using `unittest`.
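To give a feel for how the DAL persists incidents and agent steps, here is a minimal sketch using the standard-library `sqlite3` module (the table and column names are guesses; the authoritative definitions are in `app/db/schema.sql`):

```python
import sqlite3

# Guessed-at minimal schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS incidents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'OPEN'
);
CREATE TABLE IF NOT EXISTS agent_steps (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    incident_id INTEGER REFERENCES incidents(id),
    stage TEXT NOT NULL,          -- collector | analyst | supervisor
    detail TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO incidents (id, title) VALUES (1, 'API latency spike')")
conn.execute(
    "INSERT INTO agent_steps (incident_id, stage, detail) VALUES (?, ?, ?)",
    (1, "collector", "fetched 120 log lines"),
)
row = conn.execute(
    "SELECT stage FROM agent_steps WHERE incident_id = 1"
).fetchone()
print(row[0])  # collector
```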
OnCallAI/
├── app/
│ ├── agents/ # Collector, analyst, and supervisor logic
│ ├── db/ # Schema and data access layer
│ ├── middleware/ # CloudWatch and polling adapters
│ ├── rag/ # RAG-related stubs and loaders
│ ├── config.py # Environment-driven configuration
│ └── runner.py # Main incident processing loop
├── rag_pipeline/ # Experimental retrieval pipeline components
├── scripts/ # Seeding and local setup helpers
├── tests/ # Lightweight unit and incident-flow tests
├── ui/ # Streamlit application
├── Makefile
├── requirements.txt
└── README.md
- Python 3
- Streamlit
- SQLite
- SQLAlchemy
- LangChain and LangGraph dependencies for future orchestration expansion
- Chroma / FAISS / Pinecone libraries for retrieval experimentation
- Optional OpenAI and AWS integrations via environment configuration
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

At minimum, review these values in `.env`:
- `POLL_INTERVAL_SECONDS`
- `DB_URL`
- `LOGS_LOCAL_ROOT`
- `USE_REAL_CLOUDWATCH`
- `OPENAI_API_KEY` if you plan to extend the project with hosted models
make seed

This populates sample incident data and uses the bundled log examples already stored in the repository.
make run

In a separate terminal:
make ui

make test

make seed # Seed sample data
make run # Start incident polling and processing
make ui # Launch the Streamlit app
make test # Run the unit and incident-flow test suite
make clean # Remove the local SQLite db and generated reports

The project is configured primarily through environment variables.
- `ENV`: Environment name, default `dev`
- `POLL_INTERVAL_SECONDS`: Runner poll interval
- `OPENAI_API_KEY`: Optional API key for future LLM-backed flows
- `EMBEDDINGS_MODEL`: Embedding model identifier
- `LLM_MODEL`: Chat model identifier
- `VECTOR_BACKEND`: Retrieval backend, such as `chroma` or `faiss`
- `DB_URL`: SQLAlchemy-style database URL for external integrations
- `DB_FILE`: SQLite file used by the local DAL, defaults to `dev.db`
- `LOGS_MODE`: `local` or `s3`
- `LOGS_LOCAL_ROOT`: Path to bundled local logs
- `USE_REAL_CLOUDWATCH`: Enables real CloudWatch integration when set to `true`
- `CLOUDWATCH_LOG_GROUP`: CloudWatch log group name
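A typical env-driven loader for these variables might look like the sketch below (the defaults shown are guesses, not necessarily what `app/config.py` uses):

```python
import os

# Illustrative config loader; defaults here are assumptions.
class Config:
    ENV = os.getenv("ENV", "dev")
    POLL_INTERVAL_SECONDS = int(os.getenv("POLL_INTERVAL_SECONDS", "5"))
    DB_FILE = os.getenv("DB_FILE", "dev.db")
    LOGS_MODE = os.getenv("LOGS_MODE", "local")
    LOGS_LOCAL_ROOT = os.getenv("LOGS_LOCAL_ROOT", "./logs")
    # Boolean flags arrive as strings, so normalize them explicitly.
    USE_REAL_CLOUDWATCH = (
        os.getenv("USE_REAL_CLOUDWATCH", "false").lower() == "true"
    )

print(Config.ENV, Config.USE_REAL_CLOUDWATCH)
```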
The current demo path is intentionally simple and transparent:
- Incidents are stored in SQLite.
- The runner picks up incidents with `OPEN` status.
- The runner dispatches the incident through collector, analyst, and supervisor stages.
- The analyst combines rule-based matching with retrieved examples from the bundled incident corpus.
- Agent steps are written back to the database as the incident is processed.
- Reports are stored in structured JSON plus Markdown.
- The UI reads directly from the database and lets you inspect filters, timelines, payloads, and final reports.
This makes the project easy to demo, debug, and extend locally.
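Emitting the same report as both JSON and Markdown can be as simple as the following sketch (the report structure is invented for illustration; the actual fields come from the supervisor stage):

```python
import json

# Hypothetical report payload, not the project's real schema.
report = {
    "incident_id": 1,
    "summary": "Checkout latency spike after 14:02 deploy",
    "root_cause": "Connection pool exhaustion in payment client",
    "mitigations": ["Roll back deploy", "Raise pool size"],
}

def to_markdown(r: dict) -> str:
    lines = [f"# Incident {r['incident_id']}",
             f"**Summary:** {r['summary']}",
             f"**Root cause:** {r['root_cause']}",
             "## Mitigations"]
    lines += [f"- {m}" for m in r["mitigations"]]
    return "\n".join(lines)

json_doc = json.dumps(report, indent=2)  # machine-readable artifact
md_doc = to_markdown(report)             # human-readable artifact
print(md_doc.splitlines()[0])  # # Incident 1
```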
This repository is best understood as a strong prototype rather than a production-ready incident response platform.
- The default runner currently uses a simplified processing path.
- The analyst uses lightweight local retrieval rather than a full production retrieval pipeline or LLM-backed reasoning engine.
- Cloud integrations are present as stubs or optional extensions.
- Authentication, authorization, retries, and multi-tenant concerns are not implemented.
- The UI is optimized for local inspection and demos rather than operational scale.
Good next steps for evolving OnCallAI include:
- Connect real alert providers such as CloudWatch, PagerDuty, or Opsgenie.
- Replace heuristic analysis with retrieval-backed or model-backed reasoning.
- Add incident enrichment from dashboards, deploy metadata, and service ownership data.
- Introduce alert deduplication and correlation across multiple signals.
- Add automated remediation suggestions with human approval gates.
- Expand the UI into a richer operations console with filtering and search.
OnCallAI can be positioned as:
- A hackathon project focused on agentic operations tooling.
- A portfolio project demonstrating AI-assisted incident workflows.
- A prototype for internal SRE automation experiments.
- A foundation for future RCA copilots and on-call support systems.
If you are iterating on the project, a practical workflow is:
- Create or seed incidents.
- Run the processor locally.
- Inspect output in the Streamlit UI.
- Improve collection, analysis, or report generation logic.
- Re-run with fresh sample data.
Keep changes small and test the runner and UI together when modifying core incident flow.
This repository includes a LICENSE file at the root. Review it before external reuse or distribution.