ATLAS - A production-ready agentic AI system that executes DevOps operations autonomously, with enterprise-grade approval workflows and audit logging
This is NOT just a framework or CLI tool. ATLAS is a complete, production-ready AI DevOps platform that:
β
Executes Real Operations - Doesn't just suggest kubectl commands, it runs them for you
β
Approves Dangerous Actions - Built-in approval workflow for destructive operations
β
Logs Everything - 30-day audit trail of all executions
β
Production Ready - Enterprise security, monitoring, backups from day 1
β
Self-Hosted - Complete control, no cloud lock-in
Before (Traditional Chat):
User: "Check pod status"
AI: "You can check pod status with: kubectl get pods -n production"
User: *copies and runs command manually*
Now (Agentic Execution):
User: "Check pod status"
ATLAS: "I'll check that for you!"
*Executes kubectl_get_pods*
"Here are your pods: backend-python (Running), frontend (Running)..."
User: *Gets results immediately*
- 26 Tools - Real DevOps operations execution, not just suggestions
- Self-Healing - Automatically restart failed pods and scale deployments
- Predictive Operations - Prevent issues before they occur with trend analysis
- Security Auto-Remediation - Scan and fix security issues automatically
- 3 Approval Modes - STRICT/NORMAL/AUTO for different security needs
- Tool Validation Layer - Automatic result validation after each operation
- Comprehensive Audit Trail - Track all executed operations with full context
- Explicit Reasoning - See agent's thinking process with
<think>and<plan>tags - Multi-Step Planning - Systematic approach to complex tasks
- Resource Efficiency Analysis - Identify over/under-provisioned resources
- Kubernetes - Full cluster management (13 tools)
- Predictive Analytics - Resource exhaustion prediction, pattern detection (4 tools)
- Security - Vulnerability scanning and automated fixes (2 tools)
- Docker, Git, Prometheus - Complete DevOps toolkit
- Audit Logging - Every execution logged with timestamps, user, status
- Tool Chaining - Multi-step operations executed automatically
- Next.js Frontend - Beautiful, modern UI with agent mode
- Python Backend - FastAPI with Claude API integration
- Rust Backend - High-performance service for CPU-intensive tasks
- Payload CMS - Headless CMS for content management
- Multiple Databases - PostgreSQL, Redis, RabbitMQ
- Pod Security Standards - Restricted level for production
- RBAC - Least privilege access control
- Network Policies - Default deny ingress/egress
- Security Scanning - Trivy, GitLeaks, Checkov, Semgrep in CI/CD
- Secrets Management - Infisical with Kubernetes operator
- Metrics - Prometheus + Thanos for long-term storage
- Logs - Loki + Promtail for centralized logging
- Dashboards - Grafana with pre-built dashboards (cluster, SLO, cost)
- Alerting - Unified alerting + Grafana OnCall
- SLO Tracking - 99.9% availability target with error budget
- Automated Backups - Velero for K8s, PostgreSQL backups every 6h
- Disaster Recovery - Complete runbook with RTO < 4h, RPO < 6h
- Auto-scaling - HPA for all services
- High Availability - Multi-replica deployments
- Frontend: Next.js (React)
- CMS: Payload CMS (headless)
- Backend: Python (FastAPI) + Rust (Actix)
- Database: PostgreSQL (HA)
- Cache: Redis Cluster
- Queue: RabbitMQ
- Storage: MinIO (S3-compatible)
- Orchestration: Self-hosted Kubernetes
- CI/CD: GitLab CI/CD
- Monitoring: Prometheus + Grafana + Loki + Thanos
- Secrets: Infisical (self-hosted)
- On-Call: Grafana OnCall
- Security: Trivy, GitLeaks, Checkov, Pod Security Standards
devops-agent/
βββ apps/ # π― Applications
β βββ backend-python/ # FastAPI + Claude + Agentic Engine β
β β βββ app/core/ # Execution engine, tools, executors
β βββ frontend/ # Next.js + Agent UI
β βββ payload-cms/ # Headless CMS
β βββ rust-backend/ # High-performance Rust service
β βββ database/ # PostgreSQL, Redis, RabbitMQ configs
βββ kubernetes/ # βΈοΈ Kubernetes Manifests
β βββ base/ # Namespaces, RBAC, NetworkPolicies
β βββ apps/ # Application deployments
β βββ backup/ # Velero + PostgreSQL backups β
βββ terraform/ # ποΈ Infrastructure as Code
β βββ main.tf # Main configuration
β βββ modules/ # Reusable modules
βββ monitoring/ # π Observability Stack
β βββ prometheus/ # Metrics + SLO alerts β
β βββ grafana/ # Dashboards (cluster, SLO, cost) β
β βββ loki/ # Log aggregation
βββ security/ # π Security
β βββ infisical/ # Secrets management
βββ scripts/ # π€ Automation
β βββ maintenance/ # Cleanup, maintenance tasks β
βββ docker/ # π³ Dockerfiles
β βββ Dockerfile.python # Optimized multi-stage
β βββ Dockerfile.nextjs # Production Next.js
β βββ Dockerfile.rust # Optimized Rust build
βββ docs/ # π Documentation
β βββ runbooks/ # Deployment, rollback, DR β
β βββ architecture/ # System architecture
βββ .gitlab-ci.yml # CI/CD pipeline with security
βββ AGENTIC_FEATURES.md # β Agent features guide
βββ MCP_RESEARCH.md # β MCP servers research
βββ DEPLOYMENT_GUIDE.md # Full deployment guide
βββ QUICK_START_LOCAL.md # 5-minute local setup
β = New/Major components
Perfect for testing and development:
# 1. Clone repository
git clone https://github.com/kosiorkosa47/devops-agent.git
cd devops-agent
# 2. Start databases with Docker
docker run -d --name postgres -p 5432:5432 \
-e POSTGRES_USER=devops \
-e POSTGRES_PASSWORD=devops123 \
-e POSTGRES_DB=devops_agent \
postgres:16-alpine
docker run -d --name redis -p 6379:6379 redis:7-alpine
# 3. Backend (Python + Claude)
cd apps/backend-python
pip install poetry
poetry install
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
poetry run python -m app.main
# 4. Frontend (Next.js)
cd apps/frontend
npm install
cp .env.example .env.local
npm run dev
# 5. Open http://localhost:3000
# 6. Switch to Agent Mode π§
# 7. Try: "List all pods in production namespace"See QUICK_START_LOCAL.md for detailed local setup.
Production-ready deployment:
# 1. Clone repository
git clone https://github.com/kosiorkosa47/devops-agent.git
cd devops-agent
# 2. Configure environment
cp apps/backend-python/.env.example apps/backend-python/.env
# Add ANTHROPIC_API_KEY and other secrets
# 3. Deploy infrastructure
kubectl apply -k kubernetes/base/
kubectl apply -f apps/database/
# 4. Deploy applications
kubectl apply -f kubernetes/apps/
# 5. Install monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring -f monitoring/prometheus/values.yaml
# 6. Install backups
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
-n velero -f kubernetes/backup/velero-values.yaml
# 7. Access application
kubectl get ingress -n productionSee DEPLOYMENT_GUIDE.md for complete deployment instructions.
- Grafana: https://grafana.yourdomain.com
- Prometheus: https://prometheus.yourdomain.com
- Infisical: https://secrets.yourdomain.com
- Availability: 99.9% target
- Latency: P95 < 200ms, P99 < 500ms
- Error Rate: < 0.1%
- Resource Usage: CPU < 70%, Memory < 80%
β Pod Security Standards (restricted level) β RBAC with least privilege β Network Policies (default deny) β Container image scanning (Trivy) β Secrets management (Infisical) β TLS everywhere β Security scanning in CI/CD β Audit logging
- GDPR considerations
- Security best practices (CIS Kubernetes Benchmark)
- Regular security audits
# Deploy to dev
gitlab-ci-multi-runner exec docker deploy-dev
# Deploy to staging
gitlab-ci-multi-runner exec docker deploy-staging
# Deploy to production (requires approval)
# Use GitLab UI for production deployments# Quick rollback
kubectl rollout undo deployment/<app-name> -n production
# Or use GitLab CI rollback job# Check cluster health
kubectl get pods -A
kubectl top nodes
kubectl top pods -A
# Check recent alerts
curl http://alertmanager:9093/api/v1/alerts- AGENTIC_FEATURES.md β - Complete guide to agent capabilities
- QUICK_START_LOCAL.md - 5-minute local development setup
- DEPLOYMENT_GUIDE.md - Full Kubernetes deployment
- MCP_RESEARCH.md - MCP servers integration research
- Deployment Runbook - Production deployment procedures
- Rollback Runbook - Emergency rollback procedures
- Disaster Recovery - DR procedures (RTO < 4h)
- Architecture Overview - System design and components
- PROJECT_COMPLETION_REPORT.md - Full implementation report
β ALL SPRINTS COMPLETE - 100% Production Ready
- β Sprint 1: Core Infrastructure (Terraform, K8s, CI/CD, MinIO)
- β Sprint 2: Monitoring & Observability (Prometheus, Grafana, Loki)
- β Sprint 3: Security & Compliance (Scanning, Infisical, Pod Security)
- β Sprint 4: Applications (Next.js, Python, Rust, Payload CMS)
- β Sprint 5: Data Layer (PostgreSQL, Redis, RabbitMQ)
- β Sprint 6: Reliability & SRE (Backups, SLO, DR, Cost optimization)
- β Sprint 7: Documentation (Runbooks, Architecture, Guides)
- β Sprint 8: Automation & Maintenance (Scripts, CronJobs)
BONUS: β Agentic Execution Engine with 20+ tools
See devops-tasks.json for full task breakdown.
User: "Show me all pods in production namespace"
ATLAS: *Executes kubectl_get_pods(namespace="production")*
"Here are your production pods:
| Name | Status | Restarts | Age |
|------|--------|----------|-----|
| backend-python-abc | Running | 0 | 2d |
| frontend-xyz | Running | 1 | 1d |
| redis-123 | Running | 0 | 5d |"
User: "Why is backend-python crashing?"
ATLAS: *Executes kubectl_describe_pod*
*Executes kubectl_get_pod_logs*
*Executes kubectl_get_events*
"I found the issue! Your backend pod is crashing due to:
- OOMKilled (out of memory)
- Memory limit: 512Mi
- Actual usage: 600Mi
Recommendation: Increase memory limit to 1Gi"
User: "Scale backend to 5 replicas"
ATLAS: "β οΈ I can scale backend-python to 5 replicas.
This requires approval.
Current: 3 replicas
New: 5 replicas
Namespace: production
[Approve] [Reject]"
User: *clicks Approve*
ATLAS: *Executes kubectl_scale_deployment*
"β
Scaled backend-python to 5 replicas successfully!
All 5 pods are now Running and healthy."
- β
kubectl_get_pods- List pods in namespace - β
kubectl_get_pod_logs- Get pod logs - β
kubectl_describe_pod- Detailed pod info - β
kubectl_get_deployments- List deployments β οΈ kubectl_scale_deployment- Scale replicas (requires approval)β οΈ kubectl_delete_pod- Delete pod (requires approval)- β
kubectl_get_events- View K8s events - β
kubectl_top_pods- Resource usage
- Docker operations (ps, logs, inspect)
- Git operations (status, log, diff)
- Prometheus queries
- Health checks and error analysis
See AGENTIC_FEATURES.md for complete tool list.
Contributions welcome! This project follows best practices:
- β All changes via CI/CD
- β Security scanning on all commits
- β Infrastructure as Code (Terraform)
- β Full test coverage
- β Comprehensive documentation
MIT License - Feel free to use and modify
π Files: 90+
π» Lines of Code: ~15,000
π§ Languages: Python, TypeScript, Rust, YAML
π¦ Components: 28+
β±οΈ Time to Production: 1 day
β
Completion: 100%
π― Status: Production Ready
- Add your
ANTHROPIC_API_KEY - Deploy locally or to Kubernetes
- Switch to Agent Mode in UI
- Execute your first DevOps operation!
- Multi-cluster support
- Slack/Teams integration
- AI-powered cost recommendations
- Automatic incident response
- Terraform operation tools
- Helm operation tools
For issues or questions:
- Documentation: Check docs/ and runbooks
- Issues: Open an issue on GitHub
- Logs: Review in Grafana dashboards
- Alerts: Check Grafana OnCall for incidents
Built with these amazing open-source projects:
- Claude AI - Anthropic's AI assistant
- Kubernetes - Container orchestration
- Prometheus - Monitoring system
- Grafana - Observability platform
- Next.js - React framework
- FastAPI - Python web framework
- Rust - Systems programming language
β Star this repo if you find it useful!
Built with β€οΈ for the DevOps community
π€ The most complete AI DevOps platform - from framework to production deployment
Why ATLAS is different:
- π― Complete application, not just a framework
- β Enterprise approval workflow built-in
- π Full observability and cost optimization
- π Production-ready security from day one
- π Comprehensive documentation and runbooks