Skip to content

AWS-Devops-Projects/eks-troubleshooting-agentic-ai-chatops

 
 

Repository files navigation

NOTE: for sample code for this AWS Blog please use the code branch from this repository. This code branch is updated and is related to the AWS guidance below.

Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS

This guidance provides an example of Platform Engineering approach to troubleshooting Amazon EKS (Elastic Kubernetes Service) issues using Agentic AI workflow integrated with ChatOps via Slack

Strands-based AI Agentic workflow Troubleshooting: An intelligent agent using AWS Strands Agent framework with EKS MCP server integration for real-time troubleshooting

It can be deployed using Terraform, which provisions all necessary AWS resources including EKS cluster with compute plane, required add-ons, monitoring tools, and the agentic troubleshooting agent.

Architecture

Reference Architecture - EKS Cluster

Reference Architecture Diagram

Figure 1: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Reference Architecture

Reference Architecture Steps

  1. Amazon EKS Cluster: Managed Kubernetes control plane and worker nodes running containerized workloads, providing the foundation for the agentic troubleshooting system.

  2. Agentic Troubleshooting Agent: Strands-based multi-agent system deployed as pods in the EKS cluster, orchestrating intelligent troubleshooting workflows through specialized agents.

  3. Amazon Bedrock Integration: Provides foundational AI models (Claude for analysis, Titan Embeddings for semantic search) for natural language processing and intelligent troubleshooting recommendations.

  4. S3 Vectors Knowledge Base: Stores vector embeddings of historical troubleshooting cases, enabling semantic search and knowledge retrieval for similar past issues.

  5. EKS MCP Server: Model Context Protocol server enabling real-time Kubernetes cluster interaction, allowing agents to execute kubectl commands and gather cluster state information.

  6. Slack Integration: ChatOps interface using Socket Mode for real-time user interaction, allowing DevOps teams to troubleshoot issues through natural language conversations.

  7. Monitoring & Observability: Prometheus and Grafana stack for cluster health monitoring, metrics collection, and alerting to Slack channels.

Agentic AI Workflow Architecture

Agentic AI workflow Architecture Diagram

Figure 2: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Troubleshooting Workflow

Agentic AI Workflow Architecture Steps

  1. Setup - Agentic troubleshooting agent is deployed into an Amazon EKS cluster, configured with proper IAM permissions, Slack integration, and access to Amazon Bedrock and S3 Vectors knowledge base.

  2. User Interaction - Users (DevOps engineers, SREs, developers) who encounter Kubernetes (K8s) issues send troubleshooting requests through designated Slack channel integrated with K8s Troubleshooting AI Agent. Its components are running as containers on EKS cluster deployed from previously built images hosted in Elastic Container registry (ECR) via Helm charts that reference the services-built images

  3. Message Reception & Slack Integration - Slack establishes a WebSocket connection (Socket Mode) to the Orchestrator agent running in the EKS cluster, enabling real-time bidirectional communication.

  4. Intelligent Message Classification & Orchestration - Orchestrator agent receives users' message and calls Nova Micro model via Amazon Bedrock API to determine whether the message requires K8s troubleshooting. If an issue is classified as K8s-related, the Orchestrator agent initiates a workflow by delegating tasks to specialized agents while maintaining overall session context.

  5. Historical Knowledge Retrieval - Orchestrator agent invokes the Memory agent, which connects to Amazon S3 Vectors based knowledge base to search for similar troubleshooting cases for precise issue classification

  6. Semantic Vector Matching - The Memory agent invokes Titan Embeddings model via Amazon Bedrock API to generate semantic embeddings and perform vector similarity matching against the shared S3 Vectors knowledge base

  7. Real-Time Cluster Intelligence - Orchestrator agent invokes the K8s Specialist agent, which utilizes the hosted AWS EKS Model Context Protocol (MCP) Server to execute commands against the EKS API Server. The Client of MCP Server gathers real-time cluster state, pod logs, events, and resource metrics to better "understand" the current problem context.

  8. Intelligent Issue Analysis - K8s Specialist agent sends the collected cluster data to Anthropic Claude model via Amazon Bedrock for intelligent issue analysis and resolution generation.

  9. Comprehensive Solution Synthesis - Orchestrator agent synthesizes the historical context received from Memory agent and current cluster state from K8s Specialist, then uses Claude model via Amazon Bedrock to generate comprehensive troubleshooting recommendations, which are stored in S3 Vectors for future reference.

  10. ChatOps Integration - Orchestrator agent generates troubleshooting recommendations and sends them back to the Users via integrated Slack channel. This illustrates an increasingly popular "ChatOps" Platform Engineering pattern.

Project Code Structure

├── apps/                           # Application code
│   ├── agentic-troubleshooting/   # Strands-based agentic troubleshooting agent
│   │   ├── src/agents/            # Strands agent implementations
│   │   ├── src/tools/             # EKS MCP tools integration
│   │   ├── helm/                  # Kubernetes deployment charts
│   │   └── main.py                # Strands agent entry point
├── terraform/                      # Infrastructure as Code
│   ├── main.tf                    # Main EKS cluster configuration
│   ├── agentic.tf                 # Strands agent deployment resources
│   ├── modules/                   # Terraform modules (not used in agentic deployment)
│   ├── variables.tf               # Terraform variables
│   └── outputs.tf                 # Terraform outputs
├── static/                        # Static assets
└── demo/                          # Demo scripts and manifests

Prerequisites

Before running this project, make sure you have the following tools installed:

Slack Configuration (Required)

  1. Slack Webhook (Alert Manager notifications):

    • Create incoming webhook in your Slack workspace
    • Note the webhook URL and target channel name
  2. Slack Bot Configuration:

    • Create a Slack app with the following Bot Token Scopes:

      • app_mentions:read - View messages mentioning the bot
      • channels:history - View messages in public channels
      • channels:read - View basic channel information
      • chat:write - Send messages as the bot
      • groups:history - View messages in private channels
      • groups:read - View basic private channel information
      • im:history - View direct messages
      • im:read - View basic DM information
    • Event Subscriptions (enable these events):

      • app_mention - Bot mentions
      • message.channels - Channel messages
      • message.groups - Private channel messages
      • message.im - Direct messages
    • Enable Socket Mode for real-time events

    • Note the Bot Token (xoxb-...), App Token (xapp-...), and Signing Secret

Please see sample settings for Slack aplication OAuth and Scope permissions below:

Sample Slack OAuth application - permissions

Figure 3: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Sample app OAuth permissons

Sample Slack OAuth application - scopes

Figure 4: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Sample app OAuth Scopes

Sample Slack Application adding to Channel

Figure 5: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Adding Sample app to Slack Channel

Plan your Deployment

AWS Services in this Guidance

AWS Service Role Description
Amazon Elastic Kubernetes Service ( EKS) Core service Manages the Kubernetes control plane and compute nodes for container orchestration.
Amazon Elastic Compute Cloud (EC2) Core service Provides the compute instances for EKS compute nodes and runs containerized applications.
Amazon Virtual Private Cloud (VPC) Core Service Creates an isolated network environment with public and private subnets across multiple Availability Zones.
Amazon Simple Storage Service (S3) Core Service Stores vector embeddings for the knowledge base, enabling semantic search of historical troubleshooting cases.
AWS Bedrock Core Service Provides the foundational AI models (Claude for analysis, Titan Embeddings for semantic search) for natural language processing and intelligent troubleshooting recommendations.
Amazon Elastic Container Registry (ECR) Supporting service Stores and manages Docker container images for the agentic troubleshooting agent.
Elastic Load Balancing (NLB) Supporting service Distributes incoming traffic across multiple targets in the EKS cluster.
Amazon Elastic Block Store (EBS) Supporting service Provides persistent block storage volumes for EC2 instances in the EKS cluster.
AWS Identity and Access Management (IAM) Supporting service Manages security permissions and access controls for the agentic agent, ensuring secure interaction with EKS clusters, Bedrock, and S3 Vectors.
AWS Key Management Service (KMS) Security service Manages encryption keys for securing data in EKS and other AWS services.

Cost

You are responsible for the cost of the AWS services used while running this guidance. As of February 2026, the cost for running this guidance with the default settings in the US West (Oregon) Region is approximately $288.81/month.

We recommend creating a budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.

Sample cost table

The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the us-west-2 (Oregon) Region for one month. This estimate is based on the AWS Pricing Calculator output for the agentic deployment. This does not factor heavy Bedrock usage beyond the baseline estimate.

AWS service Dimensions Cost, month [USD]
Amazon EKS 1 cluster $73.00
Amazon VPC 1 NAT Gateways $33.75
Amazon EC2 3 m5.large instances $156.16
Amazon EBS gp3 storage volumes and snapshots $7.20
Elastic Load Balancer 1 NLB for workloads $16.46
Amazon VPC Public IP addresses $3.65
AWS Key Management Service (KMS) Keys and requests $6.00
AWS Bedrock (Claude + Titan) 1M input tokens, 100K output tokens $30.00
Amazon S3 10GB storage, 10K requests $0.59
TOTAL $288.81/month

For a more accurate estimate based on your specific configuration and usage patterns, we recommend using the AWS Pricing Calculator.

Security

When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, the virtualization layer, and the physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.

This guidance implements several security best practices and AWS services to enhance the security posture of your EKS Workload Ready Cluster. Here are the key security components and considerations:

Identity and Access Management (IAM)

  • EKS Managed Node Groups: These use IAM roles with specific permissions required for nodes to join the cluster and for pods to access AWS services.
  • EKS Pod Identity: The agentic troubleshooting agent uses EKS Pod Identity for secure, least-privilege access to AWS services including Bedrock, S3 Vectors, CloudWatch, and EKS APIs.

Network Security

  • Amazon VPC: The EKS cluster is deployed within a custom VPC with public and private subnets across multiple Availability Zones, providing network isolation.
  • Security Groups: Although not explicitly shown in the diagram, security groups are typically used to control inbound and outbound traffic to EC2 instances and other resources within the VPC.
  • NAT Gateways: Deployed in public subnets to allow outbound internet access for resources in private subnets while preventing inbound access from the internet.

Data Protection

  • Amazon EBS Encryption: EBS volumes used by EC2 instances are typically encrypted to protect data at rest.
  • AWS Key Management Service (KMS): Used for managing encryption keys for various services, including EBS volume encryption.

Kubernetes-specific Security

  • Kubernetes RBAC: Role-Based Access Control is implemented within the EKS cluster to manage fine-grained access to Kubernetes resources.
  • EKS Access Entries: The agentic agent uses EKS access entries with cluster admin policy for full cluster access required for troubleshooting operations.

Secrets Management

  • Kubernetes Secrets: Slack credentials (bot token, app token, signing secret) are stored as Kubernetes secrets and mounted into the agent pods.

Additional Security Considerations

  • Regularly update and patch EKS clusters, compute nodes, and container images.
  • Implement network policies to control pod-to-pod communication within the cluster.
  • Use Pod Security Policies or Pod Security Standards to enforce security best practices for pods.
  • Implement proper logging and auditing mechanisms for both AWS and Kubernetes resources.
  • Regularly review and rotate IAM and Kubernetes RBAC permissions.

Deployment Options

Please see detailed Implementation Guide for instructions for guidance deployment, validation, basic troubleshooting and uninstallation options.

References

This guidance uses:

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

About

This Guidance demonstrates how to address the complexity of troubleshooting Amazon EKS environments with multiple metrics and logs by implementing an agentic AI workflow that uses generative AI with RAG-enabled knowledge bases and chat interfaces to accelerate problem diagnosis.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 49.9%
  • Shell 19.8%
  • HCL 19.5%
  • Rich Text Format 7.9%
  • Go Template 1.8%
  • Dockerfile 1.1%