NOTE: For the sample code that accompanies this AWS Blog, please use the code branch of this repository. That branch has been updated and corresponds to the AWS guidance below.
This guidance provides an example of a Platform Engineering approach to troubleshooting Amazon Elastic Kubernetes Service (Amazon EKS) issues using an agentic AI workflow integrated with ChatOps via Slack.
Strands-based agentic AI troubleshooting workflow: an intelligent agent built on the AWS Strands Agent framework, with EKS MCP server integration for real-time troubleshooting.
The guidance can be deployed using Terraform, which provisions all necessary AWS resources, including an EKS cluster with its compute plane, required add-ons, monitoring tools, and the agentic troubleshooting agent.
Figure 1: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Reference Architecture
- Amazon EKS Cluster: Managed Kubernetes control plane and worker nodes running containerized workloads, providing the foundation for the agentic troubleshooting system.
- Agentic Troubleshooting Agent: Strands-based multi-agent system deployed as pods in the EKS cluster, orchestrating intelligent troubleshooting workflows through specialized agents.
- Amazon Bedrock Integration: Provides foundational AI models (Claude for analysis, Titan Embeddings for semantic search) for natural language processing and intelligent troubleshooting recommendations.
- S3 Vectors Knowledge Base: Stores vector embeddings of historical troubleshooting cases, enabling semantic search and knowledge retrieval for similar past issues.
- EKS MCP Server: Model Context Protocol server enabling real-time Kubernetes cluster interaction, allowing agents to execute kubectl commands and gather cluster state information.
- Slack Integration: ChatOps interface using Socket Mode for real-time user interaction, allowing DevOps teams to troubleshoot issues through natural language conversations.
- Monitoring & Observability: Prometheus and Grafana stack for cluster health monitoring, metrics collection, and alerting to Slack channels.
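The semantic search performed against the knowledge base can be illustrated with a small, self-contained sketch. It assumes cosine similarity as the distance measure and uses hand-written three-dimensional vectors in place of real Titan Embeddings output; the function names, case titles, and vectors are all hypothetical, and the actual S3 Vectors query path in this guidance may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k_cases(
    query: Sequence[float],
    cases: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Rank stored cases by similarity to the query embedding, best first."""
    scored = [(cosine_similarity(query, vector), title) for title, vector in cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical); real embeddings have far more dimensions.
KNOWN_CASES = [
    ("ImagePullBackOff: wrong ECR repository URI", [0.9, 0.1, 0.0]),
    ("CrashLoopBackOff: missing environment variable", [0.1, 0.9, 0.1]),
    ("Pending pod: insufficient node CPU", [0.0, 0.2, 0.9]),
]

if __name__ == "__main__":
    for score, title in top_k_cases([0.8, 0.2, 0.1], KNOWN_CASES):
        print(f"{score:.3f}  {title}")
```

The highest-scoring cases are the ones an agent would pass along as historical context for a new issue.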
Figure 2: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Troubleshooting Workflow
- Setup - The agentic troubleshooting agent is deployed into an Amazon EKS cluster and configured with the proper IAM permissions, Slack integration, and access to Amazon Bedrock and the S3 Vectors knowledge base.
- User Interaction - Users (DevOps engineers, SREs, developers) who encounter Kubernetes (K8s) issues send troubleshooting requests through a designated Slack channel integrated with the K8s Troubleshooting AI Agent. The agent's components run as containers on the EKS cluster, deployed via Helm charts that reference previously built images hosted in Amazon Elastic Container Registry (ECR).
- Message Reception & Slack Integration - The Orchestrator agent running in the EKS cluster opens a WebSocket connection to Slack (Socket Mode), enabling real-time bidirectional communication.
- Intelligent Message Classification & Orchestration - The Orchestrator agent receives the user's message and calls the Nova Micro model via the Amazon Bedrock API to determine whether the message requires K8s troubleshooting. If the issue is classified as K8s-related, the Orchestrator agent initiates a workflow by delegating tasks to specialized agents while maintaining the overall session context.
- Historical Knowledge Retrieval - The Orchestrator agent invokes the Memory agent, which connects to the Amazon S3 Vectors-based knowledge base to search for similar troubleshooting cases for precise issue classification.
- Semantic Vector Matching - The Memory agent invokes the Titan Embeddings model via the Amazon Bedrock API to generate semantic embeddings and perform vector similarity matching against the shared S3 Vectors knowledge base.
- Real-Time Cluster Intelligence - The Orchestrator agent invokes the K8s Specialist agent, which uses the hosted AWS EKS Model Context Protocol (MCP) Server to execute commands against the EKS API server. The MCP client gathers real-time cluster state, pod logs, events, and resource metrics to better "understand" the current problem context.
- Intelligent Issue Analysis - The K8s Specialist agent sends the collected cluster data to the Anthropic Claude model via Amazon Bedrock for intelligent issue analysis and resolution generation.
- Comprehensive Solution Synthesis - The Orchestrator agent synthesizes the historical context received from the Memory agent and the current cluster state from the K8s Specialist agent, then uses the Claude model via Amazon Bedrock to generate comprehensive troubleshooting recommendations, which are stored in S3 Vectors for future reference.
- ChatOps Integration - The Orchestrator agent sends the troubleshooting recommendations back to the users via the integrated Slack channel. This illustrates an increasingly popular "ChatOps" Platform Engineering pattern.
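The control flow of the steps above can be sketched in plain Python. This is only an illustration of the ordering (classify, recall history, inspect the cluster, synthesize, store), not the repository's Strands implementation: the `Orchestrator` class, the callable names, and the keyword-based classifier are hypothetical stand-ins for the Bedrock model calls, the Memory agent, and the K8s Specialist agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

K8S_KEYWORDS = ("pod", "node", "deployment", "kubectl", "crashloopbackoff")


@dataclass
class TroubleshootingResult:
    handled: bool
    historical_cases: List[str] = field(default_factory=list)
    cluster_findings: List[str] = field(default_factory=list)
    recommendation: str = ""


def keyword_classifier(message: str) -> bool:
    """Stand-in for the model-based classifier: crude keyword matching."""
    text = message.lower()
    return any(word in text for word in K8S_KEYWORDS)


class Orchestrator:
    """Delegates to injected callables, mirroring the workflow order above."""

    def __init__(
        self,
        classify: Callable[[str], bool],
        recall: Callable[[str], List[str]],
        inspect: Callable[[str], List[str]],
        synthesize: Callable[[str, List[str], List[str]], str],
        remember: Callable[[str, str], None],
    ) -> None:
        self.classify = classify
        self.recall = recall
        self.inspect = inspect
        self.synthesize = synthesize
        self.remember = remember

    def handle(self, message: str) -> TroubleshootingResult:
        if not self.classify(message):          # message classification
            return TroubleshootingResult(handled=False)
        history = self.recall(message)          # Memory agent lookup
        findings = self.inspect(message)        # K8s Specialist via MCP
        advice = self.synthesize(message, history, findings)
        self.remember(message, advice)          # store for future reference
        return TroubleshootingResult(True, history, findings, advice)


def build_demo_orchestrator() -> Tuple[Orchestrator, List[str]]:
    """Wire the orchestrator to in-memory stubs (hypothetical demo data)."""
    knowledge_base = ["CrashLoopBackOff caused by a missing ConfigMap key"]

    def recall(message: str) -> List[str]:
        return list(knowledge_base)

    def inspect(message: str) -> List[str]:
        return ["pod demo-app restarted 7 times", "container exited with code 1"]

    def synthesize(message: str, history: List[str], findings: List[str]) -> str:
        return f"Checked {len(findings)} cluster findings against {len(history)} past cases."

    def remember(message: str, advice: str) -> None:
        knowledge_base.append(f"{message} -> {advice}")

    return Orchestrator(keyword_classifier, recall, inspect, synthesize, remember), knowledge_base


if __name__ == "__main__":
    orchestrator, _ = build_demo_orchestrator()
    print(orchestrator.handle("My pod is stuck in CrashLoopBackOff").recommendation)
```

Injecting each stage as a callable keeps the ordering testable without Slack, Amazon Bedrock, or a live cluster.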
├── apps/ # Application code
│ ├── agentic-troubleshooting/ # Strands-based agentic troubleshooting agent
│ │ ├── src/agents/ # Strands agent implementations
│ │ ├── src/tools/ # EKS MCP tools integration
│ │ ├── helm/ # Kubernetes deployment charts
│ │ └── main.py # Strands agent entry point
├── terraform/ # Infrastructure as Code
│ ├── main.tf # Main EKS cluster configuration
│ ├── agentic.tf # Strands agent deployment resources
│ ├── modules/ # Terraform modules (not used in agentic deployment)
│ ├── variables.tf # Terraform variables
│ └── outputs.tf # Terraform outputs
├── static/ # Static assets
└── demo/ # Demo scripts and manifests
Before running this project, make sure you have the following tools installed:
- Terraform CLI
- AWS CLI
- Python 3.8+
- Docker (for agentic application deployment)
- Helm (for agentic application deployment)
- kubectl (for K8s CLI commands)
- Slack Webhook (Alert Manager notifications):
  - Create an incoming webhook in your Slack workspace
  - Note the webhook URL and target channel name
- Slack Bot Configuration:
  - Create a Slack app with the following Bot Token Scopes:
    - app_mentions:read - View messages mentioning the bot
    - channels:history - View messages in public channels
    - channels:read - View basic channel information
    - chat:write - Send messages as the bot
    - groups:history - View messages in private channels
    - groups:read - View basic private channel information
    - im:history - View direct messages
    - im:read - View basic DM information
  - Event Subscriptions (enable these events):
    - app_mention - Bot mentions
    - message.channels - Channel messages
    - message.groups - Private channel messages
    - message.im - Direct messages
  - Enable Socket Mode for real-time events
  - Note the Bot Token (xoxb-...), App Token (xapp-...), and Signing Secret
- Please see sample settings for the Slack application OAuth and scope permissions below:
Figure 3: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Sample app OAuth permissions
Figure 4: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Sample app OAuth Scopes
Figure 5: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Adding Sample app to Slack Channel
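Because the Bot Token and App Token are easy to swap by accident, a small pre-flight check can catch misconfiguration before deployment. The sketch below relies only on the token prefixes noted above (xoxb- and xapp-); the environment variable names and the `find_slack_config_problems` helper are hypothetical and are not taken from this repository.

```python
import os
from typing import List, Mapping

# Setting name -> expected prefix ("" means any non-empty value is accepted).
# The variable names are assumptions made for this example.
REQUIRED_SETTINGS = {
    "SLACK_BOT_TOKEN": "xoxb-",
    "SLACK_APP_TOKEN": "xapp-",
    "SLACK_SIGNING_SECRET": "",
}


def find_slack_config_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the values look sane."""
    problems = []
    for name, prefix in REQUIRED_SETTINGS.items():
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
        elif prefix and not value.startswith(prefix):
            problems.append(f"{name} should start with '{prefix}'")
    return problems


if __name__ == "__main__":
    issues = find_slack_config_problems(os.environ)
    for issue in issues:
        print(issue)
    if not issues:
        print("Slack settings look plausible")
```

The check validates format only; it does not confirm that the tokens are valid or that the scopes listed above were granted.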
| AWS Service | Role | Description |
|---|---|---|
| Amazon Elastic Kubernetes Service (EKS) | Core service | Manages the Kubernetes control plane and compute nodes for container orchestration. |
| Amazon Elastic Compute Cloud (EC2) | Core service | Provides the compute instances for EKS compute nodes and runs containerized applications. |
| Amazon Virtual Private Cloud (VPC) | Core service | Creates an isolated network environment with public and private subnets across multiple Availability Zones. |
| Amazon Simple Storage Service (S3) | Core service | Stores vector embeddings for the knowledge base, enabling semantic search of historical troubleshooting cases. |
| Amazon Bedrock | Core service | Provides the foundational AI models (Claude for analysis, Titan Embeddings for semantic search) for natural language processing and intelligent troubleshooting recommendations. |
| Amazon Elastic Container Registry (ECR) | Supporting service | Stores and manages Docker container images for the agentic troubleshooting agent. |
| Elastic Load Balancing (NLB) | Supporting service | Distributes incoming traffic across multiple targets in the EKS cluster. |
| Amazon Elastic Block Store (EBS) | Supporting service | Provides persistent block storage volumes for EC2 instances in the EKS cluster. |
| AWS Identity and Access Management (IAM) | Supporting service | Manages security permissions and access controls for the agentic agent, ensuring secure interaction with EKS clusters, Bedrock, and S3 Vectors. |
| AWS Key Management Service (KMS) | Security service | Manages encryption keys for securing data in EKS and other AWS services. |
You are responsible for the cost of the AWS services used while running this guidance. As of February 2026, the cost for running this guidance with the default settings in the US West (Oregon) Region is approximately $288.81/month.
We recommend creating a budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.
The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the us-west-2 (Oregon) Region for one month. This estimate is based on the AWS Pricing Calculator output for the agentic deployment. It does not account for heavy Amazon Bedrock usage beyond the baseline estimate.
| AWS service | Dimensions | Cost per month [USD] |
|---|---|---|
| Amazon EKS | 1 cluster | $73.00 |
| Amazon VPC | 1 NAT Gateway | $33.75 |
| Amazon EC2 | 3 m5.large instances | $156.16 |
| Amazon EBS | gp3 storage volumes and snapshots | $7.20 |
| Elastic Load Balancing | 1 NLB for workloads | $16.46 |
| Amazon VPC | Public IP addresses | $3.65 |
| AWS Key Management Service (KMS) | Keys and requests | $6.00 |
| Amazon Bedrock (Claude + Titan) | 1M input tokens, 100K output tokens | $30.00 |
| Amazon S3 | 10GB storage, 10K requests | $0.59 |
| TOTAL | | $288.81/month |
For a more accurate estimate based on your specific configuration and usage patterns, we recommend using the AWS Pricing Calculator.
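The table's arithmetic can be re-checked with a few lines of Python. The sketch below copies the per-service figures from the table above and adds them with `decimal.Decimal` to avoid floating-point rounding; the dictionary labels and the `monthly_total` helper are names chosen for this illustration.

```python
from decimal import Decimal
from typing import Mapping

# Per-service monthly figures, copied from the cost table above (USD).
MONTHLY_COSTS_USD = {
    "Amazon EKS (1 cluster)": Decimal("73.00"),
    "Amazon VPC (NAT Gateway)": Decimal("33.75"),
    "Amazon EC2 (3 m5.large)": Decimal("156.16"),
    "Amazon EBS (gp3 volumes, snapshots)": Decimal("7.20"),
    "Elastic Load Balancing (1 NLB)": Decimal("16.46"),
    "Amazon VPC (public IP addresses)": Decimal("3.65"),
    "AWS KMS (keys and requests)": Decimal("6.00"),
    "Amazon Bedrock (Claude + Titan)": Decimal("30.00"),
    "Amazon S3 (10GB, 10K requests)": Decimal("0.59"),
}


def monthly_total(costs: Mapping[str, Decimal]) -> Decimal:
    """Add the line items exactly, without binary floating-point error."""
    return sum(costs.values(), Decimal("0"))


if __name__ == "__main__":
    for service, cost in MONTHLY_COSTS_USD.items():
        print(f"{service:<40} ${cost}")
    print(f"{'Sum of line items':<40} ${monthly_total(MONTHLY_COSTS_USD)}")
```

With the figures as listed, the line items sum to $326.81, which is higher than the stated total of $288.81, so it is worth confirming current numbers in the AWS Pricing Calculator before setting a budget.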
When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, the virtualization layer, and the physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.
This guidance implements several security best practices and AWS services to enhance the security posture of your EKS Workload Ready Cluster. Here are the key security components and considerations:
- EKS Managed Node Groups: These use IAM roles with specific permissions required for nodes to join the cluster and for pods to access AWS services.
- EKS Pod Identity: The agentic troubleshooting agent uses EKS Pod Identity for secure, least-privilege access to AWS services including Bedrock, S3 Vectors, CloudWatch, and EKS APIs.
- Amazon VPC: The EKS cluster is deployed within a custom VPC with public and private subnets across multiple Availability Zones, providing network isolation.
- Security Groups: Although not explicitly shown in the diagram, security groups are typically used to control inbound and outbound traffic to EC2 instances and other resources within the VPC.
- NAT Gateways: Deployed in public subnets to allow outbound internet access for resources in private subnets while preventing inbound access from the internet.
- Amazon EBS Encryption: EBS volumes used by EC2 instances are typically encrypted to protect data at rest.
- AWS Key Management Service (KMS): Used for managing encryption keys for various services, including EBS volume encryption.
- Kubernetes RBAC: Role-Based Access Control is implemented within the EKS cluster to manage fine-grained access to Kubernetes resources.
- EKS Access Entries: The agentic agent uses EKS access entries with cluster admin policy for full cluster access required for troubleshooting operations.
- Kubernetes Secrets: Slack credentials (bot token, app token, signing secret) are stored as Kubernetes secrets and mounted into the agent pods.
- Regularly update and patch EKS clusters, compute nodes, and container images.
- Implement network policies to control pod-to-pod communication within the cluster.
- Use Pod Security Standards (enforced through Pod Security Admission) to enforce security best practices for pods; Pod Security Policies were removed in Kubernetes 1.25.
- Implement proper logging and auditing mechanisms for both AWS and Kubernetes resources.
- Regularly review and rotate IAM and Kubernetes RBAC permissions.
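For the Kubernetes Secrets item above, one common pattern is to mount a secret as a volume so that each key appears as a file, and to have the application read those files at startup. The sketch below assumes that volume-mount pattern; the key names and helper functions are hypothetical, and the agent in this guidance may consume its Slack credentials differently (for example, through environment variables).

```python
from pathlib import Path
from typing import Dict, Iterable


def load_mounted_secrets(mount_dir: Path, keys: Iterable[str]) -> Dict[str, str]:
    """Read one file per secret key from a mounted secret volume.

    Raises FileNotFoundError listing every missing key, so a misconfigured
    pod fails fast at startup instead of at the first Slack call.
    """
    secrets: Dict[str, str] = {}
    missing = []
    for key in keys:
        secret_file = mount_dir / key
        if secret_file.is_file():
            # Strip the trailing newline that editors and CLIs often add.
            secrets[key] = secret_file.read_text(encoding="utf-8").strip()
        else:
            missing.append(key)
    if missing:
        raise FileNotFoundError(
            f"missing secret files in {mount_dir}: {', '.join(missing)}"
        )
    return secrets


def redact(value: str, visible: int = 5) -> str:
    """Mask a secret for logging, keeping only a short prefix."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)
```

Logging only the redacted form, for example `redact(secrets["bot-token"])`, keeps full token values out of pod logs.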
Please see the detailed Implementation Guide for instructions on deploying, validating, troubleshooting, and uninstalling this guidance.
This guidance uses:
- Terraform AWS EKS Blueprints for infrastructure
- AWS Strands Agent Framework for multi-agent orchestration (Agentic deployment)
- EKS MCP Server for Kubernetes integration via Model Context Protocol (Agentic deployment)
- Amazon Bedrock model hosting for semantic vector matching and solution content validation
- Slack enterprise communications platform
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.




