Commit 2d76cc3

coryodaniel authored and chrisghill committed
Updating operator guide w/ design decisions and runbook
1 parent dc38b82 commit 2d76cc3

File tree

1 file changed: +95 −82 lines changed


operator.md

Lines changed: 95 additions & 82 deletions
@@ -1,119 +1,132 @@
-# aws-eks-cluster
-AWS EKS (Elastic Kubernetes Service) is Amazon's managed Kubernetes service, making it easy to deploy, operate, and scale containerized applications and providing benefits such as automatic scaling of worker nodes, automatic upgrades and patching, integration with other AWS services, and access to the Kubernetes community and ecosystem.
+## AWS EKS (Elastic Kubernetes Service)
 
-## Use Cases
-### Container orchestration
-Kubernetes is the most powerful container orchestrator, making it easy to deploy, scale, and manage containerized applications.
-### Microservices architecture
-Kubernetes can be used to build and manage microservices-based applications, allowing for flexibility and scalability in a distributed architecture.
-### Big Data and Machine Learning
-Kubernetes can be used to deploy and manage big data and machine learning workloads, providing scalability and flexibility for processing and analyzing large data sets.
-### Internet of Things (IoT)
-Kubernetes can be used to manage and orchestrate IoT applications, providing robust management and scaling capabilities for distributed IoT devices and gateways.
+AWS EKS is a managed Kubernetes service that makes it easy to run Kubernetes on AWS without needing to manage your own Kubernetes control plane. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
 
-## Configuration Presets
-### Development Cluster
-This preset creates a cluster with a single node group of cost-effective t3.medium instances.
-### Production Cluster
-This preset creates a cluster with a single node group of compute-optimized c5.2xlarge instances.
+### Design Decisions
 
-## Design
-EKS provides a "barebones" Kubernetes control plane, meaning that it only includes the essential components required to run a Kubernetes cluster. These components include the [Kubernetes API server](https://kubernetes.io/docs/concepts/overview/components/#kube-apiserver), [etcd](https://kubernetes.io/docs/concepts/overview/components/#etcd) (a distributed key-value store for storing Kubernetes cluster data), the [controller manager](https://kubernetes.io/docs/concepts/overview/components/#kube-controller-manager) and the [scheduler](https://kubernetes.io/docs/concepts/overview/components/#kube-scheduler).
+1. **IAM Roles and Policies**: Distinct IAM roles for the EKS cluster and node groups to ensure security and proper role-based access.
+2. **Logging and Monitoring**: EKS control plane logs are sent to CloudWatch for centralized logging and monitoring.
+3. **Add-ons**: Multiple AWS EKS add-ons are enabled, such as the EBS CSI driver, cluster autoscaler, and Prometheus observability.
+4. **Cert-Manager and External-DNS**: cert-manager is enabled for certificate management and External-DNS for automated DNS updates.
+5. **KMS Encryption**: AWS KMS is used to encrypt secrets within the EKS cluster.
+6. **Fargate Support**: A role for the Fargate profile is created conditionally when Fargate is enabled.
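The KMS encryption decision above can be verified against a running cluster. A minimal sketch, assuming the AWS CLI is configured; `your-cluster-name` is a placeholder and `has_secrets_encryption` is a made-up helper that inspects the cluster's `encryptionConfig` resources:

```sh
# Hypothetical helper: check whether "secrets" appears in the cluster's
# encryptionConfig resources list (the output of the aws CLI query below).
has_secrets_encryption() {
  case "$1" in
    *secrets*) echo "secrets encryption: enabled" ;;
    *)         echo "secrets encryption: NOT enabled" ;;
  esac
}

# Usage against a live cluster:
#   has_secrets_encryption "$(aws eks describe-cluster --name your-cluster-name \
#     --query 'cluster.encryptionConfig[].resources[]' --output text)"
has_secrets_encryption "secrets"
```

If the check reports NOT enabled, secrets written before encryption was configured remain unencrypted until rewritten.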

-In order to simplify deploying and operating a Kubernetes cluster, this bundle includes numerous optional addons to deliver a fully capable and feature-rich cluster that's ready for production workloads. Some of these addons are listed below.
+### Runbook
 
-### Cluster Autoscaler
-A [cluster autoscaler](https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html#cluster-autoscaler) is installed into every EKS cluster to automatically scale the number of nodes in the cluster based on current resource usage. This provides numerous benefits such as cost efficiency, higher availability, and better resource utilization.
-### NGINX Ingress Controller
-Users can optionally install the ["official" Kubernetes NGINX ingress controller](https://kubernetes.github.io/ingress-nginx/) (not to be confused with [NGINX's own ingress controller](https://docs.nginx.com/nginx-ingress-controller/) based on the paid NGINX Plus) into their cluster, which allows workloads in your EKS cluster to be accessible from the internet.
-### External-DNS and Cert-Manager
-If users associate one or more Route53 domains with their EKS cluster, this bundle will automatically install [external-dns](https://github.com/kubernetes-sigs/external-dns) and [cert-manager](https://cert-manager.io/docs/) in the cluster, allowing the cluster to automatically create and manage DNS records and TLS certificates for internet-accessible workloads.
-### EBS CSI Driver
-[Beginning in Kubernetes version 1.23, EKS no longer comes with the default EBS provisioner](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.23). In order to allow users to continue using the default `gp2` storage class, this bundle includes the [EBS CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html), which replaces the deprecated EBS provisioner.
-### EFS CSI Driver
-Optionally, users can also install the [EFS CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html), which allows the EKS cluster to attach EFS volumes to cluster workloads for persistent storage. EFS volumes offer some benefits over EBS volumes, such as [allowing multiple pods to use the volume simultaneously (ReadWriteMany)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) and not being locked to a single AWS availability zone, but these benefits come with higher storage costs and increased latency.
+#### Troubleshooting EKS Cluster Connectivity Issues
 
-### Fargate
+The Kubernetes cluster endpoint might be unreachable. Verify connectivity and authentication.
 
-Fargate can be enabled to allow AWS to provide on-demand, right-sized compute capacity for running containers on EKS without managing node pools or clusters of EC2 instances.
+**Check Cluster Endpoint**
 
-For workloads that require high uptime, it's recommended to keep some node pools populated even when enabling Fargate to ensure compute is always available during surges.
+```sh
+aws eks describe-cluster --name your-cluster-name --query "cluster.endpoint"
+```
+Ensure the endpoint is reachable from your network.
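One way to act on the endpoint check: probe it with `curl` and interpret the HTTP status. This is a sketch; an unauthenticated request to a healthy EKS endpoint typically returns 401 or 403, and `classify_endpoint_status` is a hypothetical helper:

```sh
# Hypothetical helper: map the HTTP status of an unauthenticated probe to a verdict.
# 200/401/403 all mean the endpoint is network-reachable; 000 means curl never connected.
classify_endpoint_status() {
  case "$1" in
    200|401|403) echo "reachable" ;;
    000)         echo "unreachable: no route or TLS/connect failure" ;;
    *)           echo "unexpected HTTP status: $1" ;;
  esac
}

# Usage against a live cluster (requires network access to the endpoint):
#   endpoint=$(aws eks describe-cluster --name your-cluster-name \
#     --query "cluster.endpoint" --output text)
#   classify_endpoint_status "$(curl -sk -o /dev/null -w '%{http_code}' "$endpoint/healthz" || echo 000)"
classify_endpoint_status 403
```

A 000 result usually points at VPC routing, security groups, or a private-only endpoint rather than authentication.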
+
+**Check Authentication Token**
+
+```sh
+aws eks get-token --cluster-name your-cluster-name
+```
+Verify that the token is generated without errors.
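`get-token` prints an ExecCredential JSON object that includes an expiration timestamp; a stale clock or cached credentials often show up here. A sketch for pulling that field out without `jq` (the `token_expiry` helper is made up):

```sh
# Hypothetical helper: extract status.expirationTimestamp from the
# ExecCredential JSON that `aws eks get-token` writes to stdout.
token_expiry() {
  sed -n 's/.*"expirationTimestamp": *"\([^"]*\)".*/\1/p'
}

# Usage: aws eks get-token --cluster-name your-cluster-name | token_expiry
echo '{"status": {"expirationTimestamp": "2024-06-01T00:14:00Z"}}' | token_expiry
```

An empty result means the command failed or the output shape changed; a timestamp in the past means your client is reusing an expired credential.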
+
+**Kubernetes API Server Logs**
+
+On EKS the API server runs on the AWS-managed control plane, so its logs are not available as pods in `kube-system`. With control plane logging enabled, they are delivered to CloudWatch:
+
+```sh
+aws logs tail /aws/eks/your-cluster-name/cluster --since 1h
+```
+These log streams include the API server output for debugging potential issues.
+
+#### Certificate Issues with Cert-Manager
+
+Cert-Manager might fail to issue certificates due to misconfigurations or API rate limits.
+
+**Check Cert-Manager Logs**
 

-Fargate has many [limitations](https://docs.aws.amazon.com/eks/latest/userguide/fargate.html).
+```sh
+kubectl logs -n md-core-services -l app=cert-manager
+```
+Look for errors indicating why certificates might be failing.
 
-Currently only `namespace` selectors are implemented. If you need `label` selectors, please file an [issue](https://github.com/massdriver-cloud/aws-eks-cluster/issues).
+**Validate ClusterIssuer**
 
-## Best Practices
-### Managed Node Groups
-Worker nodes in the cluster are provisioned as [managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html).
-### Secure Networking
-The cluster is designed according to [AWS's EKS networking best practices](https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html), including deploying nodes in private subnets and only deploying public load balancers into public subnets.
-### Cluster Autoscaler
-A cluster autoscaler is automatically installed to provide node autoscaling as workload demand increases.
-### OpenID Connect (OIDC) Provider
-The cluster is pre-configured for out-of-the-box support of [IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html).
+```sh
+kubectl describe clusterissuer letsencrypt-prod
+```
+Ensure that the ClusterIssuer configuration is correct and the ACME server is reachable.
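To turn the `describe` output into a pass/fail signal, read the issuer's `Ready` condition directly. A sketch, assuming the `letsencrypt-prod` issuer name used above; `issuer_verdict` is a hypothetical helper:

```sh
# Hypothetical helper: interpret the ClusterIssuer Ready condition status
# ("True", "False", or empty if the condition is missing).
issuer_verdict() {
  case "$1" in
    True)  echo "ClusterIssuer is ready" ;;
    False) echo "ClusterIssuer is NOT ready: check ACME registration and solvers" ;;
    *)     echo "Ready condition missing: is cert-manager running?" ;;
  esac
}

# Usage against a live cluster:
#   status=$(kubectl get clusterissuer letsencrypt-prod \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
#   issuer_verdict "$status"
issuer_verdict True
```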
 
+#### DNS Resolution Problems with External-DNS
 
-## Security
-### Nodes Deployed into Private Subnets
-Worker nodes are provisioned into private subnets for security.
-### IAM Roles for Service Accounts
-IRSA allows Kubernetes pods to assume AWS IAM roles, removing the need for static credentials to access AWS services.
-### Secret Encryption
-An AWS KMS key is created and associated with the cluster to enable [encryption of secrets](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/) at rest.
-### IMDSv2 Required on Node Groups
-The [Instance Metadata Service version 2 (IMDSv2)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html) is required on all EKS node groups. IMDSv1, which was a factor in the [2019 Capital One data breach](https://divvycloud.com/capital-one-data-breach-anniversary/), is disabled on all node groups.
+DNS records might fail to update in Route 53.
 
-## Connecting
-After you have deployed a Kubernetes cluster through Massdriver, you may want to interact with the cluster using the powerful [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool.
+**Check External-DNS Logs**
 
-### Install Kubectl
+```sh
+kubectl logs -n md-core-services -l app=external-dns
+```
+Identify any error messages related to DNS updates or API limits.
 
-You will first need to install `kubectl` to interact with the Kubernetes cluster. Installation instructions for Windows, Mac, and Linux can be found [here](https://kubernetes.io/docs/tasks/tools/#kubectl).
+**Verify Route 53 Hosted Zones**
 
-Note: While `kubectl` generally has forward and backward compatibility of core capabilities, it is best if your `kubectl` client version matches your Kubernetes cluster version. This ensures the best stability and compatibility for your client.
+```sh
+aws route53 list-hosted-zones
+```
+Ensure that the hosted zones' IDs and names match your Route 53 configuration.
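When comparing zone IDs from the command above with values passed to External-DNS, note that `list-hosted-zones` returns `Id` values prefixed with `/hostedzone/`, while most other tools expect the bare zone ID. A small normalization sketch (`bare_zone_id` is a made-up helper; the zone ID shown is fictional):

```sh
# Hypothetical helper: Route 53 returns Id values like "/hostedzone/Z0123456789ABC";
# strip the prefix to get the bare zone ID.
bare_zone_id() {
  printf '%s\n' "$1" | sed 's|^/hostedzone/||'
}

# Usage:
#   aws route53 list-hosted-zones --query "HostedZones[].Id" --output text
bare_zone_id "/hostedzone/Z0123456789ABC"
```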
 
+#### EBS CSI Driver Storage Issues
 
-The standard way to manage connection and authentication details for Kubernetes clusters is through a configuration file called a [`kubeconfig`](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) file.
+Persistent Volumes may fail to provision or attach to nodes.
 
-### Download the Kubeconfig File
+**Check EBS CSI Driver Logs**
 
-The `kubernetes-cluster` artifact that is created when you make a Kubernetes cluster in Massdriver contains the basic information needed to create a `kubeconfig` file. Because of this, Massdriver makes it very easy for you to download a `kubeconfig` file that will allow you to use `kubectl` to query and administer your cluster.
+```sh
+kubectl logs -n kube-system -l app=ebs-csi-controller
+```
+Review logs to identify issues with volume provisioning or attachment.
 
-To download a `kubeconfig` file for your cluster, navigate to the project and target where the Kubernetes cluster is deployed and hover over the artifact connection port. This will pop up a window that allows you to download the artifact as raw JSON or as a `kubeconfig` YAML. Select "Kube Config" from the drop-down and click the button. This will download the `kubeconfig` for the Kubernetes cluster to your local system.
+**Manually Describe a Volume**
 
-![Download Kubeconfig](https://github.com/massdriver-cloud/aws-eks-cluster/blob/main/images/kubeconfig-download.gif?raw=true)
+```sh
+aws ec2 describe-volumes --volume-ids vol-xxxxxxx
+```
+Verify the status and details of the problematic volume directly.
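`describe-volumes` reports a `State` of `creating`, `available`, `in-use`, `deleting`, `deleted`, or `error`. A sketch mapping that state to a runbook verdict; `vol-xxxxxxx` remains a placeholder and `volume_verdict` is a hypothetical helper:

```sh
# Hypothetical helper: interpret the EBS volume State field.
volume_verdict() {
  case "$1" in
    in-use)    echo "attached: investigate the pod/node side instead" ;;
    available) echo "exists but unattached: CSI attach may be failing" ;;
    error)     echo "volume is in an error state: check EBS events" ;;
    *)         echo "transitional or unknown state: $1" ;;
  esac
}

# Usage:
#   state=$(aws ec2 describe-volumes --volume-ids vol-xxxxxxx \
#     --query "Volumes[0].State" --output text)
#   volume_verdict "$state"
volume_verdict available
```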
 
-### Use the Kubeconfig File
+#### Pod Scheduling Problems (Cluster Autoscaler)
 
-Once the `kubeconfig` file is downloaded, you can move it to your desired location. By default, `kubectl` looks for a file named `config` in the `$HOME/.kube` directory. If you would like this to be your default configuration, rename and move the file to `$HOME/.kube/config`.
+Pods might remain in a "Pending" state due to a lack of resources or other scheduling issues.
 
-A single `kubeconfig` file can hold multiple cluster configurations, and you can select your desired cluster through the use of [`contexts`](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/#context). Alternatively, you can have multiple `kubeconfig` files and select your desired file through the `KUBECONFIG` environment variable or the `--kubeconfig` flag in `kubectl`.
+**Check Cluster Autoscaler Logs**
 
-Once you've configured your environment properly, you should be able to run `kubectl` commands. Here are some commands to try:
+```sh
+kubectl logs -n kube-system -l app=cluster-autoscaler
+```
+Look for reasons why the autoscaler might not be scaling up nodes.
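The autoscaler logs are verbose; its scale-up decisions (including `NotTriggerScaleUp`, the event reason it emits when no node group fits a pending pod) can be filtered out. A sketch; `scaleup_events` is a made-up helper:

```sh
# Hypothetical filter: surface scale-up decision lines from an autoscaler log dump.
scaleup_events() {
  grep -Ei 'NotTriggerScaleUp|scale[-_ ]up' || echo "no scale-up activity found"
}

# Usage: kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | scaleup_events
printf 'I0101 NotTriggerScaleUp pod default/db-0: no fitting node group\n' | scaleup_events
```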
 
-```bash
-# get a list of all pods in the current namespace
-kubectl get pods
+**Verify Node Resources**
 
-# get a list of all pods in the kube-system namespace
-kubectl get pods --namespace kube-system
+```sh
+kubectl describe node <node-name>
+```
+Check node capacity and allocations to identify resource issues.
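`kubectl describe node` already prints `Allocated resources` as percentages; if you instead work from raw numbers (for example, millicores requested versus allocatable), the arithmetic is a one-liner. A sketch with a hypothetical helper name:

```sh
# Hypothetical helper: percentage of allocatable capacity already requested.
# Args: requested amount, allocatable amount (same unit, e.g. CPU millicores).
pct_requested() {
  echo $(( $1 * 100 / $2 ))
}

# Example: 3500m CPU requested on a node with 4000m allocatable.
pct_requested 3500 4000
```

Values near 100 explain why new pods stay Pending even though the node looks "up".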
 
-# get a list of all the namespaces
-kubectl get namespaces
+#### Metrics and Monitoring Issues
 
-# view the logs of a running pod in the default namespace
-kubectl logs <pod name> --namespace default
+Problems with collecting or visualizing metrics using Prometheus and Grafana.
 
-# describe the status of a deployment in the foo namespace
-kubectl describe deployment <deployment name> --namespace foo
+**Check Prometheus Operator Logs**
 
-# get a list of all the resources the kubernetes cluster can manage
-kubectl api-resources
+```sh
+kubectl logs -n md-observability -l app.kubernetes.io/name=prometheus-operator
 ```
+Identify potential issues with Prometheus scraping or alerting configurations.
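Operator logs mix info and error lines; filtering to error level is usually the fastest first pass. A sketch that matches both `level=error` and JSON-style log formats (`operator_errors` is a made-up helper):

```sh
# Hypothetical filter: keep only error-level lines from operator logs.
operator_errors() {
  grep -Ei 'level=error|"level":"error"' || echo "no error-level log lines"
}

# Usage: kubectl logs -n md-observability \
#   -l app.kubernetes.io/name=prometheus-operator | operator_errors
printf 'level=info msg=ok\nlevel=error msg="sync failed"\n' | operator_errors
```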
 
-## AWS Access
+**Access Grafana UI**
+
+```sh
+kubectl port-forward svc/grafana -n md-observability 3000:3000
+```
+Verify that Grafana is accessible and that dashboards display the expected metrics.
 
-If you would like to manage access to your EKS cluster through AWS IAM principals, you can do so via the `aws-auth` ConfigMap. This allows the desired AWS IAM principals to view cluster status in the AWS console, as well as generate short-lived credentials for `kubectl` access. Refer to the [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html) for more details.
+By utilizing these runbook commands and tools, you can troubleshoot and manage your AWS EKS resources effectively.
 
-**Note**: In order to connect to the EKS cluster to view or modify the `aws-auth` ConfigMap, you'll need to download the `kubeconfig` file and use `kubectl` as discussed earlier.
