feat(eks): add eks cluster builder #1259

Open
wants to merge 1 commit into
base: main

jijiechen
Member

@jijiechen jijiechen commented Jan 16, 2025

closes #285

Features

This PR provides the following features:

  • a cluster Builder for creating EKS clusters on AWS
  • a Cluster type that represents the created cluster and can clean up the cluster and all underlying resources created for it
  • a NewFromExisting helper function to retrieve the kubeconfig of an existing cluster

The features above were tested manually, and a CI run can be found here:
https://github.com/kumahq/kuma-smoke/actions/runs/12785877388/job/35641958835

Implementation

Unlike eksctl, which creates the cluster using AWS CloudFormation, this PR creates the underlying AWS resources directly using the Go SDK provided by AWS.

  • The entrypoint for creating a new cluster is the function eks.aws_operations.CreateEKSClusterAll

  • The entrypoint for cleaning up a cluster is the function eks.aws_operations.DeleteEKSClusterAll

The eksctl logic for generating the userdata that bootstraps the nodes is reused.

Limitations

  1. only IPv4 networking is supported for now (this will be resolved by a new feature plan)
  2. nodes created for now are not "managed" nodes; they are backed by launch templates (this will be resolved by a new feature plan)
  3. no load balancer integration has been tested (this will be resolved by a new feature/test plan)
  4. only tested with a federated user; IAM users with programmatic access have not been tested yet (technically, it should work out of the box because we use the official SDK to read credentials, and I will test as soon as I get a user with programmatic access)
  5. there is no retry mechanism around AWS API calls, so a transient API failure can leave cluster creation or cleanup incomplete
  6. cluster cleanup takes longer than in the GKE implementation, because AWS does not provide a "fire-and-forget" API for cleaning up all the resources involved
  7. due to AWS platform characteristics, a cluster creation attempt takes ~15 minutes and a deletion takes ~10 minutes
  8. due to an AWS EKS limitation, the cluster client kubeconfig has a maximum validity of 900s (15 minutes), so for long-running tests we need to re-export the client config periodically to refresh its validity

Usage

	t.Log("configuring EKS cloud environment for tests")
	require.NotEmpty(t, EKSAccessKeyId, "%s not set", eks.EnvAccessKeyId)
	require.NotEmpty(t, EKSAccessKey, "%s not set", eks.EnvAccessKey)
	require.NotEmpty(t, EKSRegion, "%s not set", eks.EnvRegion)

	t.Logf("configuring the EKS cluster KEY_ID=(%s) REGION=(%s)", EKSAccessKeyId, EKSRegion)
	builder := eks.NewBuilder()
	builder.WithClusterVersion(fmt.Sprintf("%d.%d.0", EKSVersionMajor, EKSVersionMinor))

	t.Logf("building cluster %s (this can take some time)", builder.Name)
	cluster, err := builder.Build(ctx)
	require.NoError(t, err)

	t.Logf("setting up cleanup for cluster %s", cluster.Name())
	t.Cleanup(func() {
		t.Logf("running cluster cleanup for %s", cluster.Name())
		// don't use test ctx as it may be cancelled already
		assert.NoError(t, cluster.Cleanup(context.Background()))
	})

	t.Log("verifying that the cluster can be communicated with")
	version, err := cluster.Client().ServerVersion()
	require.NoError(t, err)
	t.Logf("server version found: %s", version)

I'll provide an integration test soon in a new PR.

Signed-off-by: Jay Chen <1180092+jijiechen@users.noreply.github.com>
@jijiechen jijiechen requested a review from a team as a code owner January 16, 2025 03:24
TagNameCreateBy = "ktf_created_by"
)

func CreateEKSClusterAll(ctx context.Context, cfg aws.Config, clusterName,
Contributor

This looks like a function that creates a cluster and all of its dependencies. Could we add a comment saying so?

)

func CreateEKSClusterAll(ctx context.Context, cfg aws.Config, clusterName,
k8sMinorVersion, nodeMachineType string, tags map[string]string) error {
Contributor

Could we group parameters of the same type on the same line, to make the parameter types easier to see?

return errors.Wrapf(err, "failed to get availability zones in region %s", cfg.Region)
}

vpcId, subnetIDs, err := createVPC(ctx, ec2Client, subnetAvZones)
Contributor

Could this leave dangling resources if an error happens after the VPC is created? I think we should clean them up if possible, or add a comment noting that resources may be left behind and need manual cleanup.

return nil
}

func DeleteEKSClusterAll(ctx context.Context, cfg aws.Config, clusterName string) error {
Contributor

Ditto here. I think we should add a comment listing which resources are deleted.

func waitForClusterActive(ctx context.Context, eksClient *eks.Client, clusterName string) (*types.Cluster, error) {
childCtx, cancel := context.WithTimeout(ctx, 10*time.Minute)
defer cancel()
ticker := time.NewTicker(10 * time.Second)
Contributor

Is a 10-second tick too long? Or should we make the wait timeout and tick interval configurable?

}

func waitForNodeGroupReady(ctx context.Context, eksClient *eks.Client, clusterName, nodeGroupName string) error {
childCtx, cancel := context.WithTimeout(ctx, 10*time.Minute)
Contributor

Ditto here for the wait timeout and tick interval.

Successfully merging this pull request may close these issues.

AWS EKS cluster support
2 participants