
EKS v3 Release #1453

Merged
merged 39 commits into from
Oct 17, 2024
Conversation

flostadler
Contributor

This change includes all the changes from the release branch: https://github.com/pulumi/pulumi-eks/tree/release-3.x.x rebased onto the current master branch.

All changes in this change set are approved PRs that were merged into the release branch.

flostadler and others added 30 commits October 17, 2024 16:09
This adds the necessary handling for `nodeadm` user data. This is used
for AL2023.

`nodeadm` is a tool used for bootstrapping Kubernetes nodes. Its
configuration interface is YAML-based and can be set via user data. The
user data needs to be in MIME `multipart/mixed` format. This allows
interleaving the nodeadm configuration with scripts or other user data
entries. See more here:
https://awslabs.github.io/amazon-eks-ami/nodeadm/.
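As an illustration, such a user data document could be assembled like this (a minimal sketch; the boundary string, helper name, and cluster name are illustrative, not what the provider actually emits):

```typescript
// Hypothetical sketch: assemble a MIME multipart/mixed user data document
// that interleaves a nodeadm NodeConfig with a shell script.
interface UserDataPart {
    contentType: string;
    content: string;
}

function buildMimeUserData(parts: UserDataPart[], boundary = "BOUNDARY"): string {
    const header = [
        "MIME-Version: 1.0",
        `Content-Type: multipart/mixed; boundary="${boundary}"`,
        "",
    ].join("\n");
    const body = parts
        .map((p) => [`--${boundary}`, `Content-Type: ${p.contentType}`, "", p.content].join("\n"))
        .join("\n");
    return `${header}\n${body}\n--${boundary}--\n`;
}

const userData = buildMimeUserData([
    {
        contentType: "application/node.eks.aws",
        content: [
            "apiVersion: node.eks.aws/v1alpha1",
            "kind: NodeConfig",
            "spec:",
            "  cluster:",
            "    name: my-cluster",
        ].join("\n"),
    },
    { contentType: "text/x-shellscript", content: "#!/bin/bash\necho hello" },
]);
console.log(userData);
```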
### Proposed changes

This change adds the necessary options for configuring settings for
Bottlerocket Operating Systems.
Bottlerocket configuration is driven by a user data script in TOML
format.

The provider will set the base configuration that's necessary for nodes
to successfully register with the kubernetes cluster. Users will have
the ability to add additional settings or override the base
configuration by using the `bottlerocketSettings` parameter. An overview
of the settings can be seen here:
https://bottlerocket.dev/en/os/1.20.x/api/settings/.
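For a rough idea of the shape of that user data, a hand-rolled TOML emitter for nested settings tables might look like the sketch below (illustrative only; the provider uses a proper TOML library, and the helper name is hypothetical):

```typescript
// Sketch: render a nested settings object as TOML tables.
// Handles nested objects and scalar values only, no arrays of tables.
function toToml(obj: Record<string, any>, prefix = ""): string {
    let scalars = "";
    let tables = "";
    for (const [key, val] of Object.entries(obj)) {
        if (val !== null && typeof val === "object" && !Array.isArray(val)) {
            const path = prefix ? `${prefix}.${key}` : key;
            tables += `[${path}]\n` + toToml(val, path);
        } else {
            scalars += `${key} = ${JSON.stringify(val)}\n`;
        }
    }
    return scalars + tables;
}

console.log(toToml({ settings: { kubernetes: { "cluster-name": "my-cluster" } } }));
// [settings]
// [settings.kubernetes]
// cluster-name = "my-cluster"
```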

### New dependencies
This adds two dependencies to the provider.
1. `@iarna/toml`: For converting the configuration into valid TOML.
This allows us to expose `bottlerocketSettings` as an object instead of
a string so users do not need to worry about TOML formatting and
serialization.
2. `ipaddr.js`: For calculating the `cluster-dns-ip`. We could write
our own IP parser, and it would not be overly complex for IPv4, but
IPv6 is trickier with shortened formats. That is unnecessary complexity
for the provider.

Both of them are maintained libraries with 0 other dependencies.
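For intuition, an IPv4-only version of the `cluster-dns-ip` calculation might look like the sketch below (assuming the DNS service IP is the tenth address of the service CIDR, which matches the EKS defaults; `ipaddr.js` is what makes IPv6 and shortened forms tractable):

```typescript
// Illustrative IPv4-only sketch of a cluster-dns-ip calculation.
// Assumption: the DNS service IP is the tenth address of the service CIDR.
function clusterDnsIpV4(serviceCidr: string): string {
    const [base, prefixStr] = serviceCidr.split("/");
    const prefix = parseInt(prefixStr, 10);
    const octets = base.split(".").map((o) => parseInt(o, 10));
    if (octets.length !== 4 || octets.some((o) => isNaN(o) || o < 0 || o > 255) || isNaN(prefix)) {
        throw new Error(`invalid IPv4 CIDR: ${serviceCidr}`);
    }
    // Pack the address into 32 bits, mask to the network address, then add 10.
    const addr = ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
    const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
    const dns = ((addr & mask) >>> 0) + 10;
    return [24, 16, 8, 0].map((s) => (dns >>> s) & 0xff).join(".");
}

console.log(clusterDnsIpV4("10.100.0.0/16")); // "10.100.0.10"
```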

### Remarks
Bottlerocket will only be supported with `NodeGroupV2` and
`ManagedNodeGroup` components because the older `NodeGroup` (aka
`NodeGroupV1`) uses CloudFormation under the hood and nodes need to
signal that they're ready. Bottlerocket can't execute scripts as part of
the boot up because it doesn't have a shell, so this is not supported.
#1337)

The provider wrongly assumed that an AMI only has a single block device,
but Bottlerocket has two: the root device stores the OS itself, and the
other is for data like images, logs, and persistent storage. We need to
allow users to configure the block device for data.

With this change, the provider chooses which block device to modify
depending on the OS. If the OS is Bottlerocket, the data device gets
modified.
This also adds E2E tests verifying that the node storage capacity
correctly reflects user settings.
The `ManagedNodeGroup` component was missing configuration options that
the other node groups had: namely `amiId`, `gpu` and `userData`.

Those will allow booting specific/custom AMIs, nodes with GPUs or
setting custom user data.
The added E2E tests ensure this works as expected.

Relates to #1224
…1340)

This change adds a new input property called `nodeadmExtraConfig` to the
node group components. This property will allow injecting additional
nodeadm sections into the user data.
This can be virtually anything. Some data, a shell script, or additional
nodeadm
[`NodeConfig`](https://awslabs.github.io/amazon-eks-ami/nodeadm/).

The nodeadm user data is a MIME multipart/mixed document, and every
section has string-based `content` and a MIME multipart `contentType`.

Right now there's no straightforward way to generate types for the
nodeadm `NodeConfig` because it's not schematized. Work for enhancing
this is tracked here: #1341.

### Proposed changes


This PR switches the `coredns` and `kube-proxy` addons from self-managed
to managed. By default the latest compatible version will be used.

This also introduces two new top level arguments to `ClusterOptions` for
configuring these new addons.

- `corednsAddonOptions`
- `kubeProxyAddonOptions`

BREAKING CHANGE: creating an `eks.Cluster` will now also create the
`coredns` and `kube-proxy` addons. If you are currently already managing
these you will need to disable the creation of these through the new
arguments `ClusterOptions.corednsAddonOptions.enabled = false` and
`ClusterOptions.kubeProxyAddonOptions.enabled = false`
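The opt-out semantics can be sketched like this (a minimal illustration with hypothetical names, not the provider's internals): the addons are created by default, and only an explicit `enabled: false` skips them.

```typescript
// Hypothetical sketch of the default-on addon behavior described above.
interface AddonOptions {
    enabled?: boolean;
    version?: string;
}

// The addon is created unless its options explicitly disable it.
function shouldCreateAddon(opts?: AddonOptions): boolean {
    return opts?.enabled !== false;
}

console.log(shouldCreateAddon(undefined)); // true
console.log(shouldCreateAddon({ enabled: false })); // false
```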

### Related issues (optional)

closes #1261, closes #1254

### Proposed changes

The coredns managed addon can only be deployed on clusters with default
node groups (which includes Fargate clusters).


### Related issues (optional)

Now that the EKS addons are added we need to align them and do some
cleanup. This involves:
- adding the enums introduced in
#1357 to the VPC CNI
- exposing `configurationValues` for coredns and kube-proxy
- removing kubectl from the provider
- deeply sorting addon configuration keys to guarantee stable JSON
serialization
- removing deepmerge again; it caused issues during unit tests
(voodoocreation/ts-deepmerge#22) and when used
on outputs.
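The deep key sorting can be sketched as follows (an illustrative helper, not the provider's actual code): recursively rebuild objects with their keys in sorted order so that `JSON.stringify` produces deterministic output.

```typescript
// Sketch: recursively sort object keys so JSON serialization is stable
// regardless of the insertion order of the input.
function deepSortKeys(value: any): any {
    if (Array.isArray(value)) {
        return value.map(deepSortKeys);
    }
    if (value !== null && typeof value === "object") {
        return Object.keys(value)
            .sort()
            .reduce((acc: any, key) => {
                acc[key] = deepSortKeys(value[key]);
                return acc;
            }, {});
    }
    return value;
}

const config = deepSortKeys({ env: { B: "2", A: "1" }, enableNetworkPolicy: "true" });
console.log(JSON.stringify(config)); // {"enableNetworkPolicy":"true","env":{"A":"1","B":"2"}}
```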

Additionally I discovered and fixed an old bug that luckily never
surfaced. The VPC CNI configuration incorrectly handled outputs and
called `toString` on them in a couple of places. The increased type
safety and tests around addon configuration uncovered this.

Closes #1369
AWS deprecated AL2, and it will be EOL'ed in June 2025. This change
marks the AL2-related AMI types as deprecated so users are aware of this
deprecation.

The type `AmiTypes` is not released yet, so this is not a user-facing
change.

As a follow up task we want to publish a migration guide:
pulumi/home#3626

Closes #1351
The `NodeGroup` component uses the deprecated AWS Launch Configuration
([see](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-configurations.html))

This marks the legacy CloudFormation based self-managed NodeGroup (also
referred to as NodeGroupV1) as deprecated. The Pulumi native NodeGroupV2
is functionally equivalent (same inputs) but doesn't suffer from
problems like
[pulumi-eks#535](#535). Users
will need to replace their self managed node groups anyway to migrate
away from AL2 in a safe way (see [What does a node group update look
like for
users?](https://docs.google.com/document/d/1XyLq_EyAziCp3f6rQ_8qfcUk0RMl8AgUq1mqpqsQcHM/edit#bookmark=id.l6qozya46ay3)).

This also switches the default node group of the cluster component to
use NodeGroupV2 instead.

Closes #1353 
Closes #1352
When using [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler)
for automatically scaling node groups based on cluster requirements, Pulumi permanently shows a diff.

The problem is that the cluster autoscaler takes control of the
desiredSize of the scalingConfig, putting it out of sync with Pulumi
state.

MLCs don't support ignoring changes on their child resources, so we're
adding a new input that allows you to selectively ignore scaling
changes.
This is not added to the deprecated `NodeGroup` (aka `NodeGroupV1`)
because the desired size is part of the CloudFormation stack that is
deployed as part of that component. We cannot selectively ignore changes
to the CloudFormation template because it's a string.

Closes #985 #1293
When configuring Managed Node Groups without a version, they default to
using the current cluster version at deploy time. Changes to the
cluster's version do not propagate to the node group in this case.
EKS managed node groups only support one minor version of skew between
the control plane and data plane; otherwise cluster upgrades will fail.

This change makes the ManagedNodeGroups track the cluster version unless
a fixed version is provided by users.

Fixes #1253
Driven by the deprecation of AL2 by AWS, we need to ensure users are
deploying node groups with maintained and secure operating systems by
default.
This change adds a `RECOMMENDED` OS enum that points to AL2023 (AWS
default) and uses it as the default for node groups.

The upgrade tests are expected to fail as we're changing defaults. To
re-record we need to first release a new baseline version (e.g. alpha
release)

Closes #1354
The provider was missing a pre-release workflow for publishing
alpha/beta versions of the provider.
This means that it was always deploying docs changes even for
pre-releases.

This changes that by conditionally skipping the docs publishing step for
prereleases.
The release branch for EKS v3 now contains breaking changes, which
causes the upgrade tests to fail. Because of that we disabled the
upgrade tests. Once the first alpha version is released we can re-record
and re-enable them.
The current default, t2.medium, is an instance type from 2014 that is
becoming less common in AWS data centers. This means users will
encounter more errors when deploying clusters with the provider when
using the default instance type.

This mostly affects beginner users, as more experienced users typically
do not rely on the default instance types and instead configure
appropriate types for their workloads.

This change replaces the default t2.medium instances with t3.medium.
These newer instances offer better performance and are marginally
cheaper ($0.0416 vs. $0.0464 per hour).
It seems that `env` cannot be used for controlling whether jobs run.
This now uses the underlying expression directly.

This is the error we got:
```
The workflow is not valid. .github/workflows/release.yml (Line: 351, Col: 9): Unrecognized named-value: 'env'. Located at position 1 within expression: env.IS_PRERELEASE != 'true'
```
Upgrade publishing workflows to more modern versions, borrowing from
pulumi/pulumi-aws. Fixes Node SDK publishing. Since the Node SDK is now
generated under sdk/node and is no longer special compared to other
SDKs, some changes were needed in the GitHub Actions publishing process
to get it to work right.
On master this works:

```
      - name: Create GH Release
        uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
          files: |
            dist/*.tar.gz
          prerelease: ${{ env.IS_PRERELEASE }}
        env:
          GITHUB_TOKEN: ${{ secrets.PULUMI_BOT_TOKEN }}
```

On release-3.x.x this fails:

```
  publish:
    name: publish
    needs:
      - prerequisites
      - test-nodejs
      - test-python
      - test-dotnet
      - test-go
    uses: ./.github/workflows/publish.yml
    secrets: inherit
    with:
      version: ${{ needs.prerequisites.outputs.version }}
      isPrerelease: ${{ env.IS_PRERELEASE }}
```

With:

```
 Invalid workflow file: .github/workflows/release.yml#L194
 The workflow is not valid. .github/workflows/release.yml (Line: 194, Col: 21):
 Unrecognized named-value: 'env'.
 Located at position 1 within expression: env.IS_PRERELEASE
```

Possibly related actions/runner#1189

Working around by in-lining the ENV var.
The Go SDK must be versioned as v3, otherwise Go refuses to use it.
This change adds the migration guide for EKS v3. We'll also publish it
to the docs as part of pulumi/home#3626, but by having it in the repo
we can already send it to alpha users.

Relates to pulumi/home#3626
Upgrades javagen to v0.16.1. This might solve the problems in #1402
…1410)

In #1373 the default node group
was updated to use the `NodeGroupV2` component. We missed changing the
`NodeGroupData` type to reflect this. It was still referring to a
property called `autoScalingGroupName`, but it should've been changed to
expose an `autoScalingGroup`.

Fixes #1402
Historically the following `NodeGroup` & `NodeGroupV2` input properties
have been plain:
- `kubeletExtraArgs`
- `bootstrapExtraArgs`
- `labels`
- `taints`
- `nodeAssociatePublicIpAddress`

Those should instead be inputs so users can pass outputs into them.

fixes #1274
Re: pulumi/ci-mgmt#1091

This additionally bumps the pu/pu version to 3.135.0.
Pulumi EKS currently always creates a cluster security group and node
security group.
- The cluster security group gets assigned to the control plane ENIs in
addition to the security group EKS creates (see [AWS
Docs](https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html)).
This security group gets an ingress rule from the node security group.
- The node security group gets assigned to `NodeGroup` and `NodeGroupV2`
components that do not specify a custom security group.

Users that either manage the node security group themselves or use the
`ManagedNodeGroup` component (which uses the EKS-created SG) do not need
those default security groups.

This change adds a flag on the cluster (`skipDefaultSecurityGroups`)
that skips creating those default security groups.

This introduces a small breaking change: the `clusterSecurityGroup`,
`nodeSecurityGroup` and `clusterIngressRule` outputs are now optional.
The impact of this should be minimal because users that create custom
node groups usually do not use the security groups of the cluster for
that. If they do, they need to add a null check.

Fixes #747
This adds a sentence about the enum changes to the migration guide.
Those changes are caused by auto-generating the node sdk now.
Added information about how the `VpcCni` component will be replaced by
the `VpcCniAddon` component and what effects this has.
This adds an example (and acceptance test) for EKS Network Policies.

The configuration is derived from this AWS example:
https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy-configure.html
This adds an example (and acceptance test) for the AWS feature: Security
Groups for Pods.

The configuration is derived from this AWS example:
https://docs.aws.amazon.com/eks/latest/userguide/security-groups-pods-deployment.html
The taints for the `ManagedNodeGroup` component were being wrongly
calculated when using custom user data.
That was the case because the EKS service uses different capitalization
for the taint effect enum than the Kubernetes API (e.g. `NO_SCHEDULE` vs
`NoSchedule`).
When building the custom user data we need to map the EKS-style enums to
Kubernetes-style enums, otherwise it doesn't work.

Fixing this also revealed that taint values being absent aren't
correctly handled either. The change fixes that as well.
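The enum mapping and the absent-value handling can be sketched like this (helper names are illustrative; the rendered string uses the common `key[=value]:Effect` taint form):

```typescript
// Sketch: map EKS-style taint effect enums (e.g. NO_SCHEDULE) to the
// Kubernetes-style enums the kubelet expects (e.g. NoSchedule).
const eksToK8sTaintEffect: Record<string, string> = {
    NO_SCHEDULE: "NoSchedule",
    PREFER_NO_SCHEDULE: "PreferNoSchedule",
    NO_EXECUTE: "NoExecute",
};

// Render a taint in `key[=value]:Effect` form.
// An absent value must not produce a dangling `=`.
function renderTaint(key: string, effect: string, value?: string): string {
    const k8sEffect = eksToK8sTaintEffect[effect] ?? effect;
    return value !== undefined && value !== ""
        ? `${key}=${value}:${k8sEffect}`
        : `${key}:${k8sEffect}`;
}

console.log(renderTaint("dedicated", "NO_SCHEDULE", "gpu")); // dedicated=gpu:NoSchedule
console.log(renderTaint("critical", "NO_EXECUTE")); // critical:NoExecute
```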
To ease the impact of the breaking API changes caused by generating the
node SDK, we decided to add additional scalar inputs that simplify UX
across all SDKs (for more details [see internal
doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)).

This change adds the scalar properties mentioned in the doc and adds
acceptance tests for them.
While adding the acceptance tests I noticed that running pods on Fargate
doesn't work deterministically. In some cases the cluster fails to get
healthy (coredns stuck in pending).
This was caused by a race condition between coredns starting and the
Fargate profile being created. If the Fargate profile deployed after
coredns, the pods got stuck in pending because they got assigned to the
`default-scheduler` instead of the `fargate-scheduler`.
The fix is relatively easy: making coredns depend on the Fargate
profile.

I'll separately update the migration guide.

### New properties

| Existing Resource | | New Top Level Property | Description |
| :---- | :---- | :---- | :---- |
| `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `clusterSecurityGroupId: Output<string>` | The only really useful property of a security group; used to add additional ingress/egress rules. Defaults to the EKS-created security group id. |
| `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `nodeSecurityGroupId: Output<string>` | |
| `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \| undefined>` | | `clusterIngressRuleId: Output<string>` | The only really useful property of a rule. Defaults to `""`. |
| `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | | `defaultNodeGroupAsgName: Output<string>` | The only useful property of the default node group is the auto scaling group; exposing its name allows users to reference it in IAM roles, tags, etc. Defaults to `""`. |
| `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>` | `fargateProfileId: Output<string>` | The id of the Fargate profile; can be used to reference it. Defaults to `""`. |
| | | `fargateProfileStatus: Output<string>` | The status of the Fargate profile. Defaults to `""`. |
| | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` | `oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` & `oidcIssuer: Output<string>` | The ARN and URL are needed to set up IAM identities for pods (required for the assume role policy of the IAM role). Users currently need to trim the `https://` part of the URL to actually use it; we should expose the issuer with that already done to ease usage. |
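The URL trimming mentioned in the last row of the table can be sketched as follows (illustrative helper name and example URL):

```typescript
// Sketch: derive the issuer value needed for an IAM assume-role policy
// from the OIDC provider URL by trimming the `https://` scheme.
function oidcIssuerFromUrl(oidcProviderUrl: string): string {
    return oidcProviderUrl.replace(/^https:\/\//, "");
}

console.log(oidcIssuerFromUrl("https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"));
// oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE
```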


Fixes #1041
This change builds on top of
#1445 and makes `NodeGroup` &
`NodeGroupV2` accept the scalar security group properties introduced in
that PR.

This way users can connect their node groups to the cluster without
having to use any applies.
Setting public access CIDRs with public access disabled does not work,
but the EKS service doesn't validate this case.
This can lead (and has led) to very confusing debugging sessions à la
"why can my IP not access the cluster endpoint? It's included in the
public access CIDR range!".

This change adds validation for the public access CIDR.

Fixes #1436
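The validation can be sketched like this (hypothetical names; the provider's actual property names and error message may differ):

```typescript
// Sketch: reject publicAccessCidrs when the public endpoint is disabled,
// since EKS silently accepts but ignores that combination.
function validatePublicAccessCidrs(
    endpointPublicAccess: boolean | undefined,
    publicAccessCidrs: string[] | undefined,
): void {
    if (endpointPublicAccess === false && publicAccessCidrs && publicAccessCidrs.length > 0) {
        throw new Error(
            "`publicAccessCidrs` has no effect when public endpoint access is disabled.",
        );
    }
}

validatePublicAccessCidrs(true, ["203.0.113.0/24"]); // ok
```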
@flostadler flostadler requested review from t0yv0 and a team October 17, 2024 14:17
@flostadler flostadler self-assigned this Oct 17, 2024

github-actions bot commented Oct 17, 2024

Does the PR have any schema changes?

Found 17 breaking changes:

Resources

  • "eks:index:Cluster": required:
    • 🟢 "clusterSecurityGroup" property is no longer Required
    • 🟢 "eksClusterIngressRule" property is no longer Required
    • 🟢 "nodeSecurityGroup" property is no longer Required
  • 🟢 "eks:index:NodeGroup": required: "nodeSecurityGroup" property is no longer Required
  • 🟢 "eks:index:NodeGroupV2": required: "nodeSecurityGroup" property is no longer Required
  • 🔴 "eks:index:VpcCni" missing

Types

  • "eks:index:CoreData":
    • 🟡 properties: "vpcCni" type changed from "#/resources/eks:index:VpcCni" to "#/resources/eks:index:VpcCniAddon"
    • 🟢 required: "clusterSecurityGroup" property is no longer Required
  • "eks:index:NodeGroupData":
    • properties:
      • 🟡 "autoScalingGroupName" missing
      • 🟡 "cfnStack" missing
    • required:
      • 🟢 "autoScalingGroup" property has changed to Required
      • 🟢 "autoScalingGroupName" property is no longer Required
      • 🟢 "cfnStack" property is no longer Required
  • "eks:index:VpcCniOptions": properties:
    • 🟡 "enableIpv6" missing
    • 🟡 "image" missing
    • 🟡 "initImage" missing
    • 🟡 "nodeAgentImage" missing

New resources:

  • index.VpcCniAddon

Member

@t0yv0 t0yv0 left a comment


🚢 once failing tests are figured out.

@flostadler
Contributor Author

🚢 once failing tests are figured out.

Ah well, forgot the needs major label

@flostadler flostadler added the needs-release/major Marking a PR to compute the next major version label Oct 17, 2024
#1445 and
#1446 introduced new scalar
properties as a workaround to the breaking Node.js SDK changes.

This documents those in the migration guide.
@flostadler
Contributor Author

I also ran another set of upgrade tests using the latest beta release: https://github.com/pulumi/pulumi-eks/releases/tag/v3.0.0-beta.2

I deployed a cluster using the latest v2 version, upgraded it to v3 (without any replacements) and then migrated the cluster to stop using deprecated resources. Worked without hiccups.

@flostadler flostadler merged commit 147a45b into master Oct 17, 2024
36 checks passed
@flostadler flostadler deleted the flostadler/v3-release branch October 17, 2024 16:57
@pulumi-bot
Contributor

This PR has been shipped in release v3.0.0.

@github-actions github-actions bot removed the needs-release/major Marking a PR to compute the next major version label Oct 17, 2024