# EKS v3 Release #1453

## Conversation
This adds the necessary handling for `nodeadm` user data, which is used for AL2023. `nodeadm` is a tool for bootstrapping kubernetes nodes. Its configuration interface is YAML-based and can be set via user data. The user data needs to be in MIME `multipart/mixed` format, which allows interleaving the nodeadm configuration with scripts or other user data entries. See more here: https://awslabs.github.io/amazon-eks-ami/nodeadm/.
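For illustration, here is a minimal nodeadm `NodeConfig` document as it might appear in one section of the MIME multipart user data, held as a TypeScript string. All values are placeholders, not what the provider actually generates:

```typescript
// Sketch of a nodeadm NodeConfig section (YAML), per the awslabs docs above.
// All values below are illustrative placeholders.
const nodeConfig = `---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster
    apiServerEndpoint: https://example.gr7.us-west-2.eks.amazonaws.com
    certificateAuthority: <base64-encoded cluster CA>
    cidr: 10.100.0.0/16
`;
```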
### Proposed changes

This change adds the necessary options for configuring settings for the Bottlerocket operating system. Bottlerocket configuration is driven by a user data script in TOML format. The provider will set the base configuration that's necessary for nodes to successfully register with the kubernetes cluster. Users can add additional settings or override the base configuration via the `bottlerocketSettings` parameter. An overview of the settings can be seen here: https://bottlerocket.dev/en/os/1.20.x/api/settings/.

### New dependencies

This adds two dependencies to the provider:

1. `@iarna/toml`: for converting the configuration into valid TOML. This allows us to expose `bottlerocketSettings` as an object instead of a string, so users do not need to worry about TOML formatting and serialization.
2. `ipaddr.js`: for calculating the `cluster-dns-ip`. We could write our own IP parser; it's not overly complex for IPv4, but IPv6 is trickier with shortened formats, and that's unnecessary complexity for the provider.

Both are maintained libraries with zero other dependencies.

### Remarks

Bottlerocket will only be supported with the `NodeGroupV2` and `ManagedNodeGroup` components, because the older `NodeGroup` (aka `NodeGroupV1`) uses CloudFormation under the hood and nodes need to signal that they're ready. Bottlerocket can't execute scripts during boot because it doesn't have a shell, so this is not supported.
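A sketch of what this could look like from a Pulumi program. The settings keys follow the Bottlerocket docs linked above; other required `NodeGroupV2` arguments are omitted for brevity, and the enum spelling is assumed from this release branch:

```typescript
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster");

// Sketch: extra Bottlerocket settings layered over the provider's base TOML.
const nodes = new eks.NodeGroupV2("bottlerocket-nodes", {
    cluster: cluster,
    operatingSystem: eks.OperatingSystem.Bottlerocket,
    bottlerocketSettings: {
        settings: {
            kubernetes: {
                "max-pods": 110, // merged with the provider-generated base config
            },
        },
    },
});
```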
#1337) The provider wrongly assumed that an AMI only has a single block device, but Bottlerocket has two: the root device stores the OS itself, while the second stores data like images, logs, and persistent storage. We need to allow users to configure the block device for data. With this change, the provider chooses which block device gets modified depending on the OS: if the OS is Bottlerocket, the data device gets modified. This also adds E2E tests verifying that the node storage capacity correctly reflects user settings.
The `ManagedNodeGroup` component was missing configuration options the other node groups had, specifically `amiId`, `gpu` and `userData`. These allow booting specific/custom AMIs, provisioning nodes with GPUs, and setting custom user data. The added E2E tests ensure this works as expected. Relates to #1224
…1340) This change adds a new input property called `nodeadmExtraConfig` to the node group components. This property allows injecting additional nodeadm sections into the user data. These can be virtually anything: some data, a shell script, or an additional nodeadm [`NodeConfig`](https://awslabs.github.io/amazon-eks-ami/nodeadm/). The nodeadm user data is a MIME multipart/mixed document, and every section has a string-based `content` and a MIME multipart `contentType` (see the sketch below). Right now there's no straightforward way to generate types for the nodeadm `NodeConfig` because it's not schematized. Work for enhancing this is tracked here: #1341.
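A sketch of injecting an extra section via the new property. The `{ content, contentType }` shape follows the description above; the other arguments and values are illustrative:

```typescript
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster");

// Sketch: append a shell script section to the nodeadm multipart user data.
const ng = new eks.NodeGroupV2("al2023-nodes", {
    cluster: cluster,
    operatingSystem: eks.OperatingSystem.AL2023,
    nodeadmExtraConfig: [
        {
            contentType: `text/x-shellscript; charset="us-ascii"`,
            content: '#!/bin/bash\necho "node booted" >> /var/log/boot-note.log',
        },
    ],
});
```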
### Proposed changes

This PR switches the `coredns` and `kube-proxy` addons from self-managed to managed. By default the latest compatible version will be used. This also introduces two new top-level arguments to `ClusterOptions` for configuring these new addons:

- `corednsAddonOptions`
- `kubeProxyAddonOptions`

BREAKING CHANGE: creating an `eks.Cluster` will now also create the `coredns` and `kube-proxy` addons. If you are already managing these yourself, you will need to disable their creation through the new arguments `ClusterOptions.corednsAddonOptions.enabled = false` and `ClusterOptions.kubeProxyAddonOptions.enabled = false`.

### Related issues (optional)

closes #1261, closes #1254
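Opting out then looks like this; a minimal sketch using only the two new arguments named above:

```typescript
import * as eks from "@pulumi/eks";

// Sketch: keep self-managed coredns/kube-proxy by disabling the managed addons.
const cluster = new eks.Cluster("cluster", {
    corednsAddonOptions: { enabled: false },
    kubeProxyAddonOptions: { enabled: false },
});
```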
### Proposed changes

The coredns managed addon can only be deployed on clusters with default node groups (which includes Fargate clusters).
Now that the EKS addons are added we need to align them and do some cleanup. This involves:

- adding the enums introduced in #1357 to the VPC CNI
- exposing `configurationValues` for coredns and kube-proxy
- removing kubectl from the provider
- deeply sorting addon configuration keys to guarantee stable JSON serialization (see the sketch below)
- removing deepmerge again; it caused issues during unit tests (voodoocreation/ts-deepmerge#22) and when used on outputs

Additionally I discovered and fixed an old bug that luckily never surfaced: the VPC CNI configuration incorrectly handled outputs and called `toString` on them in a couple of places. The increased type safety and tests around addon configuration uncovered this. Closes #1369
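For illustration, deep key sorting can be done with a small recursive helper like the following. This is a sketch, not necessarily the provider's exact implementation:

```typescript
// Recursively sort object keys so JSON.stringify output is stable
// regardless of key insertion order.
function deepSort(value: unknown): unknown {
    if (Array.isArray(value)) {
        return value.map(deepSort);
    }
    if (value !== null && typeof value === "object") {
        return Object.fromEntries(
            Object.entries(value as Record<string, unknown>)
                .sort(([a], [b]) => a.localeCompare(b))
                .map(([k, v]) => [k, deepSort(v)]),
        );
    }
    return value;
}

// JSON.stringify(deepSort({ b: 1, a: { d: 2, c: 3 } }))
// === '{"a":{"c":3,"d":2},"b":1}' regardless of insertion order.
```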
AWS deprecated AL2 and it will be EOL'd in June 2025. This change marks the AL2-related AMI types as deprecated so users are aware of this deprecation. The type `AmiTypes` is not released yet, so this is not a user-facing change. As a follow-up we want to publish a migration guide: pulumi/home#3626. Closes #1351
The `NodeGroup` component uses the deprecated AWS Launch Configuration ([see](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-configurations.html)). This marks the legacy CloudFormation-based self-managed NodeGroup (also referred to as NodeGroupV1) as deprecated. The Pulumi-native NodeGroupV2 is functionally equivalent (same inputs) but doesn't suffer from problems like [pulumi-eks#535](#535). Users will need to replace their self-managed node groups anyway to migrate away from AL2 in a safe way (see [What does a node group update look like for users?](https://docs.google.com/document/d/1XyLq_EyAziCp3f6rQ_8qfcUk0RMl8AgUq1mqpqsQcHM/edit#bookmark=id.l6qozya46ay3)). This also switches the default node group of the cluster component to use NodeGroupV2 instead. Closes #1353 Closes #1352
When using [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) to automatically scale node groups based on cluster requirements, Pulumi permanently shows a diff. The problem is that the cluster autoscaler takes control of the `desiredSize` of the `scalingConfig`, putting it out of sync with Pulumi state. MLCs don't support ignoring changes on their child resources, so we're adding a new input that lets you selectively ignore scaling changes (see the sketch below). This is not added to the deprecated `NodeGroup` (aka `NodeGroupV1`) because the desired size is part of the CloudFormation stack deployed as part of that component, and we cannot selectively ignore changes to the CloudFormation template because it's a string. Closes #985 #1293
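A sketch of what opting out could look like. The input name `ignoreScalingChanges` is an assumption based on this description, the scaling values are illustrative, and other required `ManagedNodeGroup` arguments are omitted:

```typescript
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster");

// Sketch: let Cluster Autoscaler own desiredSize without perpetual diffs.
const mng = new eks.ManagedNodeGroup("autoscaled-nodes", {
    cluster: cluster,
    scalingConfig: { minSize: 1, maxSize: 10, desiredSize: 2 },
    ignoreScalingChanges: true, // input name assumed; skips diffing desiredSize
});
```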
When Managed Node Groups are configured without a version, they default to the current cluster version at deploy time, and later changes to the cluster's version do not propagate to the node group. EKS managed node groups only support one minor version of skew between the control plane and data plane; otherwise cluster upgrades will fail. This change makes `ManagedNodeGroup`s track the cluster version unless users provide a fixed version. Fixes #1253
Driven by the deprecation of AL2 by AWS, we need to ensure users are deploying node groups with maintained and secure operating systems by default. This change adds a `RECOMMENDED` OS enum that points to AL2023 (AWS default) and uses it as the default for node groups. The upgrade tests are expected to fail as we're changing defaults. To re-record we need to first release a new baseline version (e.g. alpha release) Closes #1354
The provider was missing a pre-release workflow for publishing alpha/beta versions of the provider. This means that it was always deploying docs changes even for pre-releases. This changes that by conditionally skipping the docs publishing step for prereleases.
The release branch for EKS v3 now contains breaking changes. This causes the upgrade tests to fail, so we disabled them. Once the first alpha version is released we can re-record and re-enable them.
The current default, t2.medium, is an instance type from 2014 that is becoming less common in AWS data centers. As a result, users deploying clusters with the default instance type will encounter more errors. This mostly affects beginners; more experienced users typically do not rely on the default instance types and instead configure appropriate types for their workloads. This change replaces the default t2.medium instances with t3.medium. These newer instances offer better performance and are marginally cheaper ($0.0416 vs. $0.0464 per hour).
It seems that `env` cannot be used for controlling whether jobs run. This now uses the underlying expression directly. This is the error we got:

```
The workflow is not valid. .github/workflows/release.yml (Line: 351, Col: 9): Unrecognized named-value: 'env'. Located at position 1 within expression: env.IS_PRERELEASE != 'true'
```
Upgrade publishing workflows to more modern versions borrowing from pulumi/pulumi-aws. Fixes Node SDK publishing. Since Node SDK is now generated under sdk/node and is no longer special compared to other SDKs some changes were needed to the GitHub Actions publishing process to get it to work right.
On `master` this works:

```yaml
- name: Create GH Release
  uses: softprops/action-gh-release@v2
  with:
    generate_release_notes: true
    files: |
      dist/*.tar.gz
    prerelease: ${{ env.IS_PRERELEASE }}
  env:
    GITHUB_TOKEN: ${{ secrets.PULUMI_BOT_TOKEN }}
```

On `release-3.x.x` this fails:

```yaml
publish:
  name: publish
  needs:
    - prerequisites
    - test-nodejs
    - test-python
    - test-dotnet
    - test-go
  uses: ./.github/workflows/publish.yml
  secrets: inherit
  with:
    version: ${{ needs.prerequisites.outputs.version }}
    isPrerelease: ${{ env.IS_PRERELEASE }}
```

With:

```
Invalid workflow file: .github/workflows/release.yml#L194
The workflow is not valid. .github/workflows/release.yml (Line: 194, Col: 21): Unrecognized named-value: 'env'. Located at position 1 within expression: env.IS_PRERELEASE
```

Possibly related: actions/runner#1189. Working around by in-lining the env var.
The Go SDK must be versioned as v3, otherwise Go refuses to use it.
This change adds the migration guide for EKS v3. We'll also publish this to the docs as part of pulumi/home#3626, but by having it in the repo we can already send it to alpha users. Relates to pulumi/home#3626
Upgrades javagen to v0.16.1. This might solve the problems in #1402
Historically the following `NodeGroup` & `NodeGroupV2` input properties have been plain:

- `kubeletExtraArgs`
- `bootstrapExtraArgs`
- `labels`
- `taints`
- `nodeAssociatePublicIpAddress`

These should instead be inputs so users can pass outputs into them (see the sketch below). fixes #1274
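Once these are inputs, an `Output` can flow straight in. A sketch, with the availability zone as an illustrative stand-in for any output and other required `NodeGroupV2` arguments omitted:

```typescript
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster");

// An Output<string> from a data source; any resource output works the same way.
const zone = aws.getAvailabilityZonesOutput().names[0];

// Sketch: outputs can now be passed directly into the previously-plain
// properties; the same applies to taints, bootstrapExtraArgs, etc.
const ng = new eks.NodeGroupV2("nodes", {
    cluster: cluster,
    labels: { zone: zone },
});
```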
Re: pulumi/ci-mgmt#1091 This additionally bumps the pu/pu version to 3.135.0.
Pulumi EKS currently always creates a cluster security group and a node security group.

- The cluster security group gets assigned to the control plane ENIs in addition to the security group EKS creates (see [AWS Docs](https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html)). This security group gets an ingress rule from the node security group.
- The node security group gets assigned to `NodeGroup` and `NodeGroupV2` components that do not specify a custom security group.

Users that either manage the node security group themselves or use the `ManagedNodeGroup` component (which uses the EKS-created SG) do not need those default security groups. This change adds a flag on the cluster (`skipDefaultSecurityGroups`) that skips creating them (see the sketch below). This introduces a small breaking change: the `clusterSecurityGroup`, `nodeSecurityGroup` and `clusterIngressRule` outputs are now optional. The impact of this should be minimal because users that create custom node groups usually do not use the cluster's security groups for that. If they do, they need to add a null check. Fixes #747
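A sketch of opting out, with one of the newly optional outputs guarded by the null check mentioned above:

```typescript
import * as eks from "@pulumi/eks";

// Sketch: no default cluster/node security groups are created.
const cluster = new eks.Cluster("cluster", {
    skipDefaultSecurityGroups: true,
});

// The security group outputs are now optional and may resolve to undefined.
cluster.clusterSecurityGroup.apply(sg => {
    if (sg === undefined) {
        console.log("no default cluster security group was created");
    }
});
```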
This adds a sentence about the enum changes to the migration guide. Those changes are caused by auto-generating the node sdk now.
Added information about how the `VpcCni` component will be replaced by the `VpcCniAddon` component and what effects this has.
This adds an example (and acceptance test) for EKS Network Policies. The configuration is derived from this AWS example: https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy-configure.html
This adds an example (and acceptance test) for the AWS feature: Security Groups for Pods. The configuration is derived from this AWS example: https://docs.aws.amazon.com/eks/latest/userguide/security-groups-pods-deployment.html
The taints for the `ManagedNodeGroup` component were wrongly calculated when using custom user data. This was because the EKS service uses different capitalization for the taint effect enum than the kubernetes API (e.g. `NO_SCHEDULE` vs `NoSchedule`). When building the custom user data we need to map the EKS-style enums to kubernetes-style enums, otherwise it doesn't work (see the sketch below). Fixing this also revealed that absent taint values weren't handled correctly either; this change fixes that as well.
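For illustration, the mapping boils down to translating the EKS spelling into the kubernetes one. A sketch of an assumed helper, not the provider's exact code:

```typescript
// EKS API taint effects vs. the spelling the kubernetes API expects.
const eksToK8sTaintEffect: Record<string, string> = {
    NO_SCHEDULE: "NoSchedule",
    NO_EXECUTE: "NoExecute",
    PREFER_NO_SCHEDULE: "PreferNoSchedule",
};

// A taint with an absent value must render with an empty value
// (`key=:Effect`), not the string "undefined".
function renderTaint(key: string, value: string | undefined, eksEffect: string): string {
    return `${key}=${value ?? ""}:${eksToK8sTaintEffect[eksEffect]}`;
}
```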
To ease the impact of the breaking API changes caused by generating the node SDK, we decided to add additional scalar inputs that simplify UX across all SDKs (for more details [see internal doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)). This change adds the scalar properties mentioned in the doc and adds acceptance tests for them.

While adding the acceptance tests I noticed that running pods on Fargate doesn't work deterministically. In some cases the cluster fails to get healthy (coredns stuck in pending). This was caused by a race condition between coredns starting and the Fargate profile being created. If the Fargate profile deployed after coredns, the pods got stuck in pending because they got assigned to the `default-scheduler` instead of the `fargate-scheduler`. The fix is relatively easy: making coredns depend on the Fargate profile. I'll separately update the migration guide.

### New properties

| Existing resource | | New top-level property | Description |
| :---- | :---- | :---- | :---- |
| `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `clusterSecurityGroupId: Output<string>` | The only really useful property of a security group. Used to add additional ingress/egress rules. Defaults to the EKS-created security group id. |
| `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `nodeSecurityGroupId: Output<string>` | |
| `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \| undefined>` | | `clusterIngressRuleId: Output<string>` | The only really useful property of a rule. Defaults to `""`. |
| `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | | `defaultNodeGroupAsgName: Output<string>` | The only useful property of the default node group is the auto scaling group. Exposing its name allows users to reference it in IAM roles, tags, etc. Defaults to `""`. |
| `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>` | `fargateProfileId: Output<string>` | The id of the Fargate profile. Can be used to reference it. Defaults to `""`. |
| | | `fargateProfileStatus: Output<string>` | The status of the Fargate profile. Defaults to `""`. |
| | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` | `oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` & `oidcIssuer: Output<string>` | Arn and Url are needed to set up IAM identities for pods (required for the assume role policy of the IAM role). Users currently need to trim the `https://` prefix of the url to actually use it; `oidcIssuer` exposes the url with that already done to ease usage. |

Fixes #1041
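A sketch of how the scalar outputs remove the need for applies, e.g. adding an ingress rule (the port and CIDR are illustrative):

```typescript
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster");

// Sketch: reference the id directly instead of unwrapping the optional
// clusterSecurityGroup output with .apply.
const extraIngress = new aws.ec2.SecurityGroupRule("extra-ingress", {
    type: "ingress",
    securityGroupId: cluster.clusterSecurityGroupId,
    protocol: "tcp",
    fromPort: 443,
    toPort: 443,
    cidrBlocks: ["10.0.0.0/8"],
});
```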
This change builds on top of #1445 and makes `NodeGroup` & `NodeGroupV2` accept the scalar security group properties introduced in that PR. This way users can connect their node groups to the cluster without having to use any applies.
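A sketch of wiring a node group to the cluster via those scalar properties. The `NodeGroupV2` input names are assumed to mirror the cluster outputs from #1445, and other required arguments are omitted:

```typescript
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster");

// Sketch: feed the cluster's scalar outputs straight into the node group,
// no .apply on the optional security group outputs needed.
const ng = new eks.NodeGroupV2("nodes", {
    cluster: cluster,
    nodeSecurityGroupId: cluster.nodeSecurityGroupId,     // input name assumed
    clusterIngressRuleId: cluster.clusterIngressRuleId,   // input name assumed
});
```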
Setting public access CIDRs with public access disabled does not work, but the EKS service doesn't validate this case. This can lead (and has led) to very confusing debugging sessions à la "why can't my IP access the cluster endpoint? It's included in the public access CIDR range!". This change adds validation for the public access CIDRs. Fixes #1436
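For reference, a valid combination looks like this (a sketch; the CIDR is illustrative):

```typescript
import * as eks from "@pulumi/eks";

// publicAccessCidrs only take effect when the public endpoint is enabled;
// the provider now rejects publicAccessCidrs with endpointPublicAccess: false.
const cluster = new eks.Cluster("cluster", {
    endpointPublicAccess: true,
    publicAccessCidrs: ["203.0.113.0/24"],
});
```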
Does the PR have any schema changes? Found 17 breaking changes across resources and types, plus new resources.
🚢 once failing tests are figured out.
Ah well, forgot the needs major label
I also ran another set of upgrade tests using the latest beta release: https://github.com/pulumi/pulumi-eks/releases/tag/v3.0.0-beta.2. I deployed a cluster using the latest v2 version, upgraded it to v3 (without any replacements) and then migrated the cluster to stop using deprecated resources. Worked without hiccups.
This PR has been shipped in release v3.0.0.
This change includes all the changes from the release branch (https://github.com/pulumi/pulumi-eks/tree/release-3.x.x) rebased onto the current `master` branch. All changes in this change set are approved PRs that were merged into the release branch.