
EKS v3 Release #1453

Merged
merged 39 commits into from
Oct 17, 2024
Conversation

flostadler
Contributor

This change includes all the changes from the release branch: https://github.com/pulumi/pulumi-eks/tree/release-3.x.x rebased onto the current master branch.

All changes in this change set are approved PRs that were merged into the release branch.

flostadler and others added 30 commits October 17, 2024 16:09
This adds the necessary handling for `nodeadm` user data. This is used
for AL2023.

`nodeadm` is a tool used for bootstrapping Kubernetes nodes. Its
configuration interface is YAML-based and can be set via user data. The
user data needs to be in MIME `multipart/mixed` format. This allows
interleaving the nodeadm configuration with scripts or other user data
entries. See more here:
https://awslabs.github.io/amazon-eks-ami/nodeadm/.
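As an illustration, such a user data document could be assembled like this (a minimal sketch; the boundary string, helper name, and cluster name are illustrative, not what the provider actually emits):

```typescript
// Hypothetical sketch: assemble a MIME multipart/mixed user data document
// that interleaves a nodeadm NodeConfig with a shell script.
interface UserDataPart {
    contentType: string;
    content: string;
}

function buildMimeUserData(parts: UserDataPart[], boundary = "BOUNDARY"): string {
    const header = [
        "MIME-Version: 1.0",
        `Content-Type: multipart/mixed; boundary="${boundary}"`,
        "",
    ].join("\n");
    const body = parts
        .map((p) => [`--${boundary}`, `Content-Type: ${p.contentType}`, "", p.content].join("\n"))
        .join("\n");
    return `${header}\n${body}\n--${boundary}--\n`;
}

const userData = buildMimeUserData([
    {
        contentType: "application/node.eks.aws",
        content: [
            "apiVersion: node.eks.aws/v1alpha1",
            "kind: NodeConfig",
            "spec:",
            "  cluster:",
            "    name: my-cluster",
        ].join("\n"),
    },
    { contentType: "text/x-shellscript", content: "#!/bin/bash\necho hello" },
]);
console.log(userData);
```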
### Proposed changes

This change adds the necessary options for configuring settings for
Bottlerocket Operating Systems.
Bottlerocket configuration is driven by a user data script in TOML
format.

The provider will set the base configuration that's necessary for nodes
to successfully register with the kubernetes cluster. Users will have
the ability to add additional settings or override the base
configuration by using the `bottlerocketSettings` parameter. An overview
of the settings can be seen here:
https://bottlerocket.dev/en/os/1.20.x/api/settings/.
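For a rough idea of the shape of that user data, a hand-rolled TOML emitter for nested settings tables might look like the sketch below (illustrative only; the provider uses a proper TOML library, and the helper name is hypothetical):

```typescript
// Sketch: render a nested settings object as TOML tables.
// Handles nested objects and scalar values only, no arrays of tables.
function toToml(obj: Record<string, any>, prefix = ""): string {
    let scalars = "";
    let tables = "";
    for (const [key, val] of Object.entries(obj)) {
        if (val !== null && typeof val === "object" && !Array.isArray(val)) {
            const path = prefix ? `${prefix}.${key}` : key;
            tables += `[${path}]\n` + toToml(val, path);
        } else {
            scalars += `${key} = ${JSON.stringify(val)}\n`;
        }
    }
    return scalars + tables;
}

console.log(toToml({ settings: { kubernetes: { "cluster-name": "my-cluster" } } }));
// [settings]
// [settings.kubernetes]
// cluster-name = "my-cluster"
```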

### New dependencies
This adds two dependencies to the provider.
1. `@iarna/toml`: For converting the configuration into valid TOML.
This allows us to expose `bottlerocketSettings` as an object instead of
a string so users do not need to worry about TOML formatting and
serialization.
2. `ipaddr.js`: For calculating the `cluster-dns-ip`. We could write
our own IP parser, and it would not be overly complex for IPv4, but
IPv6 is trickier with shortened formats. That is unnecessary complexity
for the provider.

Both of them are maintained libraries with 0 other dependencies.
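For intuition, an IPv4-only version of the `cluster-dns-ip` calculation might look like the sketch below (assuming the DNS service IP is the tenth address of the service CIDR, which matches the EKS defaults; `ipaddr.js` is what makes IPv6 and shortened forms tractable):

```typescript
// Illustrative IPv4-only sketch of a cluster-dns-ip calculation.
// Assumption: the DNS service IP is the tenth address of the service CIDR.
function clusterDnsIpV4(serviceCidr: string): string {
    const [base, prefixStr] = serviceCidr.split("/");
    const prefix = parseInt(prefixStr, 10);
    const octets = base.split(".").map((o) => parseInt(o, 10));
    if (octets.length !== 4 || octets.some((o) => isNaN(o) || o < 0 || o > 255) || isNaN(prefix)) {
        throw new Error(`invalid IPv4 CIDR: ${serviceCidr}`);
    }
    // Pack the address into 32 bits, mask to the network address, then add 10.
    const addr = ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
    const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
    const dns = ((addr & mask) >>> 0) + 10;
    return [24, 16, 8, 0].map((s) => (dns >>> s) & 0xff).join(".");
}

console.log(clusterDnsIpV4("10.100.0.0/16")); // "10.100.0.10"
```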

### Remarks
Bottlerocket will only be supported with `NodeGroupV2` and
`ManagedNodeGroup` components because the older `NodeGroup` (aka
`NodeGroupV1`) uses CloudFormation under the hood and nodes need to
signal that they're ready. Bottlerocket can't execute scripts as part of
the boot up because it doesn't have a shell, so this is not supported.
#1337)

The provider wrongly assumed that an AMI only has a single block device,
but Bottlerocket has two: the root device stores the OS itself, and the
other is for data like images, logs, and persistent storage. We need to
allow users to configure the block device for data.

With this change, the provider chooses which block device to modify
depending on the OS. If the OS is Bottlerocket, the data device gets
modified.
This also adds E2E tests verifying that the node storage capacity
correctly reflects user settings.
The `ManagedNodeGroup` component was missing configuration options that
the other node groups had: namely `amiId`, `gpu` and `userData`.

Those will allow booting specific/custom AMIs, nodes with GPUs or
setting custom user data.
The added E2E tests ensure this works as expected.

Relates to #1224
…1340)

This change adds a new input property called `nodeadmExtraConfig` to the
node group components. This property will allow injecting additional
nodeadm sections into the user data.
This can be virtually anything. Some data, a shell script, or additional
nodeadm
[`NodeConfig`](https://awslabs.github.io/amazon-eks-ami/nodeadm/).

The nodeadm user data is a MIME multipart/mixed document, and every
section has string-based `content` and a MIME multipart `contentType`.

Right now there's no straightforward way to generate types for the
nodeadm `NodeConfig` because it's not schematized. Work for enhancing
this is tracked here: #1341.

### Proposed changes


This PR switches the `coredns` and `kube-proxy` addons from self-managed
to managed. By default the latest compatible version will be used.

This also introduces two new top level arguments to `ClusterOptions` for
configuring these new addons.

- `corednsAddonOptions`
- `kubeProxyAddonOptions`

BREAKING CHANGE: creating an `eks.Cluster` will now also create the
`coredns` and `kube-proxy` addons. If you are currently already managing
these you will need to disable the creation of these through the new
arguments `ClusterOptions.corednsAddonOptions.enabled = false` and
`ClusterOptions.kubeProxyAddonOptions.enabled = false`
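The opt-out semantics can be sketched like this (a minimal illustration with hypothetical names, not the provider's internals): the addons are created by default, and only an explicit `enabled: false` skips them.

```typescript
// Hypothetical sketch of the default-on addon behavior described above.
interface AddonOptions {
    enabled?: boolean;
    version?: string;
}

// The addon is created unless its options explicitly disable it.
function shouldCreateAddon(opts?: AddonOptions): boolean {
    return opts?.enabled !== false;
}

console.log(shouldCreateAddon(undefined)); // true
console.log(shouldCreateAddon({ enabled: false })); // false
```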

### Related issues (optional)

closes #1261, closes #1254

### Proposed changes

The coredns managed addon can only be deployed on clusters with default
node groups (which includes Fargate clusters).


### Related issues (optional)

Now that the EKS addons are added we need to align them and do some
cleanup. This involves:
- adding the enums introduced in
#1357 to the VPC CNI
- exposing `configurationValues` for coredns and kube-proxy
- removing kubectl from the provider
- deeply sorting addon configuration keys to guarantee stable JSON
serialization
- removing deepmerge again; it caused issues during unit tests
(voodoocreation/ts-deepmerge#22) and when used
on outputs.
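The deep key sorting can be sketched as follows (an illustrative helper, not the provider's actual code): recursively rebuild objects with their keys in sorted order so that `JSON.stringify` produces deterministic output.

```typescript
// Sketch: recursively sort object keys so JSON serialization is stable
// regardless of the insertion order of the input.
function deepSortKeys(value: any): any {
    if (Array.isArray(value)) {
        return value.map(deepSortKeys);
    }
    if (value !== null && typeof value === "object") {
        return Object.keys(value)
            .sort()
            .reduce((acc: any, key) => {
                acc[key] = deepSortKeys(value[key]);
                return acc;
            }, {});
    }
    return value;
}

const config = deepSortKeys({ env: { B: "2", A: "1" }, enableNetworkPolicy: "true" });
console.log(JSON.stringify(config)); // {"enableNetworkPolicy":"true","env":{"A":"1","B":"2"}}
```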

Additionally I discovered and fixed an old bug that luckily never
surfaced. The VPC CNI configuration incorrectly handled outputs and
called `toString` on them in a couple of places. The increased type
safety and tests around addon configuration uncovered this.

Closes #1369
AWS deprecated AL2, and it will be EOL'ed in June 2025. This change
marks the AL2-related AMI types as deprecated so users are aware of this
deprecation.

The type `AmiTypes` is not released yet, so this is not a user-facing
change.

As a follow up task we want to publish a migration guide:
pulumi/home#3626

Closes #1351
The `NodeGroup` component uses the deprecated AWS Launch Configuration
([see](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-configurations.html))

This marks the legacy CloudFormation based self-managed NodeGroup (also
referred to as NodeGroupV1) as deprecated. The Pulumi native NodeGroupV2
is functionally equivalent (same inputs) but doesn't suffer from
problems like
[pulumi-eks#535](#535). Users
will need to replace their self managed node groups anyway to migrate
away from AL2 in a safe way (see [What does a node group update look
like for
users?](https://docs.google.com/document/d/1XyLq_EyAziCp3f6rQ_8qfcUk0RMl8AgUq1mqpqsQcHM/edit#bookmark=id.l6qozya46ay3)).

This also switches the default node group of the cluster component to
use NodeGroupV2 instead.

Closes #1353 
Closes #1352
When using [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler)
for automatically scaling node groups based on cluster requirements, Pulumi permanently shows a diff.

The problem is that the cluster autoscaler takes control of the
desiredSize of the scalingConfig, putting it out of sync with Pulumi
state.

MLCs don't support ignoring changes on their child resources, so we're
adding a new input that allows you to selectively ignore scaling
changes.
This is not added to the deprecated `NodeGroup` (aka `NodeGroupV1`)
because the desired size is part of the CloudFormation stack that is
deployed as part of that component. We cannot selectively ignore changes
to the CloudFormation template because it's a string.

Closes #985 #1293
When configuring Managed Node Groups without a version, they default to
using the current cluster version at deploy time. Changes to the
cluster's version do not propagate to the node group in this case.
EKS managed node groups only support one minor version of skew between
the control plane and data plane; otherwise cluster upgrades will fail.

This change makes the ManagedNodeGroups track the cluster version unless
a fixed version is provided by users.

Fixes #1253
Driven by the deprecation of AL2 by AWS, we need to ensure users are
deploying node groups with maintained and secure operating systems by
default.
This change adds a `RECOMMENDED` OS enum that points to AL2023 (AWS
default) and uses it as the default for node groups.

The upgrade tests are expected to fail as we're changing defaults. To
re-record we need to first release a new baseline version (e.g. alpha
release)

Closes #1354
The provider was missing a pre-release workflow for publishing
alpha/beta versions of the provider.
This means that it was always deploying docs changes even for
pre-releases.

This changes that by conditionally skipping the docs publishing step for
prereleases.
The release branch for EKS v3 now contains breaking changes, which
causes the upgrade tests to fail. Because of that we disabled the
upgrade tests. Once the first alpha version is released we can re-record
and re-enable them.
The current default, t2.medium, is an instance type from 2014 that is
becoming less common in AWS data centers. This means users will
encounter more errors when deploying clusters with the provider when
using the default instance type.

This mostly affects beginner users, as more experienced users typically
do not rely on the default instance types and instead configure
appropriate types for their workloads.

This change replaces the default t2.medium instances with t3.medium.
These newer instances offer better performance and are marginally
cheaper ($0.0416 vs. $0.0464 per hour).
It seems that `env` cannot be used for controlling whether jobs run.
This now uses the underlying expression directly.

This is the error we got:
```
The workflow is not valid. .github/workflows/release.yml (Line: 351, Col: 9): Unrecognized named-value: 'env'. Located at position 1 within expression: env.IS_PRERELEASE != 'true'
```
Upgrade publishing workflows to more modern versions, borrowing from
pulumi/pulumi-aws. Fixes Node SDK publishing. Since the Node SDK is now
generated under sdk/node and is no longer special compared to other
SDKs, some changes were needed in the GitHub Actions publishing process
to get it to work right.
On master this works:

```
      - name: Create GH Release
        uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
          files: |
            dist/*.tar.gz
          prerelease: ${{ env.IS_PRERELEASE }}
        env:
          GITHUB_TOKEN: ${{ secrets.PULUMI_BOT_TOKEN }}
```

On release-3.x.x this fails:

```
  publish:
    name: publish
    needs:
      - prerequisites
      - test-nodejs
      - test-python
      - test-dotnet
      - test-go
    uses: ./.github/workflows/publish.yml
    secrets: inherit
    with:
      version: ${{ needs.prerequisites.outputs.version }}
      isPrerelease: ${{ env.IS_PRERELEASE }}
```

With:

```
 Invalid workflow file: .github/workflows/release.yml#L194
 The workflow is not valid. .github/workflows/release.yml (Line: 194, Col: 21):
 Unrecognized named-value: 'env'.
 Located at position 1 within expression: env.IS_PRERELEASE
```

Possibly related actions/runner#1189

Working around by in-lining the ENV var.
The Go SDK must be versioned as v3, otherwise Go refuses to use it.
This change adds the migration guide for EKS v3. We'll also publish it
to the docs as part of pulumi/home#3626, but by having it in the repo
we can already send it to alpha users.

Relates to pulumi/home#3626
Upgrades javagen to v0.16.1. This might solve the problems in #1402
…1410)

In #1373 the default node group
was updated to use the `NodeGroupV2` component. We missed changing the
`NodeGroupData` type to reflect this. It was still referring to a
property called `autoScalingGroupName`, but it should've been changed to
expose an `autoScalingGroup`.

Fixes #1402
Historically the following `NodeGroup` & `NodeGroupV2` input properties
have been plain:
- `kubeletExtraArgs`
- `bootstrapExtraArgs`
- `labels`
- `taints`
- `nodeAssociatePublicIpAddress`

Those should instead be inputs so users can pass outputs into them.

fixes #1274
Re: pulumi/ci-mgmt#1091

This additionally bumps the pu/pu version to 3.135.0.
Pulumi EKS currently always creates a cluster security group and node
security group.
- The cluster security group gets assigned to the control plane ENIs in
addition to the security group EKS creates (see [AWS
Docs](https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html)).
This security group gets an ingress rule from the node security group.
- The node security group gets assigned to `NodeGroup` and `NodeGroupV2`
components that do not specify a custom security group.

Users that either manage the node security group themselves or use the
`ManagedNodeGroup` component (which uses the EKS-created SG) do not need
those default security groups.

This change adds a flag on the cluster (`skipDefaultSecurityGroups`)
that skips creating those default security groups.

This introduces a small breaking change: the `clusterSecurityGroup`,
`nodeSecurityGroup` and `clusterIngressRule` outputs are now optional.
The impact of this should be minimal because users that create custom
node groups usually do not use the security groups of the cluster for
that. If they do, they need to add a null check.

Fixes #747
This adds a sentence about the enum changes to the migration guide.
Those changes are caused by auto-generating the node sdk now.
Added information about how the `VpcCni` component will be replaced by
the `VpcCniAddon` component and what effects this has.
This adds an example (and acceptance test) for EKS Network Policies.

The configuration is derived from this AWS example:
https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy-configure.html
This adds an example (and acceptance test) for the AWS feature: Security
Groups for Pods.

The configuration is derived from this AWS example:
https://docs.aws.amazon.com/eks/latest/userguide/security-groups-pods-deployment.html
The taints for the `ManagedNodeGroup` component were being wrongly
calculated when using custom user data.
That was the case because the EKS service uses different capitalization
for the taint effect enum than the Kubernetes API (e.g. `NO_SCHEDULE` vs
`NoSchedule`).
When building the custom user data we need to map the EKS-style enums to
Kubernetes-style enums, otherwise it doesn't work.

Fixing this also revealed that taint values being absent aren't
correctly handled either. The change fixes that as well.
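The enum mapping and the absent-value handling can be sketched like this (helper names are illustrative; the rendered string uses the common `key[=value]:Effect` taint form):

```typescript
// Sketch: map EKS-style taint effect enums (e.g. NO_SCHEDULE) to the
// Kubernetes-style enums the kubelet expects (e.g. NoSchedule).
const eksToK8sTaintEffect: Record<string, string> = {
    NO_SCHEDULE: "NoSchedule",
    PREFER_NO_SCHEDULE: "PreferNoSchedule",
    NO_EXECUTE: "NoExecute",
};

// Render a taint in `key[=value]:Effect` form.
// An absent value must not produce a dangling `=`.
function renderTaint(key: string, effect: string, value?: string): string {
    const k8sEffect = eksToK8sTaintEffect[effect] ?? effect;
    return value !== undefined && value !== ""
        ? `${key}=${value}:${k8sEffect}`
        : `${key}:${k8sEffect}`;
}

console.log(renderTaint("dedicated", "NO_SCHEDULE", "gpu")); // dedicated=gpu:NoSchedule
console.log(renderTaint("critical", "NO_EXECUTE")); // critical:NoExecute
```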
To ease the impact of the breaking API changes caused by generating the
node SDK, we decided to add additional scalar inputs that simplify UX
across all SDKs (for more details [see internal
doc](https://docs.google.com/document/d/1f97nmDUG_nrZSllYxu_XSeI7ON8vhZzfVrdBTQQmZzw/edit#heading=h.fbweiu8gc5bw)).

This change adds the scalar properties mentioned in the doc and adds
acceptance tests for them.
While adding the acceptance tests I noticed that running pods on Fargate
doesn't work deterministically. In some cases the cluster fails to get
healthy (coredns stuck in pending).
This was caused by a race condition between coredns starting and the
Fargate profile being created. If the Fargate profile deployed after
coredns, the pods got stuck in pending because they got assigned to the
`default-scheduler` instead of the `fargate-scheduler`.
The fix is relatively easy: making coredns depend on the Fargate
profile.

I'll separately update the migration guide.

### New properties

| Existing Resource | | New Top Level Property | Description |
| :---- | :---- | :---- | :---- |
| `clusterSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `clusterSecurityGroupId: Output<string>` | The only really useful property of a security group; used to add additional ingress/egress rules. Defaults to the EKS-created security group id. |
| `nodeSecurityGroup: Output<aws.ec2.SecurityGroup \| undefined>` | | `nodeSecurityGroupId: Output<string>` | |
| `eksClusterIngressRule: Output<aws.ec2.SecurityGroupRule \| undefined>` | | `clusterIngressRuleId: Output<string>` | The only really useful property of a rule. Defaults to `""`. |
| `defaultNodeGroup: Output<eks.NodeGroupData \| undefined>` | | `defaultNodeGroupAsgName: Output<string>` | The only useful property of the default node group is the auto scaling group; exposing its name allows users to reference it in IAM roles, tags, etc. Defaults to `""`. |
| `core` | `fargateProfile: Output<aws.eks.FargateProfile \| undefined>` | `fargateProfileId: Output<string>` | The id of the Fargate profile; can be used to reference it. Defaults to `""`. |
| | | `fargateProfileStatus: Output<string>` | The status of the Fargate profile. Defaults to `""`. |
| | `oidcProvider: Output<aws.iam.OpenIdConnectProvider \| undefined>` | `oidcProviderArn: Output<string>` & `oidcProviderUrl: Output<string>` & `oidcIssuer: Output<string>` | The ARN and URL are needed to set up IAM identities for pods (required for the assume role policy of the IAM role). Users currently need to trim the `https://` part of the URL to actually use it; we should expose the issuer with that already done to ease usage. |
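The URL trimming mentioned in the last row of the table can be sketched as follows (illustrative helper name and example URL):

```typescript
// Sketch: derive the issuer value needed for an IAM assume-role policy
// from the OIDC provider URL by trimming the `https://` scheme.
function oidcIssuerFromUrl(oidcProviderUrl: string): string {
    return oidcProviderUrl.replace(/^https:\/\//, "");
}

console.log(oidcIssuerFromUrl("https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"));
// oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE
```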


Fixes #1041
This change builds on top of
#1445 and makes `NodeGroup` &
`NodeGroupV2` accept the scalar security group properties introduced in
that PR.

This way users can connect their node groups to the cluster without
having to use any applies.
Setting public access CIDRs with public access disabled does not work,
but the EKS service doesn't validate this case.
This can lead (and has led) to very confusing debugging sessions à la
"why can my IP not access the cluster endpoint? It's included in the
public access CIDR range!".

This change adds validation for the public access CIDR.

Fixes #1436
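The validation can be sketched like this (hypothetical names; the provider's actual property names and error message may differ):

```typescript
// Sketch: reject publicAccessCidrs when the public endpoint is disabled,
// since EKS silently accepts but ignores that combination.
function validatePublicAccessCidrs(
    endpointPublicAccess: boolean | undefined,
    publicAccessCidrs: string[] | undefined,
): void {
    if (endpointPublicAccess === false && publicAccessCidrs && publicAccessCidrs.length > 0) {
        throw new Error(
            "`publicAccessCidrs` has no effect when public endpoint access is disabled.",
        );
    }
}

validatePublicAccessCidrs(true, ["203.0.113.0/24"]); // ok
```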
@flostadler flostadler requested review from t0yv0 and a team October 17, 2024 14:17
@flostadler flostadler self-assigned this Oct 17, 2024

github-actions bot commented Oct 17, 2024

Does the PR have any schema changes?

Found 17 breaking changes:

Resources

  • "eks:index:Cluster": required:
    • 🟢 "clusterSecurityGroup" property is no longer Required
    • 🟢 "eksClusterIngressRule" property is no longer Required
    • 🟢 "nodeSecurityGroup" property is no longer Required
  • 🟢 "eks:index:NodeGroup": required: "nodeSecurityGroup" property is no longer Required
  • 🟢 "eks:index:NodeGroupV2": required: "nodeSecurityGroup" property is no longer Required
  • 🔴 "eks:index:VpcCni" missing

Types

  • "eks:index:CoreData":
    • 🟡 properties: "vpcCni" type changed from "#/resources/eks:index:VpcCni" to "#/resources/eks:index:VpcCniAddon"
    • 🟢 required: "clusterSecurityGroup" property is no longer Required
  • "eks:index:NodeGroupData":
    • properties:
      • 🟡 "autoScalingGroupName" missing
      • 🟡 "cfnStack" missing
    • required:
      • 🟢 "autoScalingGroup" property has changed to Required
      • 🟢 "autoScalingGroupName" property is no longer Required
      • 🟢 "cfnStack" property is no longer Required
  • "eks:index:VpcCniOptions": properties:
    • 🟡 "enableIpv6" missing
    • 🟡 "image" missing
    • 🟡 "initImage" missing
    • 🟡 "nodeAgentImage" missing

New resources:

  • index.VpcCniAddon

Member

@t0yv0 t0yv0 left a comment


🚢 once failing tests are figured out.

@flostadler
Contributor Author

🚢 once failing tests are figured out.

Ah well, forgot the needs major label

@flostadler flostadler added the needs-release/major Marking a PR to compute the next major version label Oct 17, 2024
#1445 and
#1446 introduced new scalar
properties as a workaround to the breaking Node.js SDK changes.

This documents those in the migration guide.
@flostadler
Contributor Author

I also ran another set of upgrade tests using the latest beta release: https://github.com/pulumi/pulumi-eks/releases/tag/v3.0.0-beta.2

I deployed a cluster using the latest v2 version, upgraded it to v3 (without any replacements) and then migrated the cluster to stop using deprecated resources. Worked without hiccups.

@flostadler flostadler merged commit 147a45b into master Oct 17, 2024
36 checks passed
@flostadler flostadler deleted the flostadler/v3-release branch October 17, 2024 16:57
@pulumi-bot
Contributor

This PR has been shipped in release v3.0.0.

@github-actions github-actions bot removed the needs-release/major Marking a PR to compute the next major version label Oct 17, 2024