Let all clusters cloud infra support the instance types 4, 16, and 64 CPU highmem nodes #3256
Labels
nominated-to-be-resolved-during-q4-2023
Nomination to be resolved during q4 goal of reducing the technical debt
tech:cloud-infra
Optimization of cloud infra to reduce costs etc.
It's rare for us or the community to know in advance what kind of usage a hub will see, or what CPU/memory resource requests users will and should make. Even when we do know at one point, the requirements often change, for example if a workshop is planned.
Proposal - always set up 4 / 16 / 64 CPU highmem nodes
The proposal is that ~all clusters get three instance types set up: 4, 16, and 64 CPU, with ~32, ~128, and ~512 GB of memory respectively.
Practically this doesn't imply any cost or active use of these by itself, only that we will have node pools defined with these kinds of instances and that they are ready to be used without further cloud infra changes if needed.
These are the instance types I suggest we always set up for the various cloud providers. They all have 4 / 16 / 64 CPU and a memory specification of 32 / 128 / 512 GB, even though the resulting allocatable capacity in k8s is slightly different.
GCP: n2-highmem-4, n2-highmem-16, n2-highmem-64
AWS: r5.xlarge, r5.4xlarge, r5.16xlarge
Azure: Standard_E4a_v4, Standard_E16_v4, Standard_E64_v4
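To make the three-size scheme concrete, here is a minimal Python sketch of the instance types above as a lookup table, with a helper that picks the smallest size fitting a request. The provider keys (`gcp`, `aws`, `azure`), the `InstanceSize` class, and the `smallest_fitting_instance` helper are hypothetical names for illustration only and are not part of our terraform or eksctl templates. The sizes are the nominal 4 / 16 / 64 CPU and 32 / 128 / 512 GB figures, not the allocatable capacity in k8s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSize:
    """Nominal machine size; allocatable capacity in k8s is somewhat lower."""
    cpu: int
    memory_gb: int


# Nominal sizes from the proposal, smallest first.
SIZES = (InstanceSize(4, 32), InstanceSize(16, 128), InstanceSize(64, 512))

# Instance types listed in this issue, ordered to match SIZES.
# The provider keys are an assumption based on the instance naming.
INSTANCE_TYPES = {
    "gcp": ("n2-highmem-4", "n2-highmem-16", "n2-highmem-64"),
    "aws": ("r5.xlarge", "r5.4xlarge", "r5.16xlarge"),
    "azure": ("Standard_E4a_v4", "Standard_E16_v4", "Standard_E64_v4"),
}


def smallest_fitting_instance(provider, cpu, memory_gb):
    """Return the smallest listed instance type whose nominal size covers
    the request, or None if even the largest one is too small."""
    for size, name in zip(SIZES, INSTANCE_TYPES[provider]):
        if cpu <= size.cpu and memory_gb <= size.memory_gb:
            return name
    return None


if __name__ == "__main__":
    print(smallest_fitting_instance("aws", 8, 64))
```

Because the comparison uses nominal figures, a request sitting right at a size boundary (for example exactly 32 GB) may not actually fit on the node once system overhead is subtracted, so real choices should leave some headroom.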
Current status
New clusters get node pools like these set up by default; see our terraform templates for GCP (the daskhub template and the basehub template) and the eksctl template for AWS.
However, a few clusters are not set up like this, and whenever there is an event we end up with extra work to figure out how to handle things. When node sharing isn't used, users get slow startup times at all times, which is typically also a bad experience. See https://2i2c.freshdesk.com/a/tickets/1024 about this for carbonplan, for example.
Some motivation
I think this has worked out excellently so far, and I can motivate it further if needed, but here are some highlights of why I think it makes sense to provide a few instance types, such as those proposed, for all clusters.
For example, I think "Add script to generate resource allocation (nodeshare) choices" (#3030) would benefit from this.
For more motivation, see the duplicate issue #3176 that I forgot I had opened.
Action point