Let all clusters cloud infra support the instance types 4, 16, and 64 CPU highmem nodes #3256

Closed
1 task
consideRatio opened this issue Oct 11, 2023 · 0 comments · Fixed by #3319
Labels
  • nominated-to-be-resolved-during-q4-2023: Nomination to be resolved during q4 goal of reducing the technical debt
  • tech:cloud-infra: Optimization of cloud infra to reduce costs etc.

Comments

Contributor

consideRatio commented Oct 11, 2023

It's rare for us or the community to know in advance what kind of usage a hub will see, and which CPU/memory resource requests users will and should make. Even if we do know at some point, the requirements often change, for example when a workshop is planned.

Proposal - always set up 4 / 16 / 64 CPU highmem nodes

The proposal is that we set up ~all clusters with three instance types: 4, 16, and 64 CPU, with ~32, ~128, and ~512 GB of memory respectively.

In practice, this by itself doesn't imply any cost or active use of these instances. It only means that node pools with these instance types are defined and ready to be used, if needed, without further cloud infra changes.

These are the instance types I suggest we always set up for the various cloud providers. They all have 4 / 16 / 64 CPU and a memory specification of 32 / 128 / 512 GB, even though the resulting capacity in k8s is slightly different.

  • GKE: n2-highmem-4, n2-highmem-16, n2-highmem-64
  • EKS: r5.xlarge, r5.4xlarge, r5.16xlarge
  • AKS: Standard_E4a_v4, Standard_E16_v4, Standard_E64_v4
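
As a rough illustration, the list above could be captured as a small lookup table. This is only a sketch: the names `NOMINAL_SIZES`, `INSTANCE_TYPES`, and `node_pools` are hypothetical and not part of our terraform or eksctl templates, and the CPU/memory figures are the nominal ones quoted in this issue rather than the allocatable capacity in k8s.

```python
# Nominal (CPU, memory in GB) sizes from the proposal, small to large.
# Assumption: these are the advertised figures; allocatable k8s capacity is lower.
NOMINAL_SIZES = [(4, 32), (16, 128), (64, 512)]

# Instance types as listed in this issue, in the same small-to-large order.
INSTANCE_TYPES = {
    "GKE": ["n2-highmem-4", "n2-highmem-16", "n2-highmem-64"],
    "EKS": ["r5.xlarge", "r5.4xlarge", "r5.16xlarge"],
    "AKS": ["Standard_E4a_v4", "Standard_E16_v4", "Standard_E64_v4"],
}


def node_pools(provider):
    """Return hypothetical node pool descriptions for one provider."""
    return [
        {"instance_type": name, "cpu": cpu, "memory_gb": mem}
        for name, (cpu, mem) in zip(INSTANCE_TYPES[provider], NOMINAL_SIZES)
    ]


if __name__ == "__main__":
    for pool in node_pools("EKS"):
        print(pool)
```

A table like this would make it easy to check that every cluster defines all three pools, whatever the provider.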

Current status

New clusters get node pools like these set up by default: see our terraform templates for GCP (the daskhub template and the basehub template) and the eksctl template for AWS.

However, a few clusters are not set up like this, and whenever there is an event we end up with extra work figuring out how to handle it. When node sharing isn't used, startup times are slow at all times, which is typically also a bad experience. See for example https://2i2c.freshdesk.com/a/tickets/1024 about this for carbonplan.

Some motivation

I think this has worked out excellently so far, and I can motivate it further if needed. Here are some highlights of why I think it makes sense to provide a few instance types, such as those proposed, for all clusters.

  • easy to adjust for events, where we want to schedule more users than usual on each node to reduce startup times; at other times, packing many users onto large nodes can be bad because it hinders the efficient scale down that avoids costs
  • easier to document and implement procedures around a standard, as well as to verify that it's robust
    For example, I think "Add script to generate resource allocation (nodeshare) choices" #3030 would benefit from this
  • it's little extra toil to upgrade another node pool or two during k8s upgrades, but I think we should still avoid having too many different instance types, so having three by default could be a reasonable trade-off
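
To make the first point more concrete, here is a minimal sketch of how many users fit on each node size. It assumes hypothetical per-user requests of 0.5 CPU and 4 GB memory, and it uses the nominal node sizes from the proposal, ignoring the capacity k8s reserves for system components. The function name `users_per_node` is made up for this example.

```python
# Nominal node sizes from the proposal: CPU count -> memory in GB.
NOMINAL_NODES = {4: 32, 16: 128, 64: 512}


def users_per_node(node_cpu, user_cpu, user_mem_gb):
    """Users that fit on a node, limited by whichever of CPU or memory runs out first."""
    node_mem_gb = NOMINAL_NODES[node_cpu]
    return min(int(node_cpu // user_cpu), int(node_mem_gb // user_mem_gb))


if __name__ == "__main__":
    # Hypothetical per-user requests, chosen only for illustration.
    for node_cpu in NOMINAL_NODES:
        fit = users_per_node(node_cpu, user_cpu=0.5, user_mem_gb=4)
        print(f"{node_cpu} CPU node: {fit} users")
```

Under these assumptions a larger node hosts many more users, so fewer node startups are needed during an event, while a single lingering user can keep a large node from scaling down afterwards.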

For more motivation, see the duplicate issue #3176 that I forgot I had opened.

Action point

  • @2i2c-org/engineering to verify their understanding of the proposal, optionally refine it, and then seek agreement on what to do

consideRatio changed the title from "temp placeholder issue" to "Let all clusters have cloud infra to use three instance types - 4 / 16 / 64 CPU" Oct 11, 2023
consideRatio changed the title from "Let all clusters have cloud infra to use three instance types - 4 / 16 / 64 CPU" to "Let all clusters have cloud infra to use three instance types - 4 / 16 / 64 CPU (highmem nodes)" Oct 11, 2023
consideRatio moved this from Needs Shaping / Refinement to Ready to work in DEPRECATED Engineering and Product Backlog Oct 11, 2023
consideRatio added the tech:cloud-infra (Optimization of cloud infra to reduce costs etc.) label Oct 11, 2023
consideRatio changed the title from "Let all clusters have cloud infra to use three instance types - 4 / 16 / 64 CPU (highmem nodes)" to "Let all clusters have cloud infra to support three instance types - 4 / 16 / 64 CPU highmem nodes" Oct 12, 2023
consideRatio changed the title from "Let all clusters have cloud infra to support three instance types - 4 / 16 / 64 CPU highmem nodes" to "Let all clusters cloud infra support the instance types 4, 16, and 64 CPU highmem nodes" Oct 12, 2023
consideRatio added the nominated-to-be-resolved-during-q4-2023 (Nomination to be resolved during q4 goal of reducing the technical debt) label Oct 12, 2023
damianavila moved this to Todo 👍 in Sprint Board Oct 23, 2023
consideRatio moved this from Todo 👍 to In Progress ⚡ in Sprint Board Oct 30, 2023
github-project-automation bot moved this from In Progress ⚡ to Done 🎉 in Sprint Board Oct 31, 2023
github-project-automation bot moved this from Ready to work to Complete in DEPRECATED Engineering and Product Backlog Oct 31, 2023
Status: Done 🎉
2 participants