Update AAW prod nodepools vm sizing #1967

Open
Tracked by #1966
vexingly opened this issue Sep 18, 2024 · 7 comments

vexingly commented Sep 18, 2024

Let's downsize useruc: we have fewer users than we used to, and most of its resources are being wasted.

The usercpu72 nodes use an old F SKU that doesn't provide any advantages, so let's put the D64as_v5 in that spot for larger workloads. An F64xx_v6 may be on the way that would be a better fit, but it isn't available in Canada Central right now.

  1. useruc - Standard_D64as_v5 -> Standard_D16as_v5
  2. userpb - Standard_D16s_v3 -> Standard_D16as_v5
  3. usercpu72uc - Standard_F72s_v2 -> Standard_D64as_v5, rename usercpuuc?
  4. usercpu72pb - Standard_F72s_v2 -> Standard_D64as_v5, rename usercpupb?
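For a rough sense of what each swap means per node, here is a minimal Python sketch. The vCPU and memory figures are the published Azure sizes for these SKUs as I understand them; the pool-to-SKU mapping comes from the list above, and the helper name `per_node_delta` is made up for illustration. Node counts are not stated in this issue, so only per-node deltas are shown.

```python
# Per-node (vCPU, memory in GiB) for each SKU mentioned above.
# Assumption: these match the published Azure VM size tables.
VM_SPECS = {
    "Standard_D64as_v5": (64, 256),
    "Standard_D16as_v5": (16, 64),
    "Standard_D16s_v3": (16, 64),
    "Standard_F72s_v2": (72, 144),
}

# (nodepool, current SKU, proposed SKU), taken from the list in this issue.
PROPOSED = [
    ("useruc", "Standard_D64as_v5", "Standard_D16as_v5"),
    ("userpb", "Standard_D16s_v3", "Standard_D16as_v5"),
    ("usercpu72uc", "Standard_F72s_v2", "Standard_D64as_v5"),
    ("usercpu72pb", "Standard_F72s_v2", "Standard_D64as_v5"),
]


def per_node_delta(old_sku, new_sku):
    """Return (vCPU change, GiB change) when one node moves from old_sku to new_sku."""
    old_cpu, old_mem = VM_SPECS[old_sku]
    new_cpu, new_mem = VM_SPECS[new_sku]
    return new_cpu - old_cpu, new_mem - old_mem


if __name__ == "__main__":
    for pool, old, new in PROPOSED:
        d_cpu, d_mem = per_node_delta(old, new)
        print(f"{pool}: {old} -> {new}: {d_cpu:+d} vCPU, {d_mem:+d} GiB per node")
```

Under these figures, the useruc swap cuts each node to a quarter of its size, the userpb swap keeps the same size on a newer series, and the F72 to D64 swap trades 8 vCPUs for 112 GiB of extra memory per node.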

Lower priority

  1. storage - remove/unused?
  2. monitoring - remove/unused?

These changes can be made in the aaw-prod Terraform here: https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/terraform-advanced-analytics-workspaces-infrastructure/-/blob/main/prod_cc_00.tf?ref_type=heads

@vexingly vexingly mentioned this issue Sep 18, 2024
10 tasks
@vexingly vexingly changed the title Update nodepool default node size / replace big-cpu vm series Update AAW prod nodepools vm sizing Sep 18, 2024
@Souheil-Yazji
Contributor

To be deployed after work hours on Wed Sept 25 or Oct 2, since it's a deployment to prod that will affect nodes.

@EveningStarlight

Changes are being made in:
https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/modules/terraform-azure-statcan-aaw-environment/-/merge_requests/51

Pending review and the correct timeframe for merge

@Souheil-Yazji
Contributor

Souheil-Yazji commented Sep 26, 2024

  1. check the logic for creating new notebooks, which possibly uses the nodepool name (this might be done with taints/tolerations only)
  2. check with the census coding team, who are also using the f72 nodes, that they're good with the changes
  3. check if there's anything particular running on the monitoring/storage nodepools
    a. check argocd manifests (or other repos) for tolerated deployments
    b. inspect nodes on cluster for particular pods with tolerations
    c. matching on actual usage is probably the best bet
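Step 3b could be sketched roughly as below: dump the pods with `kubectl get pods --all-namespaces -o json` and filter for tolerations that name the nodepool's taint key. This is only a sketch; the taint key `example.com/purpose`, the sample pods, and the function name `pods_tolerating` are all hypothetical, since the real taint keys on the monitoring/storage pools aren't given in this issue.

```python
import json
import sys


def pods_tolerating(pod_list, taint_key):
    """Return (namespace, name) for every pod with a toleration keyed on taint_key.

    pod_list is the parsed JSON from `kubectl get pods --all-namespaces -o json`.
    Tolerations with no key (operator Exists, tolerate everything) are ignored
    on purpose, because they would match every taint and add noise.
    """
    matches = []
    for pod in pod_list.get("items", []):
        tolerations = pod.get("spec", {}).get("tolerations") or []
        if any(t.get("key") == taint_key for t in tolerations):
            meta = pod.get("metadata", {})
            matches.append((meta.get("namespace", ""), meta.get("name", "")))
    return matches


# Made-up sample in the shape kubectl returns; not real cluster data.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "ns-a", "name": "pod-a"},
            "spec": {"tolerations": [
                {"key": "example.com/purpose", "operator": "Equal",
                 "value": "monitoring", "effect": "NoSchedule"},
            ]},
        },
        {
            "metadata": {"namespace": "ns-b", "name": "pod-b"},
            "spec": {"tolerations": [{"operator": "Exists"}]},
        },
        {"metadata": {"namespace": "ns-c", "name": "pod-c"}, "spec": {}},
    ]
}

if __name__ == "__main__":
    # With a file argument, read real kubectl output; otherwise use the sample.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as fh:
            data = json.load(fh)
    else:
        data = SAMPLE
    for namespace, name in pods_tolerating(data, "example.com/purpose"):
        print(f"{namespace}/{name}")
```

An empty result for a pool's taint key would support treating that pool as unused, though it only reflects what is running at the time of the dump.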

@EveningStarlight

EveningStarlight commented Oct 1, 2024

No examples found using the monitoring nodepool

Storage nodepool found at https://github.com/StatCan/terraform-kubernetes-aks-daaas-private/blob/master/aks.tf#L133
This repo is archived

@EveningStarlight

EveningStarlight commented Oct 1, 2024

No use of nodepool names was found on GitHub. It's safe to assume the relationships are defined only by taints.

@Souheil-Yazji
Contributor

@vexingly

for
usercpu72uc - Standard_F72s_v2 -> Standard_D64as_v5, rename usercpuuc?
usercpu72pb - Standard_F72s_v2 -> Standard_D64as_v5, rename usercpupb?

The Census Chat bot team is using the F72 nodes to host a large notebook that doesn't require much memory (70 CPU / 128 GB mem). If I push them to the D64 machine, chances are nothing else will be scheduled on that node, which leaves ~120 GB of memory idle.

I'm thinking we just go ahead and create a new node pool. Thoughts?
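The idle-memory estimate above can be checked with a quick back-of-the-envelope calculation. This is a sketch only: the node sizes are the published Azure figures as I understand them, the notebook request is the 70 CPU / 128 GB quoted above, GB and GiB are treated as interchangeable, and system/kubelet reservations are ignored. The function names are made up.

```python
# Per-node (vCPU, memory) for the two candidate SKUs.
# Assumption: published Azure sizes; GB and GiB treated as the same unit here.
F72S_V2 = (72, 144)
D64AS_V5 = (64, 256)

# Notebook request quoted in the comment above.
NOTEBOOK_CPU = 70
NOTEBOOK_MEM = 128


def idle_memory(node, workload_mem):
    """Memory left on the node if the workload is the only thing scheduled there."""
    return node[1] - workload_mem


def cpu_fits(node, workload_cpu):
    """True if the workload's CPU request is within the node's vCPU count."""
    return workload_cpu <= node[0]


if __name__ == "__main__":
    for name, node in (("F72s_v2", F72S_V2), ("D64as_v5", D64AS_V5)):
        print(
            f"{name}: idle memory {idle_memory(node, NOTEBOOK_MEM)}, "
            f"70-CPU request fits: {cpu_fits(node, NOTEBOOK_CPU)}"
        )
```

On these numbers the D64 leaves 128 idle before reservations, in line with the ~120 estimate, versus 16 on the F72. It also suggests the notebook's CPU request would have to drop below 64 to land on a D64 at all, which is another point in favour of a separate pool.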

@vexingly
Author

vexingly commented Oct 2, 2024

A new node pool is probably the easiest path forward right now; we can always clean up the F72 nodes or update them later.
