
[Incident] Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers #806

consideRatio opened this issue Nov 5, 2021 · 4 comments

@consideRatio
Contributor

Summary

On 4 November we received a support ticket (https://2i2c.freshdesk.com/a/tickets/40):

Spawning error for Grenoble hub

I started getting timeout/spawn errors when trying to start a session
with either "very large" or "huge" resource allocations.

It turned out that the e2 and n2 machine types our GCP node pools previously used have a bit more memory per CPU than the n1 type we switched to in #665. Since we only switched the instance type, without updating our mem_guarantee, the new n1 nodes couldn't fit a user pod, so spawns failed with an error whose key part is the text "pod didn't trigger scale-up".

[screenshot of the spawn error message]

I think this is resolved by @damianavila in #804.
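
To make the mismatch concrete: for the standard shapes, GCP's e2 and n2 machine types offer 4 GB of memory per vCPU while n1 offers 3.75 GB, so a mem_guarantee sized against an e2/n2 node can exceed what a same-sized n1 node can provide. A minimal sketch of the failing shape of configuration, with hypothetical values rather than the actual meom-ige ones:

```yaml
# Hypothetical "huge" profile sized for a 32-vCPU e2/n2 node (~128 GB memory).
# On an n1-standard-32 (~120 GB memory, and less than that allocatable once
# system/kubelet reservations are subtracted) the guarantee can no longer be
# satisfied: the user pod stays Pending and the cluster autoscaler reports
# "pod didn't trigger scale-up", since no node it could add would fit the pod.
- display_name: Huge
  kubespawner_override:
    mem_guarantee: 115G   # fits a ~128 GB node, too large for a ~120 GB one
    node_selector:
      node.kubernetes.io/instance-type: n1-standard-32
```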

Timeline (if relevant)

I'm opting not to invest the time to write the timeline in that much detail; it's more than I feel is justified.

After-action report

What went wrong

  • Erik didn't clearly understand what "Grenoble hub" referred to: he looked inside 2i2c-org/infrastructure under pilot-hubs, found a k8s namespace named grenoble, and assumed that was where the problem was. In reality, the problem was in the dedicated k8s cluster, not in the 2i2c-managed pilot-hubs k8s cluster.
  • We changed the machine type without updating the associated configuration in the affected clusters: the descriptions of available memory, the requested memory (mem_guarantee), and so on.

Action items

Process improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Documentation improvements

  • Add an inline comment next to the machine types in the Terraform .tfvars files noting that changing a machine type should go hand in hand with updating the available-memory descriptions in the profileList sections.
  • If we have dedicated machines, it's sufficient to request more than 50% of the available memory to prevent two user pods from scheduling on the same node. In our profileList sections we can add a comment next to mem_guarantee saying it should be more than 50% of the node's capacity but less than 100% so the pod actually fits; at roughly 75% of the available capacity we avoid trouble across the n1, n2, and e2 types (see the sketch below).
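
As a concrete illustration of that guideline, here is a minimal sketch of what such a profileList entry and comment could look like. The machine type, memory values, and display name are hypothetical, not the actual meom-ige configuration:

```yaml
jupyterhub:
  singleuser:
    profileList:
      - display_name: "Very large"
        description: "~20 GB RAM on a dedicated node"   # hypothetical sizing
        kubespawner_override:
          # Keep mem_guarantee above 50% of the node's allocatable memory so a
          # second user pod can never schedule onto the same node, but below
          # 100% so the pod still fits; ~75% of allocatable leaves headroom on
          # n1, n2, and e2 machine types alike.
          mem_guarantee: 20G   # roughly 75% of a hypothetical n1-standard-8's allocatable memory
          node_selector:
            node.kubernetes.io/instance-type: n1-standard-8
```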

Technical improvements

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report is cleaned up
  • All actionable items above have linked GitHub Issues
@choldgraf changed the title from "Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers" to "[Incident] Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers" on Nov 5, 2021
@choldgraf
Member

@yuvipanda also shared this helpful resource on understanding allocatable resources: https://learnk8s.io/allocatable-resources
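
For context on why "available memory" is smaller than the machine size: a node's capacity (the raw machine resources) is larger than its allocatable resources (what remains for pods after kubelet/system reservations and the eviction threshold), and it's the allocatable figure that mem_guarantee has to fit under. A trimmed example of the relevant part of `kubectl get node <node-name> -o yaml`, with illustrative numbers rather than values from the meom-ige cluster:

```yaml
# Illustrative figures only, not taken from the meom-ige cluster.
status:
  capacity:
    cpu: "32"
    memory: 125629360Ki     # the machine's raw memory, roughly 120 GB
  allocatable:
    cpu: 31850m
    memory: 114184112Ki     # what remains for user pods, roughly 109 GB
```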

@damianavila
Contributor

damianavila commented Nov 5, 2021

I have merged #804, but CI failed its tests (https://github.com/2i2c-org/infrastructure/runs/4120910918?check_suite_focus=true#step:9:205) after deploying staging, so I'm going with a manual deploy on prod.

@damianavila
Contributor

Manual deployment worked (although with a possibly flaky failed test as well).
Testing the spin-up of different profile options worked OK, including the previously broken very-large option.

$ kubectl get nodes | grep very-large
gke-meom-ige-cluster-nb-very-large-b79ad0a7-3k7n   Ready    <none>   6m40s   v1.19.12-gke.2101

and

$ kubectl get pods -n prod | grep damian
jupyter-damianavila                            1/1     Running   0          6m36s

So, I think we are good now.

PS: I will follow up on the test failure in another issue or Slack thread.

@consideRatio
Contributor Author

I'll close this since it's stale and I don't think it merits further action. I considered the action points listed and see them as mostly outdated; we should instead focus on #3030.
