
[Incident] Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers #806

consideRatio opened this issue Nov 5, 2021 · 4 comments

@consideRatio
Contributor

Summary

On 4 November we received a support ticket (https://2i2c.freshdesk.com/a/tickets/40):

Spawning error for Grenoble hub

I started getting timeout/spawn errors when trying to start a session
with either "very large" or "huge" resource allocations.

It turned out that the e2 and n2 machine types our GCP node pools previously used have a bit more memory per CPU than the n1 type we switched to in #665. Since we only switched the instance type, without updating our mem_guarantee, the new n1 nodes couldn't fit a user pod, so spawns failed with an error whose key part is the text "pod didn't trigger scale-up".

[screenshot of the spawn error message]

I think this is resolved by @damianavila in #804.
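
To make the mismatch concrete: for the standard shapes, GCP's e2 and n2 machine types offer 4 GB of memory per vCPU while n1 offers 3.75 GB, so a mem_guarantee sized against an e2/n2 node can exceed what a same-sized n1 node can provide. A minimal sketch of the failing shape of configuration, with hypothetical values rather than the actual meom-ige ones:

```yaml
# Hypothetical "huge" profile sized for a 32-vCPU e2/n2 node (~128 GB memory).
# On an n1-standard-32 (~120 GB memory, and less than that allocatable once
# system/kubelet reservations are subtracted) the guarantee can no longer be
# satisfied: the user pod stays Pending and the cluster autoscaler reports
# "pod didn't trigger scale-up", since no node it could add would fit the pod.
- display_name: Huge
  kubespawner_override:
    mem_guarantee: 115G   # fits a ~128 GB node, too large for a ~120 GB one
    node_selector:
      node.kubernetes.io/instance-type: n1-standard-32
```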

Timeline (if relevant)

I'm opting not to invest the time to write the timeline in that much detail; it's more than I feel is justified.

After-action report

What went wrong

  • Erik didn't clearly understand what "Grenoble hub" referred to: he looked inside 2i2c-org/infrastructure under pilot-hubs, found a k8s namespace named grenoble, and assumed that was where the problem was. In reality, the problem was in the dedicated k8s cluster, not in the 2i2c-managed pilot-hubs k8s cluster.
  • We changed the machine type without updating the associated configuration in the affected clusters: the descriptions of available memory, the requested memory (mem_guarantee), and so on.

Action items

Process improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Documentation improvements

  • Add an inline comment next to the machine types in the Terraform .tfvars files noting that changing a machine type should go hand in hand with updating the available-memory descriptions in the profileList sections.
  • If we have dedicated machines, it's sufficient to request more than 50% of the available memory to prevent two user pods from scheduling on the same node. In our profileList sections we can add a comment next to mem_guarantee saying it should be more than 50% of the node's capacity but less than 100% so the pod actually fits; at roughly 75% of the available capacity we avoid trouble across the n1, n2, and e2 types (see the sketch below).
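
As a concrete illustration of that guideline, here is a minimal sketch of what such a profileList entry and comment could look like. The machine type, memory values, and display name are hypothetical, not the actual meom-ige configuration:

```yaml
jupyterhub:
  singleuser:
    profileList:
      - display_name: "Very large"
        description: "~20 GB RAM on a dedicated node"   # hypothetical sizing
        kubespawner_override:
          # Keep mem_guarantee above 50% of the node's allocatable memory so a
          # second user pod can never schedule onto the same node, but below
          # 100% so the pod still fits; ~75% of allocatable leaves headroom on
          # n1, n2, and e2 machine types alike.
          mem_guarantee: 20G   # roughly 75% of a hypothetical n1-standard-8's allocatable memory
          node_selector:
            node.kubernetes.io/instance-type: n1-standard-8
```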

Technical improvements

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report is cleaned up
  • All actionable items above have linked GitHub Issues
@choldgraf changed the title from "Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers" to "[Incident] Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers" on Nov 5, 2021
@choldgraf
Member

@yuvipanda also shared this helpful resource on understanding allocatable resources: https://learnk8s.io/allocatable-resources
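
For context on why "available memory" is smaller than the machine size: a node's capacity (the raw machine resources) is larger than its allocatable resources (what remains for pods after kubelet/system reservations and the eviction threshold), and it's the allocatable figure that mem_guarantee has to fit under. A trimmed example of the relevant part of `kubectl get node <node-name> -o yaml`, with illustrative numbers rather than values from the meom-ige cluster:

```yaml
# Illustrative figures only, not taken from the meom-ige cluster.
status:
  capacity:
    cpu: "32"
    memory: 125629360Ki     # the machine's raw memory, roughly 120 GB
  allocatable:
    cpu: 31850m
    memory: 114184112Ki     # what remains for user pods, roughly 109 GB
```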

@damianavila
Contributor

damianavila commented Nov 5, 2021

I have merged #804, but CI failed its tests (https://github.com/2i2c-org/infrastructure/runs/4120910918?check_suite_focus=true#step:9:205) after deploying staging, so I'm going with a manual deploy on prod.

@damianavila
Contributor

Manual deployment worked (although with a possibly flaky failed test as well).
Testing the spin-up of different profile options worked OK, including the previously broken very-large option.

$ kubectl get nodes | grep very-large
gke-meom-ige-cluster-nb-very-large-b79ad0a7-3k7n   Ready    <none>   6m40s   v1.19.12-gke.2101

and

$ kubectl get pods -n prod | grep damian
jupyter-damianavila                            1/1     Running   0          6m36s

So, I think we are good now.

PS: I will follow up on the test failure in another issue or Slack thread.

@consideRatio
Contributor Author

I'll close this since it's stale and I don't think it merits further action. I considered the action points listed and see them as mostly outdated; we should instead focus on #3030.
