[Incident] Grenoble hub (meom-ige dedicated k8s cluster): failure to spawn "very large" or "huge" servers #806
Comments
@yuvipanda also shared this helpful resource on understanding allocatable resources: https://learnk8s.io/allocatable-resources

I have merged #804, but the CI failed during testing (https://github.com/2i2c-org/infrastructure/runs/4120910918?check_suite_focus=true#step:9:205) after deploying.
Manual deployment worked (although with a flaky? failed test as well).

    $ kubectl get nodes | grep very-large
    gke-meom-ige-cluster-nb-very-large-b79ad0a7-3k7n   Ready   <none>   6m40s   v1.19.12-gke.2101

and

    $ kubectl get pods -n prod | grep damian
    jupyter-damianavila   1/1   Running   0   6m36s

So, I think we are good now. PS. I will follow up on the test failure in another issue or Slack thread.
I'll go for a close here since it's stale and I don't think it merits further action. I considered the action points listed, and see them as mostly outdated; we should instead focus on #3030.
Summary
On 4th November, we got a support ticket in https://2i2c.freshdesk.com/a/tickets/40.
It turned out that the `e2` and `n2` machine types our GCP node pools previously used have a slightly different amount of memory per CPU than the `n1` type that we switched to in #665. Since we only switched the instance type, without updating our `mem_guarantee`, the new `n1` nodes couldn't fit a user pod on them, so we got an error where the key part is the text `pod didn't trigger scale-up`.

I think this is resolved by @damianavila in #804.
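To make the failure mode concrete, here is a minimal sketch of the fit check involved. It is not the actual scheduler or cluster autoscaler logic, and every number in it (machine memory, reserved fraction, guarantee) is a hypothetical value chosen for illustration, not taken from the meom-ige configuration. The helper names `allocatable_gib` and `fits_on_node` are also made up for this example.

```python
def allocatable_gib(machine_memory_gib: float, reserved_fraction: float) -> float:
    """Memory left for pods after the node reserves a share for system use.

    reserved_fraction is a simplification (assumption): real nodes reserve
    memory according to a tiered formula plus an eviction threshold.
    """
    return machine_memory_gib * (1.0 - reserved_fraction)


def fits_on_node(mem_guarantee_gib: float, node_allocatable_gib: float) -> bool:
    """True if a pod with this memory guarantee fits on an otherwise empty node."""
    return mem_guarantee_gib <= node_allocatable_gib


# Hypothetical numbers: a guarantee sized for the old machine type.
MEM_GUARANTEE_GIB = 110.0
RESERVED_FRACTION = 0.1

old_node = allocatable_gib(128.0, RESERVED_FRACTION)  # more memory per CPU
new_node = allocatable_gib(104.0, RESERVED_FRACTION)  # less memory per CPU

print("fits on old machine type:", fits_on_node(MEM_GUARANTEE_GIB, old_node))
print("fits on new machine type:", fits_on_node(MEM_GUARANTEE_GIB, new_node))
```

With these assumed values the guarantee fits on the old machine type but not on the new one. When a pod would not fit even on a freshly created node, adding a node cannot help, which is consistent with the `pod didn't trigger scale-up` message.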
Timeline (if relevant)
I'm opting not to invest the time to write this report in that much detail; it's more than I feel is justified.
After-action report
What went wrong
`grenoble`. Due to that, I assumed that was where the problem was. It turned out that it was in reality in the dedicated k8s cluster, which is not part of the 2i2c.org managed pilot-hubs k8s cluster.

Action items
Process improvements
Documentation improvements
Technical improvements
Actions