Replies: 3 comments
-
Hi @alimanfoo, thanks for sharing the dream!! This definitely looks interesting. I'm curious to hear about your experiences if you've had a chance to give this a try. From what I can tell in this short article, using this autopilot feature only requires adding a flag when creating the cluster. When/if we make more progress on this, we can open an issue to track the progress.
-
Hi @iameskild, yes it does look straightforward to create an autopilot cluster. I am way out of my depth here, but I guess the question is how you then bypass all the machinery that's currently in nebari for managing nodes, including creating node pools and associating different types of pods with different node pools. I imagine that might take some figuring out.
-
Having done a little more reading about autopilot, it may not quite be able to provide the dream user experience of instantly scaling dask clusters up and down with demand, because apparently autopilot is still managing a kubernetes cluster with node pools behind the scenes. So if you ask to create and scale up a dask cluster, spare capacity may not be available on your cluster and autopilot will need to provision new nodes, which apparently takes around a minute. That's still not bad though, and it would be interesting to compare with GKE in standard mode.

Autopilot could still simplify life considerably for the system maintainer; not having to think about creating different node pools for different use cases, or about tailoring pod resource requests to maximise utilisation, is still a big potential benefit. On our nebari deployment we currently have three different node pools for running user VMs with different combinations of resources, one node pool for running dask schedulers, and two different node pools for running dask workers with different ratios of memory to CPU. I also burned a fair amount of time figuring out which machine types to use and how to size the pod memory and CPU requests to get good utilisation and keep costs down. It was all good learning, but not having to worry about any of that would be nice.
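As a rough sketch of how that comparison could be made (the worker count is purely illustrative, and this assumes the deployment exposes Dask Gateway, which nebari does), one could time how long it takes for requested workers to actually join the scheduler:

```python
import time
from dask_gateway import Gateway

gateway = Gateway()              # nebari configures the gateway address/auth for you
cluster = gateway.new_cluster()
client = cluster.get_client()

start = time.time()
cluster.scale(10)                # request 10 workers
client.wait_for_workers(10)      # blocks until all 10 have connected
print(f"10 workers ready in {time.time() - start:.0f}s")
```

On standard GKE this includes any node pool scale-up time; on autopilot it would include the roughly one minute of node provisioning mentioned above, so running the same snippet against both would give a like-for-like comparison.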
-
I'm sure you're already thinking about this, but just wanted to register that as someone who has been through the process of configuring and deploying qhub and nebari on GKE quite a bit, I am dreaming of how amazing it would be to be able to deploy on GKE autopilot.
From the configuration point of view, there would be no more worrying about setting up different node pools with different machine types and then trying to tweak pod resource requirements to maximise utilisation. IIUC, with autopilot we could just say what pods need and not worry about node pools at all, or about utilisation, because we'd be billed for the resources pods actually request.
From the user point of view, the experience with dask clusters would be awesome. No more waiting for nodes to be provisioned for scheduler or worker pods: ask for a cluster and get it fully scaled within seconds. This would make using cluster.adapt() really practical, because there would be no delays due to node pools scaling up and down. We currently don't use cluster.adapt() very often because having to wait for node pools to scale really breaks the flow. Without those delays we'd get a great user experience and be able to really trim back costs, with clusters scaling up only for the brief bursts when they're truly needed.
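To make that concrete, here's a minimal sketch of the adaptive pattern with Dask Gateway (the bounds are just illustrative, and nothing here is autopilot-specific; the point is that fast pod scheduling is what would make this pattern pleasant to use):

```python
from dask_gateway import Gateway

gateway = Gateway()               # nebari wires up the gateway endpoint for you
cluster = gateway.new_cluster()

# Let the cluster grow and shrink with the workload: workers only exist
# while there are pending tasks, so you only pay for the bursts.
cluster.adapt(minimum=0, maximum=20)

client = cluster.get_client()
# ... submit work as usual; with near-instant pod scheduling the workers
# would appear within seconds instead of waiting on node pools to scale.
```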
I'm sure it's not a small ask, but just thought I'd share the dream :-)