Replies: 3 comments
-
Hi @alimanfoo, thanks for sharing the dream!! This definitely looks interesting. I'm curious to hear about your experiences if you've had a chance to give this a try. From what I can tell in this short article, using this autopilot feature only requires adding a flag when creating the cluster. When/if we make more progress on this, we can open an issue to track the progress.
-
Hi @iameskild, yes it does look straightforward to create an autopilot cluster. I am way out of my depth here, but I guess the question is how you then bypass all the machinery that's currently in nebari for managing nodes, including creating node pools and associating different types of pods with different node pools. I imagine that might take some figuring out.
-
Having done a little more reading about autopilot, it may not quite be able to provide the dream user experience of instantly scaling dask clusters up and down with demand, because apparently autopilot is still managing a kubernetes cluster with node pools behind the scenes. So if you ask to create and scale up a dask cluster, spare capacity may not be available on your cluster and autopilot will need to provision new nodes, which apparently takes around a minute. That's still not bad though, and it would be interesting to compare with GKE in standard mode.

Autopilot could still simplify life considerably for the system maintainer; not having to think about creating different node pools for different use cases, or about tailoring pod resource requests to maximise utilisation, is still a big potential benefit. On our nebari deployment we currently have three different node pools for running user VMs with different combinations of resources, one node pool for running dask schedulers, and two different node pools for running dask workers with different ratios of memory to CPU. I also burned a fair amount of time figuring out which machine types to use and how to size the pod memory and CPU requests to get good utilisation and keep costs down. It was all good learning, but not having to worry about any of that would be nice.
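As a rough sketch of how that comparison could be made (the worker count is purely illustrative, and this assumes the deployment exposes Dask Gateway, which nebari does), one could time how long it takes for requested workers to actually join the scheduler:

```python
import time
from dask_gateway import Gateway

gateway = Gateway()              # nebari configures the gateway address/auth for you
cluster = gateway.new_cluster()
client = cluster.get_client()

start = time.time()
cluster.scale(10)                # request 10 workers
client.wait_for_workers(10)      # blocks until all 10 have connected
print(f"10 workers ready in {time.time() - start:.0f}s")
```

On standard GKE this includes any node pool scale-up time; on autopilot it would include the roughly one minute of node provisioning mentioned above, so running the same snippet against both would give a like-for-like comparison.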
-
I'm sure you're already thinking about this, but just wanted to register that as someone who has been through the process of configuring and deploying qhub and nebari on GKE quite a bit, I am dreaming of how amazing it would be to be able to deploy on GKE autopilot.
From the configuration point of view, there would be no more worrying about setting up different node pools with different machine types and then trying to tweak pod resource requirements to maximise utilisation. IIUC, with autopilot we could just say what pods need and not worry about node pools at all, or about utilisation, because we'd be billed for the resources pods actually request.
From the user point of view, the experience with dask clusters would be awesome. No more waiting for nodes to be provisioned for scheduler or worker pods: ask for a cluster and get it fully scaled within seconds. This would make using cluster.adapt() really practical, because there would be no delays due to node pools scaling up and down. We currently don't use cluster.adapt() very often because having to wait for node pools to scale really breaks the flow. Without those delays we'd get a great user experience and be able to really trim back costs, with clusters scaling up only for the brief bursts when they're truly needed.
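To make that concrete, here's a minimal sketch of the adaptive pattern with Dask Gateway (the bounds are just illustrative, and nothing here is autopilot-specific; the point is that fast pod scheduling is what would make this pattern pleasant to use):

```python
from dask_gateway import Gateway

gateway = Gateway()               # nebari wires up the gateway endpoint for you
cluster = gateway.new_cluster()

# Let the cluster grow and shrink with the workload: workers only exist
# while there are pending tasks, so you only pay for the bursts.
cluster.adapt(minimum=0, maximum=20)

client = cluster.get_client()
# ... submit work as usual; with near-instant pod scheduling the workers
# would appear within seconds instead of waiting on node pools to scale.
```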
I'm sure it's not a small ask, but just thought I'd share the dream :-)