Add script to generate resource allocation (nodeshare) choices #3030
46e470f to 33a2b30
Based on discussing how profiles were actually being used with the openscapes folks (2i2c-org#2882). Generated via the scripts in 2i2c-org#3030.
I love this <3!!! Thank you @yuvipanda
# We operate on *available* memory, which already accounts for system components (like kubelet & systemd)
# as well as daemonsets we run on every node. This represents the resources that are available
# for user pods.
available_node_mem = nodeinfo["available"]["memory"]
available_node_cpu = nodeinfo["available"]["cpu"]
Change needed - headroom for memory/cpu requests
Motivation
This script operates on available memory/CPU based on one-off measurements, but there are several reasons why relying on those is tricky:
- Instance types
  They are relatively easy to adjust to I think, we make a measurement in k8s for each node we plan to use.
- The managed k8s cluster's needs changing
  The daemonsets running on each node as managed by the k8s service may vary depending on features enabled (config connector, network policy enforcement, logging), k8s version, and needs determined by vertical autoscaling.
- Our needs changing
  We could end up wanting to add more CPU/RAM requests to the cryptnono daemonset for example, or add another service to run on each user node - then we need to account for it also.
With that in mind, I think we shouldn't try to be so accurate - because if we are, there is no buffer to still manage to schedule a user requesting 100%.
Suggestion
I suggest we look at the following to establish a conservative baseline, and then add some headroom to that, such as 100m CPU and maybe 400MB RAM.
- the capacity of RAM/CPU exposed to k8s pods for various machine types
- the overhead from system pods in GKE, EKS, and AKS respectively - looking at several clusters of each provider to capture variations between enabled features and possibly also k8s versions
- the overhead from our support chart's pods
I agree we should add some overhead here. I'll incorporate that. However, we should still try to be as accurate as possible - you will note that the code that measures how much is available takes into consideration all the other factors you listed, including the pods that are run as part of our support charts, and whatever it is that the clusters themselves run. I'll actually work on making this even more automated. The overhead is to allow for drift here, as this can change without us noticing - I'll try to figure out how best we can 'notice' and correct.
@consideRatio 62644fe adds some more flexibility, but much lower than what you recommended. This is because the script already accounts for the three things you have listed as needing headroom. This flexibility is now purely to account for changes that we miss. I'll work on listing when update-nodeinfo needs to be run, have some inline comments in there.
Choice about overlapping requests
In the setup above, each profile list entry represents a different machine type to run on. This means that you could have requests like ~4 CPU / ~32 GB on an n2-highmem-4 instance and ~4 CPU / ~32 GB on an n2-highmem-16 instance that overlap, resulting in very similar requests on different machine types. I think it's a good decision to reduce those similar choices to just one, but it's not obvious which one, and it depends on usage patterns - patterns which could also change depending on whether there is an event or not.
I think there is an incremental improvement to go for in the future, where we allow the cutoff between two server types to be moved. For example, 1, 2, 4, 8, 16, 32 GB requests currently go to n2-highmem-4 machines, and 64, 128 GB requests go to n2-highmem-16 machines, but it could be useful to allow that to slide so that the 16 and 32 GB requests end up on the highmem-16 machine instead - fitting up to 8 users on those.
EDIT: #3262 is a PR to use a 128 GB machine instead of a 32 GB machine by default during an event for resource requests of ~16 and ~32 GB, even though they would fit on a ~32 GB machine - because it puts more users per node and reduces overall startup delay.
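The sliding cutoff described above could be sketched roughly as follows. This is only an illustration: `MachineType`, `assign_machine`, and the `cutoff_gb` parameter are hypothetical names, not part of the deployer script, and memory sizes are nominal (overhead is ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineType:
    name: str
    mem_gb: int  # nominal memory, ignoring system overhead


def assign_machine(request_gb: int, small: MachineType, large: MachineType, cutoff_gb: int) -> MachineType:
    """Pick the machine type for a memory request.

    Requests at or above cutoff_gb go to the larger machine, even when
    they would also fit on the smaller one.
    """
    if request_gb > large.mem_gb:
        raise ValueError(f"{request_gb} GB does not fit on {large.name}")
    if request_gb >= cutoff_gb or request_gb > small.mem_gb:
        return large
    return small


small = MachineType("n2-highmem-4", 32)
large = MachineType("n2-highmem-16", 128)
requests = [1, 2, 4, 8, 16, 32, 64, 128]

# Cutoff at 64 GB: 1-32 GB requests land on the smaller machine.
default = {r: assign_machine(r, small, large, cutoff_gb=64).name for r in requests}

# Cutoff moved down to 16 GB, e.g. during an event: 16 and 32 GB requests
# now share the larger machine with other users.
event = {r: assign_machine(r, small, large, cutoff_gb=16).name for r in requests}

users_per_large_node = large.mem_gb // 16  # nominal count of 16 GB users per node
```

Keeping the cutoff as a single tunable number would let the same set of choices be regenerated for an event without editing each profile entry by hand.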
@consideRatio there's two things with respect to overlap:
So the TODO out of here is:
Checking in on openscapes since this was deployed to them on Aug 25.
Baseline cloud costs
So, openscapes got the older style node sharing when #2684 was merged, on June 21. And on Aug 25, we switched to the setup generated by this script. While there are some confounding factors (primarily, some events), I think the baseline cost has definitely gone way down with this new setup!
Startup speed
The baseline cost has come down, but is this at the cost of startup speed? There's no discernible difference in server startup speeds! (Data missing for big parts of July though)
Qualitative feedback
Talking to folks in the openscapes slack, there has been a generally positive response to this. End users are less confused about what is needed, and the limits are now visible in JupyterLab. I'll get working on documenting this so others can use it too, but I now believe that for at least openscapes style hubs, this is a good improvement over status quo.
I would, however, like this to be a little more automated than it is right now. In particular, I don't want us to have to manually tweak all of these options every time we update node capacity information, as that's error-prone and toil-y. I'll look at ways of making that happen.
# Add a little bit of wiggle room, to account for:
# 1. Changes in requests for system components as k8s versions upgrade or
#    cloud providers roll out new components
# 2. We deploy support components but forget to update node capacity info
# 3. Whatever other things we aren't currently thinking of.
# A small amount of memory and CPU to sacrifice for the sake of
# operational flexibility. However, we *must* regenerate and update node
# information each time the following events occur:
# 1. We upgrade kubernetes versions
# 2. We change resource requirements for *daemonsets* in our support chart.
# 3. We upgrade z2jh version

# 128 MiB memory buffer
mem_available = mem_available - (128 * 1024 * 1024)
# 0.05 CPU overhead
cpu_available = cpu_available - 0.05
I'm leaning heavily towards ensuring that we don't run into "fail to start server" issues caused by this over minimizing the headroom. Can we make it at least something like 256 Mi and 0.15 CPU headroom?
I think a memory overhead of 128 MiB is cutting it closer than merited, given our nodes have 32GB+ and a failure in getting this right can cause an issue that we won't observe before we have a runtime failure detected by users. By not cutting it close, the memory can also help provide a buffer for when the node includes workloads that don't have requests=limits, so it's also not wasted.
I think the CPU overhead is too small as well, mostly because I don't see a good enough motivation for cutting it close, since the CPU overhead won't go to waste as long as we don't set super tight CPU limits.
Thank you for writing this out clearly, @consideRatio.
> I think the CPU overhead is too small as well, mostly because I don't see a good enough motivation for cutting it close, since the CPU overhead won't go to waste as long as we don't set super tight CPU limits.
I think this makes 100% sense for the one strategy introduced in this PR, as it only tackles CPU requests, not limits. So I'll increase the headroom to 0.15 (or 0.20) but move the headroom calculation to the strategy code, so that when we introduce other strategies in the future, they can make their own choices.
I'm trying to understand when exactly this will actually cause a "fail to start server" issue, rather than an issue where utilization of a node is not 100% when packed. I think this only happens in the case where the resource allocation sets the guarantee to 100% of available memory. Looking at the set of choices currently deployed to openscapes, that's 2 of the 6 choices. So I understand that in those cases, if circumstances change in such a way that the static calculation here for 'available' no longer matches, users would end up with a hanging pod that never gets scheduled anywhere. In the other 4 choices, what will happen instead is wastage of resources, as a new node will be spun up even though the user might have fit on the previous node.
So to handle the memory requests case, I will do the following:
1. Increase the general overhead calculation, from which all resource allocations are measured, from 128Mi to 256Mi.
2. Specifically for the choices where we are allocating a full node, increase this number even further (probably to 512Mi), as a measurement mismatch here would cause a bigger user problem than with (1).
3. Open an issue to consider changing the strategy to have limits that don't count the overhead, and requests that do count the overhead. However, this makes the process a little more complex, and so I'd prefer to not do this on the first run. Sacrificing a little bit of extra RAM for the simplicity seems a worthwhile tradeoff.
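To make the two failure modes above concrete, here is a minimal sketch of the two-tier headroom idea. It is not the code in this PR: the function names, the 30000 MiB node, and the 300 MiB of drift are made-up values for illustration, and the 512Mi figure is the tentative number mentioned above.

```python
MIB = 1024 * 1024

GENERAL_HEADROOM = 256 * MIB    # applied to every choice
FULL_NODE_HEADROOM = 512 * MIB  # tentative, applied only to full-node choices


def memory_guarantee(
    measured_available: int,
    users_per_node: int,
    general_headroom: int = GENERAL_HEADROOM,
    full_node_headroom: int = FULL_NODE_HEADROOM,
) -> int:
    """Memory guarantee (bytes) for a choice that splits a node users_per_node ways."""
    headroom = full_node_headroom if users_per_node == 1 else general_headroom
    return (measured_available - headroom) // users_per_node


def users_that_fit(guarantee: int, actual_available: int) -> int:
    """How many pods with this guarantee the scheduler can place on one node."""
    return actual_available // guarantee


# Hypothetical node: measured once at 30000 MiB available, but system
# components have since grown by 300 MiB without the measurement being redone.
measured = 30000 * MIB
actual = measured - 300 * MIB

full_node = memory_guarantee(measured, users_per_node=1)
tight_full_node = memory_guarantee(measured, users_per_node=1, full_node_headroom=128 * MIB)
half_node = memory_guarantee(measured, users_per_node=2)

print(users_that_fit(full_node, actual))        # full node with the larger headroom
print(users_that_fit(tight_full_node, actual))  # full node with only 128 MiB headroom
print(users_that_fit(half_node, actual))        # half-node choice
```

With these made-up numbers, the full-node choice with 128 MiB of headroom can never be scheduled, while the half-node choice still schedules but only one user fits per node instead of two, which is the wastage case.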
I think over time, as we gain more experience with this strategy as well as data, we can fine tune this some more. Ideally, the "mem_available" data would be dynamic, rather than static - I think this would increase confidence level that this drift would not occur. I'll think of ideas how this can be done, but won't block progress on this PR on that.
Does this satisfy you, @consideRatio?
Thank you for that chat Yuvi!! I opened #3132 to describe my proposed simple strategy
When the end user looks at the profile list, the list needs to be presented in such a way that they can make an informed choice on what to select, with specific behavior that is triggered whenever their usage goes over the selected numbers.

Factors
=======

- Server startup time! If everyone gets an instance just for themselves, servers take forever to start. Usually, many users are active at the same time, and we can decrease server startup time by putting many users on the same machine in a way that they don't step on each other's toes.
- Cloud cost. If we pick really large machines, fewer scale up events need to be triggered, so server startup is much faster. However, we pay for instances regardless of how 'full' they are, so if we have a 64GB instance that only has 1GB used, we're paying extra for that. So a trade-off has to be chosen for *machine size*. This can be quantified though, and help make the tradeoff.
- Resource *limits*, which the end user can consistently observe. These are easy to explain to end users - if you go over the memory limit, your kernel dies. If you go over the CPU limit, well, you can't - you get throttled. If we set limits appropriately, they will also helpfully show up in the status bar, with [jupyter-resource-usage](https://github.com/jupyter-server/jupyter-resource-usage)
- Resource *requests* are harder for end users to observe, as they are primarily meant for the *scheduler*, on how to pack user pods together for higher utilization. This has an 'oversubscription' factor, relying on the fact that most users don't actually use resources up to their limit. However, this factor varies community to community, and must be carefully tuned. Users may use more resources than they are guaranteed *sometimes*, but then get their kernels killed or CPU throttled at *some other times*, based on what *other* users are doing. This inconsistent behavior is confusing to end users, and we should be careful to figure this out.

So in summary, there are two kinds of factors:

1. **Noticeable by users**
   1. Server startup time
   2. Memory Limit
   3. CPU Limit
2. **Noticeable by infrastructure & hub admins**:
   1. Cloud cost

The *variables* available to Infrastructure Engineers and hub admins to tune are:

1. Size of instances offered
2. "Oversubscription" factor for memory - this is the ratio of memory limit to memory guarantee. If users are using memory > guarantee but < limit, they *may* get their kernels killed. Based on our knowledge of this community, we can tune this variable to reduce cloud cost while also reducing disruption in terms of kernels being killed
3. "Oversubscription" factor for CPU. This is easier to handle, as CPUs can be *throttled* easily. A user may use 4 CPUs for a minute, but then go back to 2 CPUs the next minute without anything being "killed". This is unlike memory, where memory once given cannot be taken back. If a user is over the guarantee and another user who is *under* the guarantee needs the memory, the first user's kernel *will* be killed. Since this doesn't happen with CPUs, we can be more liberal in oversubscribing CPUs.

Goals
=====

The goal is the following:

1. Profile options should be *automatically* generated by a script, with various options to be tuned by whoever is running it. Engineers should have an easy time making these choices.
2. The *end user* should be able to easily understand the ramifications of the options they choose, and it should be visible to them *after* they start their notebook as well.
3. It's alright for users who want *more resources* to have to wait longer for a server start than users who want fewer resources. This is an incentive to start with fewer resources and then size up.

Generating Choices
==================

This PR adds a new deployer command, `generate-resource-allocation-choices`, to be run by an engineer setting up a hub. It currently supports a *single* node type, and will generate appropriate *Resource Allocation* choices based on a given strategy. This PR implements one specific strategy, developed in discussion with the Openscapes community (2i2c-org#2882), that might be useful for other communities as well - the proportionate memory choice.

Proportionate Memory Allocation Strategy
========================================

Used primarily in research cases where:

1. Workloads are more memory constrained than CPU constrained
2. End users can be expected to select the appropriate amount of memory they need for a given workload, either from their own knowledge or as directed by an instructor.

It features:

1. No memory overcommit at all, as end users are expected to ask for as much memory as they need.
2. CPU *guarantees* are proportional to the memory guarantee - the more memory you ask for, the more CPU you are guaranteed. This allows end users to pick resources based on memory only, simplifying the mental model. It also allows for maximum packing of user pods onto a node, as we will *not* run out of CPU on a node before running out of memory.
3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure that users will not be starved of CPU.
4. Each choice the user can make has approximately half as many resources as the next largest choice, with the largest being a full node. This offers a decent compromise - if you pick the largest option, you will most likely have to wait for a full node spawn, while smaller options are much more likely to be shared.

In the future, other strategies would be added and experimented with.

Node Capacity Information
=========================

To generate these choices, we must have Node Capacity Information - particularly, exactly how much RAM and CPU is available for *user pods* on nodes of a particular type. Instead of using heuristics here, we calculate this *accurately*:

    Resource Available = Node Capacity - System Components (kubelet, systemd, etc) - Daemonsets

A json file, `node-capacity-info.json`, has this information and is updated with the command `update-node-capacity-info`. This requires a node with the given instance type to be actively running so we can perform these calculations. This will need to be recalculated every time we upgrade kubernetes (as system components might take more resources) or adjust resource allocation for our daemonsets. This has been generated in this PR for a couple of common instances.

TODO
====

- [ ] Documentation on how to update `node-capacity-info.json`
- [ ] Documentation on how to generate choices, and when to use these
- [ ] Documentation on how to choose the instance size

Co-authored-by: Erik Sundell <erik.i.sundell@gmail.com>
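As a rough illustration of the capacity formula and the proportionate memory strategy described in this commit message, the following sketch derives the available resources and then halves them repeatedly. It is not the deployer's actual implementation: the function names, dictionary keys, and all node numbers are assumptions made for the example, and it leaves out the headroom discussed in the review comments.

```python
GIB = 1024**3


def available(capacity: int, system_components: int, daemonsets: int) -> int:
    """Resource Available = Node Capacity - System Components - Daemonsets."""
    return capacity - system_components - daemonsets


def proportionate_choices(mem_available: int, cpu_available_m: int, num_choices: int) -> list[dict]:
    """Generate choices from smallest to largest, each about half of the next.

    Memory is in bytes and CPU is in millicores. The largest choice is a
    full node. Keys are hypothetical, not the script's real output format.
    """
    choices = []
    for i in range(num_choices):
        divisor = 2 ** (num_choices - 1 - i)
        mem = mem_available // divisor
        choices.append(
            {
                "mem_guarantee": mem,
                "mem_limit": mem,  # no memory overcommit: limit equals guarantee
                "cpu_guarantee_m": cpu_available_m // divisor,  # proportional to memory
                "cpu_limit_m": None,  # no CPU limit at all
            }
        )
    return choices


# Made-up node: 32 GiB / 4 CPU capacity with made-up overheads.
mem_available = available(32 * GIB, 2 * GIB, 1 * GIB)
cpu_available_m = available(4000, 80, 20)

choices = proportionate_choices(mem_available, cpu_available_m, num_choices=6)

for choice in choices:
    print(f"{choice['mem_guarantee'] / GIB:.3f} GiB, {choice['cpu_guarantee_m']}m CPU guaranteed")
```

Because memory and CPU are divided by the same factor, a node packed full by memory is also exactly full by CPU guarantee, which is why CPU cannot run out first under this strategy.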
Allows us to construct multiple profile choices with multiple node types.
aa27637 to 73cad86
Thank you for working on this @yuvipanda! Based on agreement via other channels, let's merge this and iterate from it in separate PRs over time.
🎉🎉🎉🎉 Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/6624047336
Thanks to @consideRatio for working on a lot of this earlier.