Add script to generate nodeshare choices
When the end user looks at the profile list, it needs to be presented in such a way that they can make an informed choice about what to select, with specific behavior that is triggered whenever their usage goes over the selected numbers.

Factors
=======

- Server startup time! If everyone gets an instance just for themselves, servers take forever to start. Usually many users are active at the same time, and we can decrease server startup time by putting many users on the same machine in a way that they don't step on each other's toes.
- Cloud cost. If we pick really large machines, fewer scale-up events need to be triggered, so server startup is much faster. However, we pay for instances regardless of how 'full' they are, so if we have a 64GB instance that only has 1GB used, we're paying extra for that. So a trade-off has to be made for *machine size*. This can be quantified, though, which helps make the trade-off.
- Resource *limits*, which the end user can consistently observe. These are easy to explain to end users - if you go over the memory limit, your kernel dies. If you go over the CPU limit, well, you can't - you get throttled. If we set limits appropriately, they will also helpfully show up in the status bar via [jupyter-resource-usage](https://github.com/jupyter-server/jupyter-resource-usage).
- Resource *requests* are harder for end users to observe, as they are primarily meant for the *scheduler*, telling it how to pack user pods onto nodes for higher utilization. This has an 'oversubscription' factor, relying on the fact that most users don't actually use resources up to their limit. However, this factor varies from community to community and must be carefully tuned. Users may use more resources than they are guaranteed *sometimes*, but then get their kernels killed or CPU throttled at *some other times*, based on what *other* users are doing. This inconsistent behavior is confusing to end users, and we should be careful in figuring this out.

So in summary, there are two kinds of factors:

1. **Noticeable by users**
   1. Server startup time
   2. Memory limit
   3. CPU limit
2. **Noticeable by infrastructure & hub admins**:
   1. Cloud cost

The *variables* available to infrastructure engineers and hub admins to tune are:

1. Size of instances offered.
2. "Oversubscription" factor for memory - this is the ratio of memory limit to memory guarantee. If users are using memory > guarantee but < limit, they *may* get their kernels killed. Based on our knowledge of a given community, we can tune this variable to reduce cloud cost while also reducing disruption in the form of kernels being killed.
3. "Oversubscription" factor for CPU. This is easier to handle, as CPUs can be *throttled* easily. A user may use 4 CPUs for a minute, but then go back to 2 CPUs the next minute without anything being "killed". This is unlike memory, where memory, once given, can not be taken back. If a user is over the guarantee and another user who is *under* the guarantee needs the memory, the first user's kernel *will* be killed. Since this doesn't happen with CPUs, we can be more liberal in oversubscribing CPUs.
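To make the two oversubscription factors concrete, here is a minimal sketch (not the deployer's actual code; all names and numbers are illustrative) of how they translate limits into guarantees:

```python
# A minimal sketch of how the two oversubscription factors translate
# limits into guarantees (requests). Illustrative only.

def requests_from_limits(mem_limit_gb: float, cpu_limit: float,
                         mem_oversubscription: float,
                         cpu_oversubscription: float) -> dict:
    """Oversubscription factor = limit / guarantee, so guarantee = limit / factor."""
    return {
        "mem_guarantee_gb": mem_limit_gb / mem_oversubscription,
        "cpu_guarantee": cpu_limit / cpu_oversubscription,
    }

# Memory is overcommitted cautiously (kernels get killed past the guarantee),
# CPU more liberally (it is only throttled).
print(requests_from_limits(mem_limit_gb=8, cpu_limit=4,
                           mem_oversubscription=2, cpu_oversubscription=8))
# -> {'mem_guarantee_gb': 4.0, 'cpu_guarantee': 0.5}
```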
Goals
=====

The goals are the following:

1. Profile options should be *automatically* generated by a script, with various options to be tuned by whoever is running it. Engineers should have an easy time making these choices.
2. The *end user* should be able to easily understand the ramifications of the options they choose, and those ramifications should be visible to them *after* they start their notebook as well.
3. It's alright for users who want *more resources* to have to wait longer for their server to start than users who want fewer resources. This is an incentive to start with fewer resources and then size up.

Generating Choices
==================

This PR adds a new deployer command, `generate-resource-allocation-choices`, to be run by an engineer setting up a hub. It currently supports a *single* node type, and will generate appropriate *Resource Allocation* choices based on a given strategy. This PR implements one specific strategy - proportionate memory allocation - that has been discussed with the Openscapes community (#2882) and is expected to work well for them, and that might be useful for other communities as well.

Proportionate Memory Allocation Strategy
========================================

Used primarily in research cases where:

1. Workloads are more memory constrained than CPU constrained.
2. End users can be expected to select an appropriate amount of memory for a given workload, either from their own knowledge or as directed by an instructor.

It features:

1. No memory overcommit at all, as end users are expected to ask for as much memory as they need.
2. CPU *guarantees* proportional to the memory guarantee - the more memory you ask for, the more CPU you are guaranteed. This allows end users to pick resources based on memory alone, simplifying the mental model. It also allows for maximum packing of user pods onto a node, as we will *not* run out of CPU on a node before running out of memory.
3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee ensures that users will not be starved of CPU.
4. Each choice the user can make has approximately half as many resources as the next largest choice, with the largest being a full node. This offers a decent compromise - if you pick the largest option, you will most likely have to wait for a full node to spawn, while smaller options are much more likely to be shared. (A sketch of this halving scheme appears after the Node Capacity Information section below.)

In the future, other strategies will be added and experimented with.

Node Capacity Information
=========================

To generate these choices, we must have node capacity information - in particular, exactly how much RAM and CPU are available for *user pods* on nodes of a particular type. Instead of using heuristics here, we calculate this *accurately*:

    Resources available = Node capacity - System components (kubelet, systemd, etc.) - Daemonsets

A JSON file, `node-capacity-info.json`, holds this information and is updated with the `update-node-capacity-info` command. This requires that a node of the given instance type is actively running so we can perform these calculations. It will need to be recalculated every time we upgrade Kubernetes (as system components might take more resources) or adjust resource allocation for our daemonsets. This has been generated in this PR for a couple of common instance types.
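As a rough illustration of the proportionate memory strategy, here is a minimal sketch (not the actual deployer code; the function name and the allocatable figures are hypothetical, standing in for values that would come from `node-capacity-info.json`):

```python
# A minimal sketch of the proportionate memory strategy: each choice has
# roughly half the resources of the next largest one, the largest choice is
# a full node, memory is never overcommitted, CPU guarantees scale with
# memory, and no CPU limit is set. Names and numbers are illustrative.

def proportionate_memory_choices(allocatable_mem_gb: float,
                                 allocatable_cpu: float,
                                 num_choices: int = 5) -> list[dict]:
    choices = []
    for i in range(num_choices):
        fraction = 1 / (2 ** i)               # 1, 1/2, 1/4, ...
        mem = allocatable_mem_gb * fraction
        choices.append({
            "mem_limit": f"{mem:.1f}G",
            "mem_guarantee": f"{mem:.1f}G",   # no memory overcommit
            "cpu_guarantee": round(allocatable_cpu * fraction, 3),
            # no cpu_limit key: CPU is deliberately left unlimited
        })
    return list(reversed(choices))            # smallest choice first

# Example: a node with ~60G RAM and ~15.6 CPUs left for user pods after
# system components and daemonsets (hypothetical figures).
for choice in proportionate_memory_choices(60.0, 15.6):
    print(choice)
```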
TODO
====

- [ ] Documentation on how to update `node-capacity-info.json`
- [ ] Documentation on how to generate choices, and when to use them
- [ ] Documentation on how to choose the instance size

Co-authored-by: Erik Sundell <erik.i.sundell@gmail.com>