# Resource Allocation on Profile Lists

This document lays out general guidelines for deciding what goes into the
list of resource allocation choices presented to the user in the profile
list, so they can make an informed choice about what they want without
getting overwhelmed.

This primarily applies to research hubs, not educational hubs.

## Factors to balance

1. **Server startup time**

   If every user gets an instance just for themselves, servers take forever
   to start. Usually, many users are active at the same time, and we can
   decrease server startup time by putting many users on the same machine in
   a way that they don't step on each other's toes.

2. **Cloud cost**

   If we pick really large machines, fewer scale-up events need to be
   triggered, so server startup is much faster. However, we pay for
   instances regardless of how 'full' they are - if we have a 64GB instance
   with only 1GB used, we're paying extra for that. So a trade-off has to be
   chosen for *machine size*. This trade-off can be quantified, which helps
   in making it.

3. **Resource *limits*, which the end user can consistently observe**

   Memory limits are easy to explain to end users: if you go over the
   memory limit, your kernel dies. If you go over the CPU limit, well, you
   can't - you get throttled. If we set limits appropriately, they will also
   helpfully show up in the status bar, via
   [jupyter-resource-usage](https://github.com/jupyter-server/jupyter-resource-usage).

4. **Resource *requests*, which are harder for end users to observe**

   Requests are primarily meant for the *scheduler*, telling it how to pack
   user pods onto nodes for higher utilization. This involves an
   'oversubscription' factor, relying on the fact that most users don't
   actually use resources up to their limit. However, this factor varies
   from community to community, and must be carefully tuned. Users may use
   more resources than they are guaranteed *sometimes*, but then get their
   kernels killed or their CPU throttled at *other times*, based on what
   *other* users are doing. This inconsistent behavior is confusing to end
   users, and we should be careful to figure this out.

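The machine-size trade-off in factor 2 can be made concrete. A minimal sketch, with a hypothetical node size and hourly price:

```python
# Sketch: quantify what we pay for unused capacity on a node.
# The node size and hourly price below are hypothetical.

def utilization_cost(node_mem_gb, hourly_price, used_mem_gb):
    """Return (utilization fraction, $/hr paid for unused memory)."""
    utilization = used_mem_gb / node_mem_gb
    wasted = hourly_price * (1 - utilization)
    return utilization, wasted

# A 64GB node with only 1GB in use: almost the entire price is waste.
util, wasted = utilization_cost(node_mem_gb=64, hourly_price=0.50, used_mem_gb=1)
print(f"utilization={util:.1%}, wasted=${wasted:.2f}/hr")
# → utilization=1.6%, wasted=$0.49/hr
```

Tracking a number like this over time is what lets us choose the trade-off for a given community rather than guessing.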
So in summary, there are two kinds of factors:

1. **Noticeable by users**
   1. Server startup time
   2. Memory limit
   3. CPU limit

2. **Noticeable by infrastructure engineers & hub admins**
   1. Cloud cost (proxied via utilization %)

The *variables* available to infrastructure engineers and hub admins to tune
are:

1. Size of instances offered

| 60 | +2. "Oversubscription" factor for memory - this is ratio of memory limit to |
| 61 | + memory guarantee. If users are using memory > guarantee but < limit, they *may* |
| 62 | + get their kernels killed. Based on our knowledge of this community, we can tune |
| 63 | + this variable to reduce cloud cost while also reducing disruption in terms of |
| 64 | + kernels being killed |
| 65 | + |
| 66 | +3. "Oversubscription" factor for CPU. This is easier to handle, as CPUs can be |
| 67 | + *throttled* easily. A user may use 4 CPUs for a minute, but then go back to 2 |
| 68 | + cpus next minute without anything being "killed". This is unlike memory, where |
| 69 | + memory once given can not be taken back. If a user is over the guarantee and |
| 70 | + another user who is *under* the guarantee needs the memory, the first users's |
| 71 | + kernel *will* be killed. Since this doesn't happen with CPUs, we can be more |
| 72 | + liberal in oversubscribing CPUs. |
| 73 | + |
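The two oversubscription factors can be turned into concrete per-profile settings. A minimal sketch - the factor values are invented and need per-community tuning, and while the dictionary keys mirror KubeSpawner's `mem_limit` / `mem_guarantee` / `cpu_limit` / `cpu_guarantee` traits, this is not the actual tuning code:

```python
# Sketch: derive resource guarantees from limits and an oversubscription
# factor (guarantee = limit / factor). Factor values here are invented.

def allocation(mem_limit_gb, cpu_limit,
               mem_oversubscription=2.0, cpu_oversubscription=4.0):
    """Return KubeSpawner-style overrides for one profile option."""
    return {
        "mem_limit": f"{mem_limit_gb}G",
        "mem_guarantee": f"{mem_limit_gb / mem_oversubscription}G",
        "cpu_limit": cpu_limit,
        # CPUs can be throttled rather than killed, so we oversubscribe
        # them more aggressively than memory.
        "cpu_guarantee": cpu_limit / cpu_oversubscription,
    }

print(allocation(mem_limit_gb=4, cpu_limit=2))
# → {'mem_limit': '4G', 'mem_guarantee': '2.0G', 'cpu_limit': 2, 'cpu_guarantee': 0.5}
```
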
## UX Goals

The goals when generating the list of resource allocation choices are the
following:

1. Profile options should be *automatically* generated by a script, with
   various options to be tuned by whoever is running it. Engineers should
   have an easy time making these choices.

2. The *end user* should be able to easily understand the ramifications of
   the options they choose, and those ramifications should remain visible
   to them *after* they start their notebook as well.

3. It's alright for users who want *more resources* to have to wait longer
   for a server start than users who want fewer resources. This is an
   incentive to start with fewer resources and then size up.
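
A generation script in the spirit of goal 1 might enumerate power-of-two memory choices for a given instance size. A minimal sketch, with illustrative function names, sizes, and display strings only - not the actual generator:

```python
# Sketch: auto-generate profile list choices for one instance type.
# Instance sizes and display strings are illustrative only.

def generate_choices(node_mem_gb, smallest_gb=1):
    """Power-of-two memory limits from smallest_gb up to a full node."""
    choices = []
    mem = smallest_gb
    while mem <= node_mem_gb:
        choices.append({
            "display_name": f"{mem} GB RAM",
            # Users per node drops as the per-user limit grows; a full-node
            # choice means every start may trigger a scale-up (slower, per
            # UX goal 3).
            "users_per_node": node_mem_gb // mem,
            "mem_limit": f"{mem}G",
        })
        mem *= 2
    return choices

for c in generate_choices(node_mem_gb=8):
    print(c["display_name"], "-", c["users_per_node"], "users/node")
```

For an 8GB node this yields four choices, from 1GB (8 users per node, fast starts) up to 8GB (1 user per node, likely a scale-up wait).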