
Commit 73cad86

Add a resource allocation topic document
1 parent 9657576 commit 73cad86

File tree

2 files changed: +89 -0 lines changed


docs/index.md

Lines changed: 1 addition & 0 deletions

@@ -88,6 +88,7 @@ topic/access-creds/index.md
 topic/infrastructure/index.md
 topic/monitoring-alerting/index.md
 topic/features.md
+topic/resource-allocation.md
 ```

 ## Reference

docs/topic/resource-allocation.md

Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
# Resource Allocation on Profile Lists

This document lays out general guidelines for deciding what goes into the
list of resource allocation choices presented to the user in the profile
list, so they can make an informed choice about what they want without
getting overwhelmed.

This applies primarily to research hubs, not educational hubs.

## Factors to balance

1. **Server startup time**

   If everyone gets an instance just for themselves, servers take forever to
   start. Usually, many users are active at the same time, and we can decrease
   server startup time by putting many users on the same machine in a way that
   they don't step on each other's toes.

2. **Cloud cost**

   If we pick really large machines, fewer scale-up events need to be
   triggered, so server startup is much faster. However, we pay for instances
   regardless of how 'full' they are, so if we have a 64GB instance that only
   has 1GB used, we're paying extra for that. So a trade-off has to be made on
   *machine size*. This can be quantified, though, and that helps in making
   the trade-off.

3. **Resource *limits*, which the end user can consistently observe**

   Memory limits are easy to explain to end users - if you go over the memory
   limit, your kernel dies. If you go over the CPU limit, well, you can't -
   you get throttled. If we set limits appropriately, they will also helpfully
   show up in the status bar via
   [jupyter-resource-usage](https://github.com/jupyter-server/jupyter-resource-usage).

4. **Resource *requests*, which are harder for end users to observe**

   Requests are primarily meant for the *scheduler*, telling it how to pack
   users onto nodes for higher utilization. This relies on an
   'oversubscription' factor - the fact that most users don't actually use
   resources up to their limit. However, this factor varies from community to
   community and must be carefully tuned. Users may use more resources than
   they are guaranteed *sometimes*, but then get their kernels killed or CPU
   throttled at *some other times*, based on what *other* users are doing.
   This inconsistent behavior is confusing to end users, and we should be
   careful about how we tune it (see the sketch after this list).
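
To make limits and requests concrete, here is a minimal sketch of how they
might appear in a KubeSpawner `profile_list` entry (in `jupyterhub_config.py`).
The specific numbers and wording are illustrative assumptions, not
recommendations:

```python
# A sketch of one profile_list entry; all numbers are hypothetical.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Medium: up to 4GB RAM, 2 CPUs",
        "description": "Guaranteed 2GB RAM; the kernel dies above 4GB",
        "kubespawner_override": {
            # Limits: consistently observable by the user
            # (kernel killed over mem_limit, throttled at cpu_limit)
            "mem_limit": "4G",
            "cpu_limit": 2,
            # Requests (guarantees): used by the scheduler to pack
            # users onto nodes
            "mem_guarantee": "2G",  # 2x memory oversubscription
            "cpu_guarantee": 0.5,   # 4x CPU oversubscription
        },
    },
]
```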

So in summary, there are two kinds of factors:

1. **Noticeable by users**
   1. Server startup time
   2. Memory limit
   3. CPU limit

2. **Noticeable by infrastructure & hub admins**
   1. Cloud cost (proxied via utilization %; see the calculation below)
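
As a hypothetical illustration of utilization % as a cost proxy (all numbers
are made up):

```python
# Utilization of one 64GB node; values are illustrative only.
node_memory_gb = 64
user_memory_in_use_gb = [1.5, 3.0, 2.2, 0.8]  # actual per-user usage

utilization_pct = 100 * sum(user_memory_in_use_gb) / node_memory_gb
print(f"{utilization_pct:.1f}% utilized")  # 11.7% - the rest is paid for but idle
```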

The *variables* available to Infrastructure Engineers and hub admins to tune
are:

1. Size of instances offered.

2. "Oversubscription" factor for memory - this is the ratio of memory limit
   to memory guarantee. If users are using memory above the guarantee but
   below the limit, they *may* get their kernels killed. Based on our
   knowledge of a given community, we can tune this variable to reduce cloud
   cost while also reducing disruption from kernels being killed (a worked
   example follows this list).

3. "Oversubscription" factor for CPU. This is easier to handle, as CPUs can
   be *throttled* easily. A user may use 4 CPUs for a minute, but then go
   back to 2 CPUs the next minute without anything being "killed". This is
   unlike memory, where memory once given cannot be taken back - if a user is
   over their guarantee and another user who is *under* their guarantee needs
   the memory, the first user's kernel *will* be killed. Since this doesn't
   happen with CPUs, we can be more liberal in oversubscribing CPUs.
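
To make the memory oversubscription factor concrete, here is a small
illustrative calculation (the function name and numbers are assumptions, not
an existing convention):

```python
# Illustrative: derive the memory guarantee (request) from the limit
# and an oversubscription factor (ratio of limit to guarantee).
def memory_guarantee(mem_limit_gb: float, oversubscription: float) -> float:
    return mem_limit_gb / oversubscription

# With a 2x factor, a 4GB limit is only guaranteed 2GB, so a 64GB node
# can schedule roughly 32 such users (64 / 2) - betting that most of
# them stay well under their 4GB limit most of the time.
print(memory_guarantee(4, 2))  # 2.0
```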

## UX Goals

The goals when generating the list of resource allocation choices are the
following:

1. Profile options should be *automatically* generated by a script, with
   various options tunable by whoever is running it. Engineers should have an
   easy time making these choices (a sketch of such a script follows this
   list).

2. The *end user* should be able to easily understand the ramifications of
   the options they choose, and those ramifications should remain visible to
   them *after* they start their notebook as well.

3. It's alright for users who want *more resources* to have to wait longer
   for a server start than users who want fewer resources. This is an
   incentive to start with fewer resources and then size up.
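
A hypothetical sketch of what such a generator script might look like,
assuming choices are derived from the node size and a memory oversubscription
factor (none of these names come from an existing tool):

```python
# Hypothetical generator for resource allocation choices.
# All names and numbers are illustrative assumptions.
def generate_profile_options(node_memory_gb: float, oversubscription: float):
    """Offer halving memory limits, so users can start small and size up."""
    options = {}
    mem_limit = node_memory_gb
    while mem_limit >= 1:
        options[f"mem-{mem_limit:g}gb"] = {
            "display_name": f"Up to {mem_limit:g}GB RAM",
            "kubespawner_override": {
                "mem_limit": f"{mem_limit:g}G",
                "mem_guarantee": f"{mem_limit / oversubscription:g}G",
            },
        }
        mem_limit /= 2
    return options

# On a 64GB node with a 2x factor: limits of 64, 32, 16, 8, 4, 2, and
# 1 GB, each guaranteeing half of its limit.
options = generate_profile_options(64, 2)
```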
