Add script to generate nodeshare choices
When the end user looks at the profile list, it needs to be presented in
such a way that they can make an informed choice about what to select, and
understand what behavior is triggered whenever their usage goes over the
numbers they selected.

Factors
=======

- Server startup time! If everyone gets an instance just for themselves, servers
  take forever to start. Usually, many users are active at the same time, and we
  can decrease server startup time by putting many users on the same machine in
  a way that they don't step on each other's feet.

- Cloud cost. If we pick really large machines, fewer scale-up events need to
  be triggered, so server startup is much faster. However, we pay for instances
  regardless of how 'full' they are, so if we have a 64GB instance that only has
  1GB used, we're paying extra for that. So a trade-off has to be made on
  *machine size*. This can be quantified, though, which helps in making the trade-off.

- Resource *limits*, which the end user can consistently observe. These are
  easy to explain to end users - if you go over the memory limit, your kernel
  dies. If you go over the CPU limit, well, you can't - you get throttled. If
  we set limits appropriately, they will also helpfully show up in the status
  bar via [jupyter-resource-usage](https://github.com/jupyter-server/jupyter-resource-usage).

- Resource *requests* are harder for end users to observe, as they are primarily
  meant for the *scheduler*, telling it how to pack user pods onto nodes for higher
  utilization. This has an 'oversubscription' factor, relying on the fact that
  most users don't actually use resources up to their limit. However, this factor
  varies from community to community, and must be carefully tuned. Users may use
  more resources than they are guaranteed *sometimes*, but then get their kernels
  killed or CPU throttled at *some other times*, based on what *other* users
  are doing. This inconsistent behavior is confusing to end users, so we should
  tune this carefully. (A concrete sketch of how limits and requests appear in a
  single profile choice follows this list.)
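
For reference, each generated profile choice ultimately boils down to a KubeSpawner
override carrying exactly these limits and requests. A minimal sketch (the field
names match what the script added in this PR emits; the numbers are placeholders):

```python
# One profile choice as consumed by KubeSpawner - values are illustrative placeholders.
choice = {
    "display_name": "4.0 GB RAM",
    "kubespawner_override": {
        "mem_guarantee": 4 * 2**30,  # request: what the scheduler packs nodes by
        "mem_limit": 4 * 2**30,      # limit: exceeding this gets the kernel killed
        "cpu_guarantee": 0.47,       # request: the CPU share the user is guaranteed
        "cpu_limit": 3.75,           # limit: exceeding this throttles, never kills
    },
}
```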

So in summary, there are two kinds of factors:

1. **Noticeable by users**
   1. Server startup time
   2. Memory Limit
   3. CPU Limit

2. **Noticeable by infrastructure & hub admins**:
   1. Cloud cost

The *variables* available to Infrastructure Engineers and hub admins to tune are:

1. Size of instances offered

2. "Oversubscription" factor for memory - this is ratio of memory
   limit to memory guarantee. If users are using memory > guarantee but <
   limit, they *may* get their kernels killed. Based on our knowledge of
   this community, we can tune this variable to reduce cloud cost while
   also reducing disruption in terms of kernels being killed

3. "Oversubscription" factor for CPU. This is easier to handle, as
   CPUs can be *throttled* easily. A user may use 4 CPUs for a minute,
   but then go back to 2 cpus next minute without anything being
   "killed". This is unlike memory, where memory once given can not be
   taken back. If a user is over the guarantee and another user who
   is *under* the guarantee needs the memory, the first users's
   kernel *will* be killed. Since this doesn't happen with CPUs, we can
   be more liberal in oversubscribing CPUs.
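
As a rough sketch of how such an oversubscription factor translates limits into
guarantees (the factor values below are purely illustrative, not recommendations):

```python
# Illustrative only: an oversubscription factor is the ratio of limit to guarantee.
mem_limit = 8 * 2**30           # 8 GiB memory limit, in bytes
mem_oversubscription = 2        # assume users rarely all hit their limit at once
mem_guarantee = mem_limit / mem_oversubscription  # what the scheduler packs nodes by

cpu_limit = 4.0
cpu_oversubscription = 8        # CPU only gets throttled, so we can be more liberal
cpu_guarantee = cpu_limit / cpu_oversubscription
```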

Goals
=====

The goals are the following:

1. Profile options should be *automatically* generated by a script,
   with various options to be tuned by whoever is running
   it. Engineers should have an easy time making these choices.

2. The *end user* should be able to easily understand the
   ramifications of the options they choose, and it should be visible to
   them *after* they start their notebook as well.

3. It's alright for users who want *more resources* to have to wait
   longer for a server start than users who want fewer resources. This
   provides an incentive to start with fewer resources and size up later.

Generating Choices
==================

This PR adds a new deployer command,
`generate-resource-allocation-choices`, to be run by an engineer
setting up a hub. It currently supports a *single* node type, and will
generate appropriate *Resource Allocation* choices based on a given
strategy. This PR implements one specific strategy - proportionate
memory allocation - that has been discussed as a good fit for the
Openscapes community (#2882) and might be useful for other communities
as well.
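
For example, once capacity information for an instance type has been recorded, an
engineer would run something like
`deployer generate-resource-allocation-choices r5.xlarge --num-allocations 5`
(exact entrypoint and option names as assumed here), and the generated choices are
printed as YAML to stdout.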

Proportionate Memory Allocation Strategy
========================================

Used primarily in research cases where:
1. Workloads are more memory constrained than CPU constrained
2. End users can be expected to select the appropriate amount of memory they need for a given
   workload, either from their own knowledge or as instructed by an instructor.

It features:
1. No memory overcommit at all, as end users are expected to ask for as much memory as
   they need.
2. CPU *guarantees* are proportional to the amount of memory guaranteed - the more memory you
   ask for, the more CPU you are guaranteed. This allows end users to pick resources purely
   based on memory, simplifying the mental model. It also allows for maximum packing of
   user pods onto a node, as we will *not* run out of CPU on a node before running out of
   memory.
3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure
   that users will not be starved of CPU.
4. Each choice the user can make has approximately half as many resources as the next largest
   choice, with the largest being a full node. This offers a decent compromise - if you pick
   the largest option, you will most likely have to wait for a full node spawn, while smaller
   options are much more likely to be shared. (A worked example follows this list.)
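
As a worked example using the capacity data committed in this PR: on an `r5.xlarge`,
about 29.7 GB of memory and 3.75 CPUs are available for user pods, so five choices
come out to roughly 1.9, 3.7, 7.4, 14.8 and 29.7 GB of RAM, with CPU guarantees of
roughly 0.23, 0.47, 0.94, 1.9 and 3.75 cores respectively, the CPU limit set to the
full 3.75 cores for every choice, and the smallest choice marked as the default.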

In the future, other strategies can be added and experimented with.

Node Capacity Information
=========================

To generate these choices, we must have Node Capacity Information -
particularly, exactly how much RAM and CPU is available for *user
pods* on nodes of a particular type. Instead of using heuristics
here, we calculate this *accurately*:

Available Resources = Node Capacity - System Components (kubelet,
systemd, etc.) - Daemonsets

A JSON file, `node-capacity-info.json`, holds this information and is
updated with the `update-node-capacity-info` command. This requires
a node of the given instance type to be actively running so we
can perform these calculations. It will need to be recalculated
every time we upgrade kubernetes (as system components might take
more resources) or adjust resource allocations for our daemonsets.

This has been generated in this PR for a couple of common instance types.
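For instance, the committed data shows that on an `n2-highmem-4` node, only about
3.2 of the 4 CPUs and roughly 27 of the 31.4 GiB of memory remain available for
user pods once system components and daemonsets are accounted for.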

TODO
====

- [ ] Documentation on how to update `node-capacity-info.json`
- [ ] Documentation on how to generate choices, and when to use
      these
- [ ] Documentation on how to choose the instance size

Co-authored-by: Erik Sundell <erik.i.sundell@gmail.com>
yuvipanda and consideRatio committed Aug 25, 2023
1 parent 8ba4f82 commit 33a2b30
Showing 5 changed files with 280 additions and 0 deletions.
2 changes: 2 additions & 0 deletions deployer/__main__.py
@@ -10,6 +10,8 @@
import deployer.grafana.central_grafana # noqa: F401
import deployer.grafana.grafana_tokens # noqa: F401
import deployer.keys.decrypt_age # noqa: F401
import deployer.resource_allocation.generate_choices # noqa: F401
import deployer.resource_allocation.update_nodeinfo # noqa: F401

from .cli_app import app

0 changes: 0 additions & 0 deletions deployer/resource_allocation/__init__.py
Empty file.
117 changes: 117 additions & 0 deletions deployer/resource_allocation/generate_choices.py
@@ -0,0 +1,117 @@
import json
import sys
from enum import Enum
from pathlib import Path

import typer
from ruamel.yaml import YAML

from ..cli_app import app

yaml = YAML(typ="rt")

HERE = Path(__file__).parent


class ResourceAllocationStrategies(str, Enum):
    PROPORTIONAL_MEMORY_STRATEGY = "proportional-memory-strategy"


def proportional_memory_strategy(nodeinfo: dict, num_allocations: int):
    """
    Generate choices for resource allocation based on proportional changes to memory

    Used primarily in research cases where:
    1. Workloads are more memory constrained than CPU constrained
    2. End users can be expected to select the appropriate amount of memory they need for a
       given workload, either from their own knowledge or as instructed by an instructor.

    It features:
    1. No memory overcommit at all, as end users are expected to ask for as much memory as
       they need.
    2. CPU *guarantees* are proportional to the amount of memory guaranteed - the more memory
       you ask for, the more CPU you are guaranteed. This allows end users to pick resources
       purely based on memory, simplifying the mental model. It also allows for maximum
       packing of user pods onto a node, as we will *not* run out of CPU on a node before
       running out of memory.
    3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure
       that users will not be starved of CPU.
    4. Each choice the user can make has approximately half as many resources as the next
       largest choice, with the largest being a full node. This offers a decent compromise -
       if you pick the largest option, you will most likely have to wait for a full node
       spawn, while smaller options are much more likely to be shared.
    """

    # We operate on *available* memory, which already accounts for system components (like
    # kubelet & systemd) as well as daemonsets we run on every node. This represents the
    # resources that are available for user pods.
    available_node_mem = nodeinfo["available"]["memory"]
    available_node_cpu = nodeinfo["available"]["cpu"]

    # We always start from the top, and provide a choice that takes up the whole node.
    mem_limit = available_node_mem

    choices = {}
    for i in range(num_allocations):
        # CPU guarantee is proportional to the memory limit for this particular choice.
        # This makes sure we utilize all the memory on a node all the time.
        cpu_guarantee = (mem_limit / available_node_mem) * available_node_cpu

        # Memory is in bytes, let's convert it to GB to display
        mem_display = mem_limit / 1024 / 1024 / 1024
        display_name = f"{mem_display:.1f} GB RAM"

        choice = {
            "display_name": display_name,
            "kubespawner_override": {
                # Guarantee and Limit are the same - this strategy has no oversubscription
                "mem_guarantee": int(mem_limit),
                "mem_limit": int(mem_limit),
                "cpu_guarantee": cpu_guarantee,
                # CPU limit is set to the entire available CPU of the node, making sure no
                # single user can starve the node of critical kubelet / systemd resources.
                # Leaving it unset sets it to the same as the guarantee, which we do not want.
                "cpu_limit": available_node_cpu,
            },
        }
        # Keys run from largest (mem_<num_allocations>) down to smallest (mem_1) at this point
        choices[f"mem_{num_allocations - i}"] = choice

        # Halve the mem_limit for the next choice
        mem_limit = mem_limit / 2

    # Reverse the choices so the smallest one is first
    choices = dict(reversed(choices.items()))

    # Make the smallest choice the default explicitly
    choices[list(choices.keys())[0]]["default"] = True

    return choices


@app.command()
def generate_resource_allocation_choices(
    instance_type: str = typer.Argument(
        ..., help="Instance type to generate Resource Allocation options for"
    ),
    num_allocations: int = typer.Option(5, help="Number of choices to generate"),
    strategy: ResourceAllocationStrategies = typer.Option(
        ResourceAllocationStrategies.PROPORTIONAL_MEMORY_STRATEGY,
        help="Strategy to use for generating resource allocation choices",
    ),
):
    with open(HERE / "node-capacity-info.json") as f:
        nodeinfo = json.load(f)

    if instance_type not in nodeinfo:
        print(
            f"Capacity information about {instance_type} not available", file=sys.stderr
        )
        print("TODO: Provide information on how to update it", file=sys.stderr)
        sys.exit(1)

    # Call the appropriate function based on what strategy we want to use
    if strategy == ResourceAllocationStrategies.PROPORTIONAL_MEMORY_STRATEGY:
        choices = proportional_memory_strategy(nodeinfo[instance_type], num_allocations)
    else:
        raise ValueError(f"Strategy {strategy} is not currently supported")
    yaml.dump(choices, sys.stdout)
32 changes: 32 additions & 0 deletions deployer/resource_allocation/node-capacity-info.json
@@ -0,0 +1,32 @@
{
    "r5.xlarge": {
        "capacity": {
            "cpu": 4.0,
            "memory": 33186611200
        },
        "available": {
            "cpu": 3.75,
            "memory": 31883231232
        }
    },
    "r5.16xlarge": {
        "capacity": {
            "cpu": 64.0,
            "memory": 535146246144
        },
        "available": {
            "cpu": 63.6,
            "memory": 526011052032
        }
    },
    "n2-highmem-4": {
        "capacity": {
            "cpu": 4.0,
            "memory": 33670004736
        },
        "available": {
            "cpu": 3.196,
            "memory": 28975529984
        }
    }
}
129 changes: 129 additions & 0 deletions deployer/resource_allocation/update_nodeinfo.py
@@ -0,0 +1,129 @@
import json
import subprocess
from pathlib import Path

import typer
from kubernetes.utils.quantity import parse_quantity
from ruamel.yaml import YAML

from ..cli_app import app

HERE = Path(__file__).parent

yaml = YAML(typ="rt")


def get_node_capacity_info(instance_type: str):
    # Get the full spec of all nodes with this instance_type
    nodes = json.loads(
        subprocess.check_output(
            [
                "kubectl",
                "get",
                "node",
                "-l",
                f"node.kubernetes.io/instance-type={instance_type}",
                "-o",
                "json",
            ]
        ).decode()
    )

    if not nodes.get("items"):
        # No nodes with given instance_type found!
        # A node with this instance_type needs to be actively running for us to accurately
        # calculate how many resources are available, as the calculation relies on the
        # non-jupyter pods running at that time.
        raise ValueError(
            f"No nodes with instance-type={instance_type} found in current kubernetes cluster"
        )

    # Just pick one node
    node = nodes["items"][0]

    # This is the total amount of RAM and CPU on the node.
    capacity = node["status"]["capacity"]
    cpu_capacity = parse_quantity(capacity["cpu"])
    mem_capacity = parse_quantity(capacity["memory"])

    # Total amount of RAM and CPU available to kubernetes as a whole.
    # This accounts for things running on the node, such as kubelet, the
    # container runtime, systemd, etc. This does *not* account for daemonsets
    # and pods running on the kubernetes cluster.
    allocatable = node["status"]["allocatable"]
    cpu_allocatable = parse_quantity(allocatable["cpu"])
    mem_allocatable = parse_quantity(allocatable["memory"])

    # Find all pods running on this node
    all_pods = json.loads(
        subprocess.check_output(
            [
                "kubectl",
                "get",
                "pod",
                "-A",
                "--field-selector",
                f'spec.nodeName={node["metadata"]["name"]}',
                "-o",
                "json",
            ]
        ).decode()
    )["items"]

    # Filter out jupyterhub user pods
    # TODO: Filter out dask scheduler and worker pods
    pods = [
        p
        for p in all_pods
        if p["metadata"]["labels"].get("component") not in ("singleuser-server",)
    ]

    # This is the amount of resources available for our workloads - jupyter and dask.
    # We start with the allocatable resources, and subtract the resource *requests*
    # of all the pods running on this node, primarily from kube-system and
    # support. The amount left over is what is available for the *scheduler* to put user pods
    # on to.
    cpu_available = cpu_allocatable
    mem_available = mem_allocatable

    for p in pods:
        mem_request = 0
        cpu_request = 0
        # Iterate through all the containers in the pod, and count the memory & cpu requests
        # they make. We don't count initContainers' requests as they don't overlap with the
        # container requests at any point.
        for c in p["spec"]["containers"]:
            mem_request += parse_quantity(
                c.get("resources", {}).get("requests", {}).get("memory", "0")
            )
            cpu_request += parse_quantity(
                c.get("resources", {}).get("requests", {}).get("cpu", "0")
            )
        cpu_available -= cpu_request
        mem_available -= mem_request

    return {
        # CPU units are fractional cores, while memory units are bytes
        "capacity": {"cpu": float(cpu_capacity), "memory": int(mem_capacity)},
        "available": {"cpu": float(cpu_available), "memory": int(mem_available)},
    }


@app.command()
def update_node_capacity_info(
    instance_type: str = typer.Argument(
        ..., help="Instance type to update node capacity information for"
    ),
):
    try:
        with open(HERE / "node-capacity-info.json") as f:
            instances_info = json.load(f)
    except FileNotFoundError:
        instances_info = {}
    node_capacity = get_node_capacity_info(instance_type)

    instances_info[instance_type] = node_capacity
    with open(HERE / "node-capacity-info.json", "w") as f:
        json.dump(instances_info, f, indent=4)

    print(f"Updated node-capacity-info.json for {instance_type}")
