Add script to generate nodeshare choices
When the end user looks at the profile list, it needs to be presented in
such a way that they can make an informed choice about what to select, and
understand what behavior is triggered whenever their usage goes over the
numbers they selected.

Factors
=======

- Server startup time! If everyone gets an instance just for themselves, servers
  take forever to start. Usually, many users are active at the same time, and we
  can decrease server startup time by putting many users on the same machine in
  a way that they don't step on each other's feet.

- Cloud cost. If we pick really large machines, fewer scale-up events need to
  be triggered, so server startup is much faster. However, we pay for instances
  regardless of how 'full' they are, so if we have a 64GB instance that only has
  1GB used, we're paying extra for that. So a trade-off has to be made on
  *machine size*. This can be quantified, though, which helps in making the trade-off.

- Resource *limits*, which the end user can consistently observe. These are
  easy to explain to end users - if you go over the memory limit, your kernel
  dies. If you go over the CPU limit, well, you can't - you get throttled. If
  we set limits appropriately, they will also helpfully show up in the status
  bar via [jupyter-resource-usage](https://github.com/jupyter-server/jupyter-resource-usage).

- Resource *requests* are harder for end users to observe, as they are primarily
  meant for the *scheduler*, telling it how to pack user pods onto nodes for higher
  utilization. This has an 'oversubscription' factor, relying on the fact that
  most users don't actually use resources up to their limit. However, this factor
  varies from community to community, and must be carefully tuned. Users may use
  more resources than they are guaranteed *sometimes*, but then get their kernels
  killed or CPU throttled at *some other times*, based on what *other* users
  are doing. This inconsistent behavior is confusing to end users, so we should
  tune this carefully. (A concrete sketch of how limits and requests appear in a
  single profile choice follows this list.)
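
For reference, each generated profile choice ultimately boils down to a KubeSpawner
override carrying exactly these limits and requests. A minimal sketch (the field
names match what the script added in this PR emits; the numbers are placeholders):

```python
# One profile choice as consumed by KubeSpawner - values are illustrative placeholders.
choice = {
    "display_name": "4.0 GB RAM",
    "kubespawner_override": {
        "mem_guarantee": 4 * 2**30,  # request: what the scheduler packs nodes by
        "mem_limit": 4 * 2**30,      # limit: exceeding this gets the kernel killed
        "cpu_guarantee": 0.47,       # request: the CPU share the user is guaranteed
        "cpu_limit": 3.75,           # limit: exceeding this throttles, never kills
    },
}
```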

So in summary, there are two kinds of factors:

1. **Noticeable by users**
   1. Server startup time
   2. Memory Limit
   3. CPU Limit

2. **Noticeable by infrastructure & hub admins**:
   1. Cloud cost

The *variables* available to Infrastructure Engineers and hub admins to tune are:

1. Size of instances offered

2. "Oversubscription" factor for memory - this is ratio of memory
   limit to memory guarantee. If users are using memory > guarantee but <
   limit, they *may* get their kernels killed. Based on our knowledge of
   this community, we can tune this variable to reduce cloud cost while
   also reducing disruption in terms of kernels being killed

3. "Oversubscription" factor for CPU. This is easier to handle, as
   CPUs can be *throttled* easily. A user may use 4 CPUs for a minute,
   but then go back to 2 cpus next minute without anything being
   "killed". This is unlike memory, where memory once given can not be
   taken back. If a user is over the guarantee and another user who
   is *under* the guarantee needs the memory, the first users's
   kernel *will* be killed. Since this doesn't happen with CPUs, we can
   be more liberal in oversubscribing CPUs.
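
As a rough sketch of how such an oversubscription factor translates limits into
guarantees (the factor values below are purely illustrative, not recommendations):

```python
# Illustrative only: an oversubscription factor is the ratio of limit to guarantee.
mem_limit = 8 * 2**30           # 8 GiB memory limit, in bytes
mem_oversubscription = 2        # assume users rarely all hit their limit at once
mem_guarantee = mem_limit / mem_oversubscription  # what the scheduler packs nodes by

cpu_limit = 4.0
cpu_oversubscription = 8        # CPU only gets throttled, so we can be more liberal
cpu_guarantee = cpu_limit / cpu_oversubscription
```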

Goals
=====

The goals are the following:

1. Profile options should be *automatically* generated by a script,
   with various options to be tuned by whoever is running
   it. Engineers should have an easy time making these choices.

2. The *end user* should be able to easily understand the
   ramifications of the options they choose, and it should be visible to
   them *after* they start their notebook as well.

3. It's alright for users who want *more resources* to have to wait
   longer for a server start than users who want fewer resources. This
   provides an incentive to start with fewer resources and size up later.

Generating Choices
==================

This PR adds a new deployer command,
`generate-resource-allocation-choices`, to be run by an engineer
setting up a hub. It currently supports a *single* node type, and will
generate appropriate *Resource Allocation* choices based on a given
strategy. This PR implements one specific strategy - proportionate
memory allocation - that has been discussed as a good fit for the
Openscapes community (#2882) and might be useful for other communities
as well.
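
For example, once capacity information for an instance type has been recorded, an
engineer would run something like
`deployer generate-resource-allocation-choices r5.xlarge --num-allocations 5`
(exact entrypoint and option names as assumed here), and the generated choices are
printed as YAML to stdout.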

Proportionate Memory Allocation Strategy
========================================

Used primarily in research cases where:
1. Workloads are more memory constrained than CPU constrained
2. End users can be expected to select the appropriate amount of memory they need for a given
   workload, either from their own knowledge or as instructed by an instructor.

It features:
1. No memory overcommit at all, as end users are expected to ask for as much memory as
   they need.
2. CPU *guarantees* are proportional to the amount of memory guaranteed - the more memory you
   ask for, the more CPU you are guaranteed. This allows end users to pick resources purely
   based on memory, simplifying the mental model. It also allows for maximum packing of
   user pods onto a node, as we will *not* run out of CPU on a node before running out of
   memory.
3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure
   that users will not be starved of CPU.
4. Each choice the user can make has approximately half as many resources as the next largest
   choice, with the largest being a full node. This offers a decent compromise - if you pick
   the largest option, you will most likely have to wait for a full node spawn, while smaller
   options are much more likely to be shared. (A worked example follows this list.)
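
As a worked example using the capacity data committed in this PR: on an `r5.xlarge`,
about 29.7 GB of memory and 3.75 CPUs are available for user pods, so five choices
come out to roughly 1.9, 3.7, 7.4, 14.8 and 29.7 GB of RAM, with CPU guarantees of
roughly 0.23, 0.47, 0.94, 1.9 and 3.75 cores respectively, the CPU limit set to the
full 3.75 cores for every choice, and the smallest choice marked as the default.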

In the future, other strategies can be added and experimented with.

Node Capacity Information
=========================

To generate these choices, we must have Node Capacity Information -
particularly, exactly how much RAM and CPU is available for *user
pods* on nodes of a particular type. Instead of using heuristics
here, we calculate this *accurately*:

Available Resources = Node Capacity - System Components (kubelet,
systemd, etc.) - Daemonsets

A JSON file, `node-capacity-info.json`, holds this information and is
updated with the `update-node-capacity-info` command. This requires
a node of the given instance type to be actively running so we
can perform these calculations. It will need to be recalculated
every time we upgrade kubernetes (as system components might take
more resources) or adjust resource allocations for our daemonsets.

This has been generated in this PR for a couple of common instance types.
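For instance, the committed data shows that on an `n2-highmem-4` node, only about
3.2 of the 4 CPUs and roughly 27 of the 31.4 GiB of memory remain available for
user pods once system components and daemonsets are accounted for.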

TODO
====

- [ ] Documentation on how to update `node-capacity-info.json`
- [ ] Documentation on how to generate choices, and when to use
      these
- [ ] Documentation on how to choose the instance size

Co-authored-by: Erik Sundell <erik.i.sundell@gmail.com>
yuvipanda and consideRatio committed Aug 25, 2023
1 parent 8ba4f82 commit 33a2b30
Showing 5 changed files with 280 additions and 0 deletions.
2 changes: 2 additions & 0 deletions deployer/__main__.py
@@ -10,6 +10,8 @@
import deployer.grafana.central_grafana # noqa: F401
import deployer.grafana.grafana_tokens # noqa: F401
import deployer.keys.decrypt_age # noqa: F401
import deployer.resource_allocation.generate_choices # noqa: F401
import deployer.resource_allocation.update_nodeinfo # noqa: F401

from .cli_app import app

0 changes: 0 additions & 0 deletions deployer/resource_allocation/__init__.py
Empty file.
117 changes: 117 additions & 0 deletions deployer/resource_allocation/generate_choices.py
@@ -0,0 +1,117 @@
import json
import sys
from enum import Enum
from pathlib import Path

import typer
from ruamel.yaml import YAML

from ..cli_app import app

yaml = YAML(typ="rt")

HERE = Path(__file__).parent


class ResourceAllocationStrategies(str, Enum):
    PROPORTIONAL_MEMORY_STRATEGY = "proportional-memory-strategy"


def proportional_memory_strategy(nodeinfo: dict, num_allocations: int):
    """
    Generate choices for resource allocation based on proportional changes to memory

    Used primarily in research cases where:
    1. Workloads are more memory constrained than CPU constrained
    2. End users can be expected to select the appropriate amount of memory they need for a
       given workload, either from their own knowledge or as instructed by an instructor.

    It features:
    1. No memory overcommit at all, as end users are expected to ask for as much memory as
       they need.
    2. CPU *guarantees* are proportional to the amount of memory guaranteed - the more memory
       you ask for, the more CPU you are guaranteed. This allows end users to pick resources
       purely based on memory, simplifying the mental model. It also allows for maximum
       packing of user pods onto a node, as we will *not* run out of CPU on a node before
       running out of memory.
    3. No CPU limits at all, as CPU is a more flexible resource. The CPU guarantee will ensure
       that users will not be starved of CPU.
    4. Each choice the user can make has approximately half as many resources as the next
       largest choice, with the largest being a full node. This offers a decent compromise -
       if you pick the largest option, you will most likely have to wait for a full node
       spawn, while smaller options are much more likely to be shared.
    """

    # We operate on *available* memory, which already accounts for system components (like
    # kubelet & systemd) as well as daemonsets we run on every node. This represents the
    # resources that are available for user pods.
    available_node_mem = nodeinfo["available"]["memory"]
    available_node_cpu = nodeinfo["available"]["cpu"]

    # We always start from the top, and provide a choice that takes up the whole node.
    mem_limit = available_node_mem

    choices = {}
    for i in range(num_allocations):
        # CPU guarantee is proportional to the memory limit for this particular choice.
        # This makes sure we utilize all the memory on a node all the time.
        cpu_guarantee = (mem_limit / available_node_mem) * available_node_cpu

        # Memory is in bytes, let's convert it to GB to display
        mem_display = mem_limit / 1024 / 1024 / 1024
        display_name = f"{mem_display:.1f} GB RAM"

        choice = {
            "display_name": display_name,
            "kubespawner_override": {
                # Guarantee and Limit are the same - this strategy has no oversubscription
                "mem_guarantee": int(mem_limit),
                "mem_limit": int(mem_limit),
                "cpu_guarantee": cpu_guarantee,
                # CPU limit is set to the entire available CPU of the node, making sure no
                # single user can starve the node of critical kubelet / systemd resources.
                # Leaving it unset sets it to the same as the guarantee, which we do not want.
                "cpu_limit": available_node_cpu,
            },
        }
        # Keys run from largest (mem_<num_allocations>) down to smallest (mem_1) at this point
        choices[f"mem_{num_allocations - i}"] = choice

        # Halve the mem_limit for the next choice
        mem_limit = mem_limit / 2

    # Reverse the choices so the smallest one is first
    choices = dict(reversed(choices.items()))

    # Make the smallest choice the default explicitly
    choices[list(choices.keys())[0]]["default"] = True

    return choices


@app.command()
def generate_resource_allocation_choices(
    instance_type: str = typer.Argument(
        ..., help="Instance type to generate Resource Allocation options for"
    ),
    num_allocations: int = typer.Option(5, help="Number of choices to generate"),
    strategy: ResourceAllocationStrategies = typer.Option(
        ResourceAllocationStrategies.PROPORTIONAL_MEMORY_STRATEGY,
        help="Strategy to use for generating resource allocation choices",
    ),
):
    with open(HERE / "node-capacity-info.json") as f:
        nodeinfo = json.load(f)

    if instance_type not in nodeinfo:
        print(
            f"Capacity information about {instance_type} not available", file=sys.stderr
        )
        print("TODO: Provide information on how to update it", file=sys.stderr)
        sys.exit(1)

    # Call the appropriate function based on what strategy we want to use
    if strategy == ResourceAllocationStrategies.PROPORTIONAL_MEMORY_STRATEGY:
        choices = proportional_memory_strategy(nodeinfo[instance_type], num_allocations)
    else:
        raise ValueError(f"Strategy {strategy} is not currently supported")
    yaml.dump(choices, sys.stdout)
32 changes: 32 additions & 0 deletions deployer/resource_allocation/node-capacity-info.json
@@ -0,0 +1,32 @@
{
    "r5.xlarge": {
        "capacity": {
            "cpu": 4.0,
            "memory": 33186611200
        },
        "available": {
            "cpu": 3.75,
            "memory": 31883231232
        }
    },
    "r5.16xlarge": {
        "capacity": {
            "cpu": 64.0,
            "memory": 535146246144
        },
        "available": {
            "cpu": 63.6,
            "memory": 526011052032
        }
    },
    "n2-highmem-4": {
        "capacity": {
            "cpu": 4.0,
            "memory": 33670004736
        },
        "available": {
            "cpu": 3.196,
            "memory": 28975529984
        }
    }
}
129 changes: 129 additions & 0 deletions deployer/resource_allocation/update_nodeinfo.py
@@ -0,0 +1,129 @@
import json
import subprocess
from pathlib import Path

import typer
from kubernetes.utils.quantity import parse_quantity
from ruamel.yaml import YAML

from ..cli_app import app

HERE = Path(__file__).parent

yaml = YAML(typ="rt")


def get_node_capacity_info(instance_type: str):
    # Get the full spec of all nodes with this instance_type
    nodes = json.loads(
        subprocess.check_output(
            [
                "kubectl",
                "get",
                "node",
                "-l",
                f"node.kubernetes.io/instance-type={instance_type}",
                "-o",
                "json",
            ]
        ).decode()
    )

    if not nodes.get("items"):
        # No nodes with given instance_type found!
        # A node with this instance_type needs to be actively running for us to accurately
        # calculate how many resources are available, as the calculation relies on the
        # non-jupyter pods running at that time.
        raise ValueError(
            f"No nodes with instance-type={instance_type} found in current kubernetes cluster"
        )

    # Just pick one node
    node = nodes["items"][0]

    # This is the total amount of RAM and CPU on the node.
    capacity = node["status"]["capacity"]
    cpu_capacity = parse_quantity(capacity["cpu"])
    mem_capacity = parse_quantity(capacity["memory"])

    # Total amount of RAM and CPU available to kubernetes as a whole.
    # This accounts for things running on the node, such as kubelet, the
    # container runtime, systemd, etc. This does *not* account for daemonsets
    # and pods running on the kubernetes cluster.
    allocatable = node["status"]["allocatable"]
    cpu_allocatable = parse_quantity(allocatable["cpu"])
    mem_allocatable = parse_quantity(allocatable["memory"])

    # Find all pods running on this node
    all_pods = json.loads(
        subprocess.check_output(
            [
                "kubectl",
                "get",
                "pod",
                "-A",
                "--field-selector",
                f'spec.nodeName={node["metadata"]["name"]}',
                "-o",
                "json",
            ]
        ).decode()
    )["items"]

    # Filter out jupyterhub user pods
    # TODO: Filter out dask scheduler and worker pods
    pods = [
        p
        for p in all_pods
        if p["metadata"]["labels"].get("component") not in ("singleuser-server",)
    ]

    # This is the amount of resources available for our workloads - jupyter and dask.
    # We start with the allocatable resources, and subtract the resource *requests*
    # of all the pods running on this node, primarily from kube-system and
    # support. The amount left over is what is available for the *scheduler* to put user pods
    # on to.
    cpu_available = cpu_allocatable
    mem_available = mem_allocatable

    for p in pods:
        mem_request = 0
        cpu_request = 0
        # Iterate through all the containers in the pod, and count the memory & cpu requests
        # they make. We don't count initContainers' requests as they don't overlap with the
        # container requests at any point.
        for c in p["spec"]["containers"]:
            mem_request += parse_quantity(
                c.get("resources", {}).get("requests", {}).get("memory", "0")
            )
            cpu_request += parse_quantity(
                c.get("resources", {}).get("requests", {}).get("cpu", "0")
            )
        cpu_available -= cpu_request
        mem_available -= mem_request

    return {
        # CPU units are fractional cores, while memory units are bytes
        "capacity": {"cpu": float(cpu_capacity), "memory": int(mem_capacity)},
        "available": {"cpu": float(cpu_available), "memory": int(mem_available)},
    }


@app.command()
def update_node_capacity_info(
    instance_type: str = typer.Argument(
        ..., help="Instance type to update node capacity information for"
    ),
):
    try:
        with open(HERE / "node-capacity-info.json") as f:
            instances_info = json.load(f)
    except FileNotFoundError:
        instances_info = {}
    node_capacity = get_node_capacity_info(instance_type)

    instances_info[instance_type] = node_capacity
    with open(HERE / "node-capacity-info.json", "w") as f:
        json.dump(instances_info, f, indent=4)

    print(f"Updated node-capacity-info.json for {instance_type}")
