ML environments on GPU instances #675
Replies: 6 comments 8 replies
-
Here is the relevant documentation: https://github.com/Quansight/qhub/blob/main/docs/source/04_how_to_guides/7_qhub_gpu.md#amazon-web-services
You'll create a new node group as described in that documentation.
You can then create a new JupyterLab profile that targets the GPU node, so users have a separate GPU profile they can select, and it will always run on the GPU node. The documentation for that is in the link above.
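For example, here is a minimal sketch of what that could look like in nebari-config.yaml. The node group name `gpu-tesla-t4` and the resource numbers are illustrative assumptions; a full working config is shared later in this thread.

```yaml
amazon_web_services:
  node_groups:
    # Illustrative GPU node group; pick the instance type you actually need.
    gpu-tesla-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
profiles:
  jupyterlab:
    # Separate profile that always lands on the GPU node group.
    - display_name: GPU Instance
      description: 4 cpu / 16 GB RAM / 1 NVIDIA T4 GPU
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-tesla-t4"
```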
-
Are you able to spin up a GPU instance but it doesn't work in Python? i.e. after spinning up a GPU instance, is the GPU not visible from Python? Also, I assume you are on AWS.
-
@dharhas, no, we were not sure how to set up the nebari-config to allow users to select either a GPU-driven or CPU-driven server. Would GPU and CPU be different jhub-apps?
-
OK, let me share a config from an AWS deployment; I need to sanitize it first.
-
```yaml
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          "dedicated": "fly-weight"
    - display_name: Medium Instance
      description: Stable environment with 2-4 cpu / 8-12 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 12G
        mem_guarantee: 8G
        node_selector:
          "dedicated": "middle-weight"
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-1x-t4"
```

The `extra_container_config` is important: PyTorch requires `/dev/shm` larger than 1 GB when you use multiple GPUs; I'm not sure it matters for a single GPU. You also have to specify the number of GPUs in `extra_resource_limits`, and that number needs to match the number of GPUs on the instance type you selected. We will work on getting this into the docs.

A secondary issue is that the released conda-store currently can't install GPU versions of PyTorch from conda-forge because we are not able to set environment variables (see conda-incubator/conda-store#759); that will be fixed in the upcoming release. Until then you need to get PyTorch using the following pinning:

```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - pytorch::pytorch
  - pytorch::pytorch-cuda
  - etc
```
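The same pattern extends to multi-GPU nodes. As an illustrative sketch only (the `gpu-4x-t4` node group, instance type, and sizes here are assumptions, not part of the deployment above): the GPU count in `extra_resource_limits` has to match the selected instance, and `/dev/shm` should be sized up for multi-GPU PyTorch.

```yaml
amazon_web_services:
  node_groups:
    # Assumed multi-GPU node group; g4dn.12xlarge provides 4 NVIDIA T4 GPUs.
    gpu-4x-t4:
      instance: g4dn.12xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: G4 GPU Instance 4x
      description: 48 cpu / 192 GB RAM / 4 Nvidia T4 GPUs
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                # Multi-GPU PyTorch data loading needs /dev/shm larger than 1 GB.
                sizeLimit: "8Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          # Must match the number of GPUs on the selected instance type.
          nvidia.com/gpu: 4
        node_selector:
          "dedicated": "gpu-4x-t4"
```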
-
Thanks to @pt247 and @marcelovilla who helped me get this going!
-
We have some users who would like to run their ML workflows on a single GPU using `tensorflow-gpu`. I can create a custom environment with that package, but how can I specify the GPU instance type needed to run it? Our config for AWS looks like below, but if I change the user group to, say, a `g4dn.4xlarge` instance, then all users will get it, which is of course not what I want. Is there a way for users to choose the instance type as well as the environment at server launch time?