ML environments on GPU instances #675
Replies: 6 comments 8 replies
-
Here is the relevant documentation: https://github.com/Quansight/qhub/blob/main/docs/source/04_how_to_guides/7_qhub_gpu.md#amazon-web-services
You'll create a new node group as described in that documentation.
You can then create a new JupyterLab profile that targets the GPU node, so users have a separate GPU profile they can select, and it will always run on the GPU node. The documentation for that is in the link above.
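For example, here is a minimal sketch of what that could look like in nebari-config.yaml. The node group name `gpu-tesla-t4` and the resource numbers are illustrative assumptions; a full working config is shared later in this thread.

```yaml
amazon_web_services:
  node_groups:
    # Illustrative GPU node group; pick the instance type you actually need.
    gpu-tesla-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
profiles:
  jupyterlab:
    # Separate profile that always lands on the GPU node group.
    - display_name: GPU Instance
      description: 4 cpu / 16 GB RAM / 1 NVIDIA T4 GPU
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-tesla-t4"
```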
-
Are you able to spin up a GPU instance but it doesn't work in Python? i.e. after spinning up a GPU instance, is the GPU not visible from Python? Also, I assume you are on AWS.
-
@dharhas, no, we were not sure how to set up the nebari-config to allow users to select either a GPU-driven or CPU-driven server. Would GPU and CPU be different jhub-apps?
-
OK, let me share a config from an AWS deployment; I need to sanitize it first.
-
```yaml
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          "dedicated": "fly-weight"
    - display_name: Medium Instance
      description: Stable environment with 2-4 cpu / 8-12 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 12G
        mem_guarantee: 8G
        node_selector:
          "dedicated": "middle-weight"
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-1x-t4"
```

The `extra_container_config` is important: PyTorch requires `/dev/shm` larger than 1 GB when you use multiple GPUs; I'm not sure it matters for a single GPU. You also have to specify the number of GPUs in `extra_resource_limits`, and that number needs to match the number of GPUs on the instance type you selected. We will work on getting this into the docs.

A secondary issue is that the released conda-store currently can't install GPU versions of PyTorch from conda-forge because we are not able to set environment variables (see conda-incubator/conda-store#759); that will be fixed in the upcoming release. Until then you need to get PyTorch using the following pinning:

```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - pytorch::pytorch
  - pytorch::pytorch-cuda
  - etc
```
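The same pattern extends to multi-GPU nodes. As an illustrative sketch only (the `gpu-4x-t4` node group, instance type, and sizes here are assumptions, not part of the deployment above): the GPU count in `extra_resource_limits` has to match the selected instance, and `/dev/shm` should be sized up for multi-GPU PyTorch.

```yaml
amazon_web_services:
  node_groups:
    # Assumed multi-GPU node group; g4dn.12xlarge provides 4 NVIDIA T4 GPUs.
    gpu-4x-t4:
      instance: g4dn.12xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: G4 GPU Instance 4x
      description: 48 cpu / 192 GB RAM / 4 Nvidia T4 GPUs
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                # Multi-GPU PyTorch data loading needs /dev/shm larger than 1 GB.
                sizeLimit: "8Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          # Must match the number of GPUs on the selected instance type.
          nvidia.com/gpu: 4
        node_selector:
          "dedicated": "gpu-4x-t4"
```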
-
Thanks to @pt247 and @marcelovilla who helped me get this going!
-
We have some users who would like to run their ML workflows on a single GPU using `tensorflow-gpu`. I can create a custom environment with that package, but how can I specify the GPU instance type needed to run it? Our config for AWS looks like below, but if I change the user group to, say, a `g4dn.4xlarge` instance, then all users will get it, which is of course not what I want. Is there a way for users to choose the instance type as well as the environment at server launch time?