Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] support to use custom gpu resource name if it's not nvidia.com/gpu #4337

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nkwangleiGIT
Copy link
Contributor

@nkwangleiGIT nkwangleiGIT commented Nov 12, 2024

We're using customized gpu resource name when using the device plugin, so we have the following GPUs in the node capability:

nvidia.com/gpu-h100
nvidia.com/gpu-h20
nvidia.com/gpu-4090

So we have to use the resource names above when launch to local K8S, such as

sky launch --image-id skypilot:20240613 --cpus 8 --memory 32 --gpus gpu-3090:2 -c my-sky-cluster --cloud kubernetes

So this PR will support to use CUSTOM_GPU_RESOURCE_NAME from environment variable to overwrite the default nvidia.com/gpu in the resources.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and welcome to SkyPilot @nkwangleiGIT!

Comment on lines 702 to 722
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
"""Get the GPU resource name.

The function first checks for an environment variable.
If defined, it uses its value; otherwise, it returns the default value.

Args:
name (str): Default GPU resource name, default is "nvidia.com/gpu".

Returns:
str: The selected GPU resource name.
"""
# retrieve GPU resource name from environment variable,
# can be nvidia.com/gpu-h100 if it's customized by the device plugin
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_NAME')

# If the environment variable is not defined, return the default name
if custom_name is None:
return name

return custom_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to sky.provision.kubernetes.utils since only k8s relies on this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to sky.provision.kubernetes.utils

Comment on lines 714 to 715
# retrieve GPU resource name from environment variable,
# can be nvidia.com/gpu-h100 if it's customized by the device plugin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
# retrieve GPU resource name from environment variable,
# can be nvidia.com/gpu-h100 if it's customized by the device plugin
# Retrieve GPU resource name from environment variable, if set.
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc.



def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
"""Get the GPU resource name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Get the GPU resource name.
"""Get the GPU resource name to use in kubernetes.

@@ -697,3 +697,26 @@ def truncate_long_string(s: str, max_length: int = 35) -> str:
if len(prefix) < max_length:
prefix += s[len(prefix):max_length]
return prefix + '...'


def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having name as a arg, we probably want to move the hardcoded string to the top of kubernetes_utils.py as DEFAULT_GPU_RESOURCE = 'nvidia.com/gpu' and use that in the method if the envvar does not exist.

Suggested change
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
def get_gpu_resource_name():

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 1860 to 1866
total={
common_utils.get_gpu_resource_name(): int(accelerator_count)
},
free={
common_utils.get_gpu_resource_name():
int(accelerators_available)
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaner to invoke once:

Suggested change
total={
common_utils.get_gpu_resource_name(): int(accelerator_count)
},
free={
common_utils.get_gpu_resource_name():
int(accelerators_available)
})
gpu_resource = common_utils.get_gpu_resource_name()
total={gpu_resource: int(accelerator_count)},
free={gpu_resource: int(accelerators_available)
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants