-
Notifications
You must be signed in to change notification settings - Fork 509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[k8s] support to use custom gpu resource name if it's not nvidia.com/gpu #4337
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and welcome to SkyPilot @nkwangleiGIT!
sky/utils/common_utils.py
Outdated
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): | ||
"""Get the GPU resource name. | ||
|
||
The function first checks for an environment variable. | ||
If defined, it uses its value; otherwise, it returns the default value. | ||
|
||
Args: | ||
name (str): Default GPU resource name, default is "nvidia.com/gpu". | ||
|
||
Returns: | ||
str: The selected GPU resource name. | ||
""" | ||
# retrieve GPU resource name from environment variable, | ||
# can be nvidia.com/gpu-h100 if it's customized by the device plugin | ||
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_NAME') | ||
|
||
# If the environment variable is not defined, return the default name | ||
if custom_name is None: | ||
return name | ||
|
||
return custom_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to sky.provision.kubernetes.utils
since only k8s relies on this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to sky.provision.kubernetes.utils
sky/utils/common_utils.py
Outdated
# retrieve GPU resource name from environment variable, | ||
# can be nvidia.com/gpu-h100 if it's customized by the device plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit
# retrieve GPU resource name from environment variable, | |
# can be nvidia.com/gpu-h100 if it's customized by the device plugin | |
# Retrieve GPU resource name from environment variable, if set. | |
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc. |
sky/utils/common_utils.py
Outdated
|
||
|
||
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): | ||
"""Get the GPU resource name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Get the GPU resource name. | |
"""Get the GPU resource name to use in kubernetes. |
sky/utils/common_utils.py
Outdated
@@ -697,3 +697,26 @@ def truncate_long_string(s: str, max_length: int = 35) -> str: | |||
if len(prefix) < max_length: | |||
prefix += s[len(prefix):max_length] | |||
return prefix + '...' | |||
|
|||
|
|||
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having name as a arg, we probably want to move the hardcoded string to the top of kubernetes_utils.py as DEFAULT_GPU_RESOURCE = 'nvidia.com/gpu'
and use that in the method if the envvar does not exist.
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): | |
def get_gpu_resource_name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sky/provision/kubernetes/utils.py
Outdated
total={ | ||
common_utils.get_gpu_resource_name(): int(accelerator_count) | ||
}, | ||
free={ | ||
common_utils.get_gpu_resource_name(): | ||
int(accelerators_available) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaner to invoke once:
total={ | |
common_utils.get_gpu_resource_name(): int(accelerator_count) | |
}, | |
free={ | |
common_utils.get_gpu_resource_name(): | |
int(accelerators_available) | |
}) | |
gpu_resource = common_utils.get_gpu_resource_name() | |
total={gpu_resource: int(accelerator_count)}, | |
free={gpu_resource: int(accelerators_available) | |
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
18f7aac
to
8538f07
Compare
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
We're using customized gpu resource name when using the device plugin, so we have the following GPUs in the node capability:
So we have to use the resource names above when launch to local K8S, such as
So this PR will support to use CUSTOM_GPU_RESOURCE_NAME from environment variable to overwrite the default nvidia.com/gpu in the resources.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh