
Issue with Akash Provider: Misconfigured Worker Node (wrong gpu reporting - extremely high value) #137

Open
sfxworks opened this issue Oct 24, 2023 · 1 comment

@sfxworks

Problem

User quantomworks experienced an issue where cloudmos was not recognizing their node despite it being active and running a test deployment. Further discussion with SGC | DCNorse identified a problematic configuration in which the status output reported a striking value for the GPU count:

"gpu":18446744073709551615

This number, 18446744073709551615, is -1 interpreted as an unsigned 64-bit integer (see footnote 1). The user asked about this unexpected interpretation and what might cause such an issue.
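For illustration, a minimal Go sketch of that interpretation (the value is simply -1 reinterpreted modulo 2^64; this snippet is illustrative and not taken from the provider code):

```go
package main

import "fmt"

func main() {
	// A GPU count of -1 stored in a signed 64-bit integer...
	var n int64 = -1

	// ...becomes the maximum unsigned 64-bit value when reinterpreted as
	// a uint64, which is exactly the number shown in the status output.
	fmt.Println(uint64(n)) // 18446744073709551615
}
```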

Diagnosis

SGC | DCNorse suggested that when a GPU goes offline while the provider/worker node is running, the reported number of GPUs can be incorrect. They proposed resolution methods such as shutting down some workloads or rebooting the node.

Subsequent Findings

  • There appeared to be an inconsistency between the GPUs reported as available by Akash and the actual resources in Kubernetes.

  • The issue persisted even after updating the driver.

  • A misconfiguration was identified when moving GPUs between worker nodes; the labels on quantomworks's Kubernetes/Akash worker node could have been erroneous.

  • quantomworks noted that the node could misreport the number of available GPUs as -1 because the GPUs were in use, even though the nodes in question were not labeled for GPU use.

Solution

The issue was resolved by correctly labeling the nodes. It seems that because they were not initially labeled, the provider assumed 0 GPUs were available for consumption, yet also presented a -1 because a workload was already using them.

{"name":"game-1","allocatable":{"cpu":11000,"gpu":1,"memory":67242971136,"storage_ephemeral":390585090631},"available":{"cpu":3000,"gpu":0,"memory":47008821248,"storage_ephemeral":390585090631}},{"name":"game-2","allocatable":{"cpu":11000,"gpu":1,"memory":67242594304,"storage_ephemeral":390586034349},"available":{"cpu":2285,"gpu":0,"memory":47172022272,"storage_ephemeral":390586034349}}

With the nodes labeled correctly, cloudmos should be able to identify the node and its compute resources.
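As a hedged illustration of how an unlabeled node could end up advertising that value, here is a minimal Go sketch assuming the available count is derived as allocatable minus in-use using unsigned arithmetic (the variable names are hypothetical, not the provider's actual fields):

```go
package main

import "fmt"

func main() {
	// Hypothetical scenario: the node is not labeled for GPU use, so the
	// provider treats its allocatable GPU count as 0, while a workload on
	// that node is already holding one GPU.
	var allocatable uint64 = 0
	var inUse uint64 = 1

	// Unsigned subtraction cannot represent -1, so the result wraps
	// around to the maximum uint64 value seen in the status output.
	available := allocatable - inUse
	fmt.Println(available) // 18446744073709551615
}
```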

Suggestions

Going forward, these insights would be useful to the community:

  1. A provider troubleshooting page on cloudmos would be handy for identifying issues and finding solutions.
  2. Document the potential issues when GPUs are used both locally and with Akash, given the nature of GPU passthrough in Akash.

Footnotes

  1. https://stackoverflow.com/questions/40608111/why-is-18446744073709551615-1-true

@troian troian added the repo/provider (Akash provider-services repo issues) and sev2 labels on Oct 25, 2023
@andy108369
Contributor

andy108369 commented Nov 16, 2023

Not sure if related, but I am working on one provider now (the other issue where the provider isn't releasing all of the GPUs it has).

So the gpu number briefly spiked up to 18446744073709552000 after I bounced all four nvdp-nvidia-device-plugin pods (using the kubectl -n nvidia-device-plugin delete pods --all command) (4 worker nodes, each with 8x A100 GPUs)

(screenshot: provider status output showing the briefly spiked GPU count)

provider logs at that moment https://transfer.sh/H3CVdjajcx/provider-briefly-spiked-gpu-numbers.log

cc @chainzero @troian

Update:

This can be reproduced easily: just bounce the nvdp-nvidia-device-plugin pod and then query the provider's 8443/status endpoint.
You have to catch that brief moment; maybe that's when the nvdp-nvidia-device-plugin is being initialized. (Haven't tried just scaling them down.)
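A rough way to catch that brief window is to poll the status endpoint in a loop right after bouncing the pod. The sketch below is only an assumption-laden helper (hypothetical provider address, plain substring match on the wrapped value), not part of provider-services:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	// The provider status endpoint typically serves a self-signed
	// certificate, so verification is skipped for this ad-hoc check.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
		Timeout:   5 * time.Second,
	}

	url := "https://provider.example.com:8443/status" // hypothetical provider address

	// Poll for ~30 seconds after deleting the nvdp-nvidia-device-plugin
	// pods and flag any response containing the wrapped-around GPU count.
	for i := 0; i < 30; i++ {
		if resp, err := client.Get(url); err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if strings.Contains(string(body), "18446744073709551615") {
				fmt.Printf("bogus GPU count observed at %s\n", time.Now().Format(time.RFC3339))
			}
		}
		time.Sleep(1 * time.Second)
	}
}
```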

@andy108369 andy108369 changed the title from "Issue with Akash Provider: Misconfigured Worker Node" to "Issue with Akash Provider: Misconfigured Worker Node (wrong gpu reporting - extremely high value)" on Nov 16, 2023