Issue with Akash Provider: Misconfigured Worker Node (wrong GPU reporting - extremely high value) #137
Labels: gpu
Problem
User quantomworks experienced an issue where cloudmos was not recognizing their node, despite the node being active and running a test deployment. Further discussion with SGC | DCNorse identified a problematic configuration in which the output reported a curious GPU count: 18446744073709551615. This number was determined to be the representation of -1 when interpreted as an unsigned 64-bit integer [1]. The user asked about this unexpected value and what might cause such an issue.
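As a quick sanity check of that interpretation, the conversion can be reproduced in a few lines of Go (used here purely as an illustration; this is not taken from the provider code):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// A signed -1 is stored with all 64 bits set; reinterpreting that
	// bit pattern as an unsigned 64-bit integer gives 2^64 - 1.
	var gpuCount int64 = -1
	asUnsigned := uint64(gpuCount)

	fmt.Println(asUnsigned)                   // 18446744073709551615
	fmt.Println(asUnsigned == math.MaxUint64) // true
}
```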
Diagnosis
SGC | DCNorse suggested that when a GPU goes offline while the provider/worker node is running, the reported number of GPUs can be incorrect. They proposed remedies such as shutting down some workloads or rebooting the node.

Subsequent Findings
There seemed to be an inconsistency between the GPUs Akash reported as available and the actual resources present in Kubernetes (a sketch for cross-checking this is included after these findings).
The issue persisted even after updating the driver.
A misconfiguration was identified when moving GPUs between worker nodes: the labels on quantomworks's Kubernetes/Akash worker node could be erroneous.
quantomworks noted that the node could be misreporting the number of available GPUs because it was in use yet presenting -1; the nodes in question were not labeled for GPU use.
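To cross-check what Kubernetes itself considers allocatable on each worker node, independently of what Akash reports, a minimal client-go sketch along these lines can be used. It assumes a local kubeconfig and the nvidia.com/gpu resource name exposed by the NVIDIA device plugin; neither detail comes from the issue itself:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every node and print its allocatable GPU count as seen by Kubernetes.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		gpus := node.Status.Allocatable[corev1.ResourceName("nvidia.com/gpu")]
		fmt.Printf("%s: allocatable nvidia.com/gpu = %s\n", node.Name, gpus.String())
	}
}
```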
Solution
The issue was resolved by labeling the nodes correctly. Since they were not initially labeled, the provider assumed 0 GPUs were available for consumption, yet also reported -1 because a workload was using them. With correct labeling in place, cloudmos should be able to identify the node and its compute resources correctly.
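For illustration only, the sketch below shows one way to apply a node label programmatically with client-go. The node name and label key are placeholders, not the actual labels Akash expects; in practice the same thing is usually done with kubectl label nodes, using the key documented for your provider version.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Both values below are placeholders for illustration only.
	nodeName := "gpu-worker-1"
	labelKey := "example.com/gpu-capability" // hypothetical key; use the label your provider version expects

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Apply (or overwrite) the label with a JSON merge patch.
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"%s":"true"}}}`, labelKey))
	if _, err := clientset.CoreV1().Nodes().Patch(
		context.TODO(), nodeName, types.MergePatchType, patch, metav1.PatchOptions{},
	); err != nil {
		panic(err)
	}
	fmt.Printf("labeled node %s with %s=true\n", nodeName, labelKey)
}
```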
Suggestions
Going forward, these insights should be useful to the community: cloudmos is handy for identifying issues like this and verifying their solutions.

Footnotes
1. https://stackoverflow.com/questions/40608111/why-is-18446744073709551615-1-true