AMD Support #142

Open
anilmurty opened this issue Oct 16, 2023 · 12 comments
Labels: P1, repo/akash-api, repo/node (Akash node repo issues), repo/provider (Akash provider-services repo issues)

Comments

@anilmurty

anilmurty commented Oct 16, 2023

Support for AMD GPUs on Akash Network. There may not be any significant work necessary, but the first step is to test with one or more AMD GPUs. This is very important because AMD is working on the MI250 chipset, which is expected to be a serious contender to Nvidia's A100 and H100 chips. Here is a blog post from MosaicML benchmarking it and comparing its performance with Nvidia's chips: https://www.mosaicml.com/blog/amd-mi250

It seems like the initial work is validating whether the Kubernetes device plugin for AMD can work for us (the way the Nvidia one has): https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment

Is this something that a community person can help with?

@brewsterdrinkwater
Collaborator

Oct 24 sync:

  • Want to add AMD support as a follow-on to Nvidia.
  • Anil is getting access to an AMD 210 very soon. Need to make sure the AMD plugin works.
  • Artur mentioned that nothing needs to be done with the network. Kubernetes device plugin installation is something to work on.
  • Need to update the SDL parser.
  • Need to update the provider to accept AMD as a GPU vendor.
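
For illustration, here is roughly what a deployment might ask for once the SDL accepts AMD. This is a sketch, not the final syntax: it assumes the SDL keeps the existing GPU attribute layout used for Nvidia (vendor, then a list of models) with `amd` as the vendor key. The helper name `amd_gpu_profile_fragment`, the profile name `app`, and the model string `mi210` are all made up for the example.

```shell
# Hypothetical helper: prints an SDL compute-profile fragment that requests
# one AMD GPU. ASSUMPTION: AMD reuses the vendor/model attribute layout that
# the SDL already uses for Nvidia; "app" and "mi210" are illustrative names.
amd_gpu_profile_fragment() {
  cat <<'EOF'
profiles:
  compute:
    app:
      resources:
        gpu:
          units: 1
          attributes:
            vendor:
              amd:
                - model: mi210
EOF
}

# Usage (fragment only; a real SDL also needs cpu, memory, storage,
# services, placement and deployment sections):
#   amd_gpu_profile_fragment > gpu-fragment.yaml
```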

@anilmurty
Author

anilmurty commented Nov 7, 2023

We have access to a cluster that includes an Nvidia L40 and an AMD 210 GPU. @andy108369 is testing a provider setup with them.

Current status: the L40 works out of the box (as expected); the AMD GPU does not. Per @troian, we filter on "Nvidia" GPUs in nodes and providers. Artur needs to work on removing this filter and setting up a testnet for Andrey to test with. Removing this filter likely shouldn't need a network upgrade.

@anilmurty anilmurty transferred this issue from akash-network/community Nov 7, 2023
@brewsterdrinkwater
Collaborator

December 5th, 2023:

  • The filter is still in place right now. The client needs to be updated.
  • Artur is looking to work on that this week.
  • Will test releases when they come out, first on Sandbox.
  • DOES NOT need a network upgrade.

@troian troian added repo/node Akash node repo issues repo/provider Akash provider-services repo issues repo/akash-api P1 labels Dec 6, 2023
troian added a commit to akash-network/node that referenced this issue Dec 8, 2023
refs akash-network/support#142

Signed-off-by: Artur Troian <troian.ap@gmail.com>
troian added a commit to akash-network/node that referenced this issue Dec 9, 2023
refs akash-network/support#142

Signed-off-by: Artur Troian <troian.ap@gmail.com>
@brewsterdrinkwater
Collaborator

December 12th, 2023

  • The SDL part is done.
  • Working on the provider part next.
  • This will be tested by the core team over the next couple of days.

@brewsterdrinkwater
Collaborator

brewsterdrinkwater commented Dec 19, 2023

December 19th, 2023

  • RC for the provider was cut this morning.
  • Testing of AMD GPU support is going on right now.
  • Will test with the SDXL app as well.

Next Steps:

  • Suggested Path Forward for Validation of AMD GPU Support (can discuss further in eng sync today)
  • Andy and Scott will spin up providers and test the build process and deployment process.
  • Documentation will be created after testing is complete.

@andy108369
Contributor

Updates:

andy108369 commented Dec 22, 2023

Test run results

  • 🟢 provider version 0.4.9-rc0 (both provider & client): AMD GPU MI210 deployment works! (evidence in private repo atm)
  • 🔴 provider: issue with the aggregated GPU count when mixed GPU vendors (e.g. NVIDIA & AMD) are present on the same worker node. kubectl describe node <node-name> should report only the 'nvidia.com/gpu' OR the 'amd.com/gpu' K8s node attribute; otherwise the provider will only ever see a single GPU or no GPU, i.e. it sits in limbo, flapping between a 0 and 1 GPU count. You also cannot easily remove K8s node attributes such as 'nvidia.com/gpu' or 'amd.com/gpu', as they get stuck in etcd (the K8s DB); the only way is to reinstantiate the node.
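
A quick way to spot the mixed-vendor situation on a node is to look at which GPU resource names it advertises. This is a rough sketch only: the function names are made up, and it assumes the usual `kubectl describe node` layout, where Capacity/Allocatable entries look like `  amd.com/gpu:  1`.

```shell
# Reads `kubectl describe node <node-name>` output on stdin and prints the
# distinct GPU resource names found in Capacity/Allocatable-style lines.
# (Hypothetical helper; assumes entries are formatted as "name:  count".)
gpu_resource_names() {
  awk '$1 == "nvidia.com/gpu:" || $1 == "amd.com/gpu:" { sub(/:$/, "", $1); print $1 }' | sort -u
}

# Returns non-zero when the node advertises more than one GPU vendor resource.
check_single_gpu_vendor() {
  names=$(gpu_resource_names)
  count=$(printf '%s\n' "$names" | grep -c . || true)
  if [ "$count" -gt 1 ]; then
    echo "mixed GPU vendors on node: $(printf '%s\n' "$names" | paste -sd, -)"
    return 1
  fi
  echo "ok: ${names:-no GPU resource}"
}

# Usage:
#   kubectl describe node <node-name> | check_single_gpu_vendor
```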

Next steps:

  • akash-provider: address the akash-provider issue with the wrong GPU count when mixed GPU Vendors on a node are present;
  • security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;
  • security/helm-chart: make sure one cannot request more AMD GPUs than one should; refs. HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES (similar to how this was possible with NVIDIA GPUs via the NVIDIA_VISIBLE_DEVICES=all env variable, which has since been addressed). Update (Jan/08/2024): raised the question "GPU isolation options" upstream, ROCm/k8s-device-plugin#45
  • usability/helm-chart: see if we can have the rocm-smi tool available by default in the AMD GPU Pod, just like we get the nvidia-smi tool in the NVIDIA GPU Pod (the NVIDIA device plugin does this by mounting the necessary host paths, controlled by environment variables such as NVIDIA_DRIVER_CAPABILITIES). Update (Jan/08/2024): raised the question "Libraries/binaries mounted in the container (analogous to NVIDIA_DRIVER_CAPABILITIES)" upstream, ROCm/k8s-device-plugin#44
  • docs: document AMD GPU Akash Provider enablement (a preliminary test version of the doc, "Docs: How to enable AMD GPU support in Akash Provider", is in the private repo atm)
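
While the isolation question is open, one thing that could be checked by hand is whether a workload sets any of the GPU visibility variables mentioned above. This is only a sketch of such an audit: the function name is made up, and it says nothing about whether the AMD device plugin would actually honor or block these variables (that is exactly what the upstream question asks).

```shell
# Reads NAME=VALUE lines on stdin (e.g. the output of `env`) and prints any
# GPU visibility variables that are set. Hypothetical audit helper only; it
# detects the variables, it does not enforce anything.
gpu_visibility_overrides() {
  grep -E '^(NVIDIA_VISIBLE_DEVICES|HIP_VISIBLE_DEVICES|ROCR_VISIBLE_DEVICES)=' || true
}

# Usage:
#   kubectl exec <pod-name> -- env | gpu_visibility_overrides
```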

@andy108369
Contributor

  • security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;

This is possible. It requires the --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin flags to be specified, as follows:

helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0

Verification:

root@node1:~# helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0
NAME: my-amd-gpu
LAST DEPLOYED: Mon Jan  8 12:38:25 2024
NAMESPACE: amd-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
amd-gpu-device-plugin-daemonset deployed in namespace 'amd-device-plugin'
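
After the install, a follow-up check could confirm that the daemonset is running in the new namespace and that the node now advertises the amd.com/gpu resource. A rough sketch, assuming the usual `kubectl describe node` layout; the helper name `amd_gpu_capacity` is made up, and nothing here was run against the cluster above.

```shell
# Reads `kubectl describe node <node-name>` output on stdin and prints the
# amd.com/gpu count from the Capacity section (prints nothing if absent).
# Hypothetical helper; assumes section headers start in column 0 and
# entries are indented "name:  count" lines.
amd_gpu_capacity() {
  awk '
    /^Capacity:/    { in_cap = 1; next }   # entering the Capacity section
    /^[^[:space:]]/ { in_cap = 0 }         # any other top-level header ends it
    in_cap && $1 == "amd.com/gpu:" { print $2 }
  '
}

# Usage:
#   kubectl -n amd-device-plugin get daemonset
#   kubectl describe node <node-name> | amd_gpu_capacity
```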

@troian troian moved this from In Progress (prioritized) to In Test (or staging) in Core Product and Engineering Roadmap Jan 9, 2024
@brewsterdrinkwater
Collaborator

January 16th, 2024:

  • Need to update documentation.

@anilmurty
Author

Additional notes: We currently have a limitation (it applies to both Nvidia and AMD) where we (K8s) cannot allow mixing of GPU models on the same node. It is fine to mix models across the nodes of a provider, as long as each node only has GPUs of the same model.
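
To make the rule concrete, here is a small sketch that flags nodes breaking it. It assumes a hand-written inventory with one "node model" pair per line; that format and the function name are made up for the example and are not something the provider produces.

```shell
# Reads "node-name gpu-model" pairs on stdin (one GPU per line; the
# inventory format is hypothetical) and prints every node that carries
# more than one distinct GPU model.
nodes_with_mixed_models() {
  awk '
    NF >= 2 {
      key = $1 SUBSEP $2
      # Count each (node, model) pair once, then count models per node.
      if (!(key in seen)) { seen[key] = 1; models[$1]++ }
    }
    END { for (n in models) if (models[n] > 1) print n }
  ' | sort
}

# Usage:
#   printf 'node1 a100\nnode2 a100\nnode2 mi210\n' | nodes_with_mixed_models
```

In that usage line, node1 and node2 carrying different models from each other is fine; only node2, which holds two models itself, would be reported.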

@brewsterdrinkwater
Collaborator

January 23rd:

  • Documentation is being worked on this week.

@andy108369 andy108369 self-assigned this Jan 24, 2024
@andy108369
Contributor

andy108369 commented Jan 25, 2024

Pushed the AMD GPU support doc, now available at https://docs.akash.network/other-resources/experimental/amd-gpu-support

  • I'll go through it once more once I get access to the AMD GPU box.

@brewsterdrinkwater brewsterdrinkwater moved this from In Test (or staging) to Released (in Prod) in Core Product and Engineering Roadmap Feb 6, 2024
Projects
Status: Released (in Prod)
Development

No branches or pull requests

4 participants