
Figure out funding source/model for GPUs for Storage Optimization experiments #1387

knikolla opened this issue Sep 18, 2024 · 12 comments
Motivation

GPU resources used for storage optimization experiments aren't free and need someone to foot the bill.

Completion Criteria

Have clear agreement on who is paying for the GPU resources.

Description

  • [ ]

Completion dates

Desired - 2024-09-25
Required - TBD


msdisme commented Sep 19, 2024

Can we apply this to existing workloads, e.g. instructlab or other similar projects? This would more tightly connect the experiments to real data as opposed to synthetic data, and would also decrease costs.

For a more synthetic set of data, how much GPU time / how many GPUs are necessary for the experiments?


msdisme commented Sep 25, 2024

@knikolla ^^


knikolla commented Oct 2, 2024

Talked with Michael yesterday. Will come back to this issue with a clearer number in the coming 2 weeks.

@knikolla

@msdisme 1 GPU for around 80 hours.


msdisme commented Oct 23, 2024

  1. Will a V100 or A100 not in the Lenovo watercooled rack work for this?
  2. OK if they are in the production cluster running over openshift?
  3. @hakasapl @naved001 adding Hakan and Naved here in case they need to comment on perf.

@knikolla

@msdisme responses inline

  1. Will a V100 or A100 not in the Lenovo watercooled rack work for this?

If they are available via the OpenShift production cluster, sure.

  2. OK if they are in the production cluster running over openshift?

It's actually necessary for them to be in the OpenShift production cluster.

  3. @hakasapl @naved001 adding Hakan and Naved here in case they need to comment on perf.

@naved001

If they are available via the OpenShift production cluster, sure.

The V100s are in the production OpenShift cluster; I'm not sure how many are actually free to be used. We can check that later. I think the only A100s in the OpenShift cluster are the Lenovo kind.
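One way to check (a sketch, assuming the `oc` CLI is logged in to the prod cluster; the label selector is the one used later in this thread) is to compare each node's allocatable GPU count against what is already requested:

```shell
# Allocatable GPUs per V100 node (keys containing dots must be escaped in custom-columns)
oc get nodes -l 'nvidia.com/gpu.product=Tesla-V100-PCIE-32GB' \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# GPUs already requested on a specific node (shown under "Allocated resources")
oc describe node wrk-102 | grep 'nvidia.com/gpu'
```

The difference between the allocatable count and the requested count is what's actually free for the experiments.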

@knikolla

The V100s are in the production OpenShift cluster; I'm not sure how many are actually free to be used. We can check that later. I think the only A100s in the OpenShift cluster are the Lenovo kind.

@naved001 do you know what kind of drives these nodes have (that can be accessed through something like emptyDir)?
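For context on why the drive type matters: an `emptyDir` volume is carved out of the node's local filesystem, so an experiment pod would hit whatever disks back the kubelet root. A minimal sketch of such a pod (the name and image are placeholders, not from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: storage-opt-experiment        # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: bench
    image: quay.io/example/bench:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                 # one V100, per the estimate above
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}                          # backed by the node's local disk
```

Setting `emptyDir.medium: Memory` would instead use tmpfs, which sidesteps the disk entirely but counts against the pod's memory limit.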

@naved001

@hakasapl do you know the drive types (or can you check from the iDRAC)? From within the OS it says the drive is behind the RAID controller, and I can't seem to install smartctl on the debug pod to get more information.

These are the V100 nodes:

$ oc get nodes -l 'nvidia.com/gpu.product=Tesla-V100-PCIE-32GB'
NAME      STATUS   ROLES    AGE    VERSION
wrk-102   Ready    worker   42d    v1.28.11+add48d0
wrk-103   Ready    worker   42d    v1.28.11+add48d0
wrk-104   Ready    worker   42d    v1.28.11+add48d0
wrk-106   Ready    worker   42d    v1.28.11+add48d0
wrk-107   Ready    worker   42d    v1.28.11+add48d0
wrk-108   Ready    worker   42d    v1.28.11+add48d0
wrk-88    Ready    worker   399d   v1.28.11+add48d0
wrk-89    Ready    worker   399d   v1.28.11+add48d0
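As an aside, since smartctl couldn't be installed on the debug pod: `lsblk` run against the host filesystem can usually at least distinguish SSDs from spinning disks (a sketch, assuming cluster-admin access; `ROTA` of 0 means non-rotational):

```shell
# Inspect the physical disks on a GPU node from a debug pod, using the host's lsblk
oc debug node/wrk-88 -- chroot /host lsblk -d -o NAME,MODEL,SIZE,ROTA
```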

knikolla self-assigned this Oct 23, 2024
@hakasapl

@naved001 in the spreadsheet I have them as 4x 446 GiB SSDs. I don't think I have access to these since they are in OpenStack; I probably do through the VPN, but I'm not sure what their addresses are. I think Augestine did the install on these.

@naved001

@hakasapl I can reach the iDRAC of wrk-88 (wrk-88-obm.nerc-ocp-prod.nerc.mghpcc.org); the other ones don't return an IP. But I don't know the password of this iDRAC.

@hakasapl

[Image: iDRAC storage screenshot]

This is what is in wrk-88.
