
tests_l2: Added gpu hwinfo workload #155

Merged 1 commit on Oct 10, 2023

Conversation

vbedida79
Contributor

vbedida79 commented Oct 10, 2023

Added a GPU hwinfo ubi9-based workload for OCP 4.13, based on driver verification.

Signed-off-by: vbedida79 <veenadhari.bedida@intel.com>
@hershpa
Contributor

hershpa commented Oct 10, 2023

@vbedida79 great work, this looks good. If we plan to convert this to an automated test case in the future, what is the expected behavior of the container?

Ideally, we want something that we can programmatically check at runtime, i.e. it would run `hwinfo --display` and then we can check whether it succeeds or fails based on pod status or some exit code.

@vbedida79
Contributor Author

> Ideally, we want something that we can programmatically check at runtime, i.e. it would run `hwinfo --display` and then we can check if it succeeds or fails based on pod status or some exit code.

This would be a good case for automation. Yes, we can check the pod status and the count of the GPU resource, and if possible some part of the output too.
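To make the "check pod status / exit code" idea concrete, here is a minimal sketch of how such a check could be wired up. This is a hypothetical example, not the manifest from this PR: the pod name, image placeholder, and the `gpu.intel.com/i915` resource key are assumptions. The idea is to run `hwinfo --display` once with `restartPolicy: Never`, so the pod phase settles to `Succeeded` or `Failed` and an automation harness can read it.

```yaml
# Hypothetical sketch -- names, image, and resource key are assumptions,
# not taken from this PR.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-hwinfo-check
spec:
  restartPolicy: Never          # pod phase becomes Succeeded/Failed after one run
  containers:
    - name: hwinfo
      image: <gpu-hwinfo-ubi9-image>   # the ubi9-based workload image
      command: ["hwinfo", "--display"]
      resources:
        limits:
          gpu.intel.com/i915: 1        # claim one GPU via the device plugin
```

A test harness could then poll something like `oc get pod gpu-hwinfo-check -o jsonpath='{.status.phase}'` and treat `Succeeded` as a pass, optionally also grepping the pod logs for expected `hwinfo` output.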

@chaitanya1731
Contributor

@vbedida79 Can you also add the steps to run and verify this workload to the README? @uMartinXu @hershpa do you think we should create a README per feature instead of a common generic one? As we keep adding new workloads, the contents will grow and eventually make the README difficult to navigate; for example, three separate READMEs for the respective device plugin directories.

@vbedida79
Contributor Author

vbedida79 commented Oct 10, 2023

> @vbedida79 Can you also add the steps in the README to run and verify this workload? @uMartinXu @hershpa do you think we should create readme for specific feature instead of a common generic one?

Yes, will submit that next. I plan to add it in the GPU section of the tests; if not, we can add it in the device plugin README. Either works.

@chaitanya1731
Contributor

> if not we can add in device plugin readme, either works.

Sorry for the confusion; by "device plugin readme" I meant something like l2/dgpu/README.md, l2/qat/README.md, and so on.

@vbedida79
Contributor Author

vbedida79 commented Oct 10, 2023

> if not we can add in device plugin readme, either works.
>
> Sorry for the confusion about device plugin readme I meant something like l2/dgpu/README.md, l2/qat/README.md and so on..

No problem, good idea. For now it's added in #156; we can change it to separate READMEs later.

@uMartinXu
Contributor

> @vbedida79 Can you also add the steps in the README to run and verify this workload? @uMartinXu @hershpa do you think we should create readme for specific feature instead of a common generic one?

This is a good question. A specific README for a single feature is a good idea, and I think in the long run we should do that. For now, let's continue with the current README schema, and at the same time listen to users for feedback.

@uMartinXu
Contributor

uMartinXu commented Oct 10, 2023

This PR looks good to me. BTW, should we also have an L1 test case for the dGPU OOT driver testing?

@vbedida79
Contributor Author

vbedida79 commented Oct 10, 2023

> BTW should we also have an L1 test case for the dGPU OOT driver testing?

What kind of tests are we looking at? I think we can use clinfo/hwinfo for that too.

@uMartinXu
Contributor

> BTW should we also have an L1 test case for the dGPU OOT driver testing?
>
> what kind of tests are we looking at? I think we can use clinfo/hwinfo for that too

You are right, we should use clinfo/hwinfo. But for L1 testing we have no provisioning stack on the cluster, so we cannot claim the i915 resources.

uMartinXu merged commit f6265cd into intel:main on Oct 10, 2023
1 check failed
@vbedida79
Contributor Author

vbedida79 commented Oct 10, 2023

> You are right we should use clinfo/hwinfo. But since for L1 testing, we have no provisioning stack there on cluster, so we can not claim the i915 resources.

How about we run it as a DaemonSet on all KMM-labelled nodes, i.e. nodes where the driver has loaded? What do you think?
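One way the DaemonSet idea could look, as a hedged sketch: KMM labels a node once a module's driver is loaded, and the DaemonSet's `nodeSelector` keys off that label so the workload only lands where the driver is present. The label key shown here (namespace and module name) and the image placeholder are assumptions for illustration; the real key depends on how the module is deployed.

```yaml
# Hypothetical sketch -- the module label key and image are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-hwinfo-ds
spec:
  selector:
    matchLabels:
      app: gpu-hwinfo
  template:
    metadata:
      labels:
        app: gpu-hwinfo
    spec:
      nodeSelector:
        # KMM labels nodes when a module's driver is loaded; the exact key
        # depends on the module namespace/name (assumed here).
        kmm.node.kubernetes.io/openshift-kmm.intel-dgpu.ready: ""
      containers:
        - name: hwinfo
          image: <gpu-hwinfo-ubi9-image>
          command: ["sh", "-c", "hwinfo --display && sleep infinity"]
```

Note that DaemonSet pods are restarted on failure, so unlike a one-shot pod the pass/fail signal would come from the pod logs or readiness rather than a terminal `Succeeded`/`Failed` phase.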
