Update KFTO multi-node test names according to recent updates in orig… #2164

abhijeet-dhumal

Update KFTO multi-node test names according to recent updates in original test names

Related to: opendatahub-io/distributed-workloads#299


github-actions bot commented Jan 9, 2025

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % |
|---|---|---|---|---|
| 603 | 0 | 0 | 603 | 100 |

@ChughShilpa

What about the other two test scenarios,
TestPyTorchJobMnistMultiNodeMultiGpuWithCuda and TestPyTorchJobMnistMultiNodeMultiGpuWithROCm?
Will you add them in another PR?

@abhijeet-dhumal

abhijeet-dhumal commented Jan 9, 2025

@ChughShilpa Actually, the remaining multi-node/multi-GPU tests require 2 cluster nodes with a minimum of 2 GPUs each (a GPU instance like g4dn.12xlarge - A100 GPUs), and I'm not sure whether those will be available during QG tests.
Given this prerequisite, is it OK to add these tests here?
cc: @sutaakar

@sutaakar

sutaakar commented Jan 9, 2025

We can add the tests to ODS CI; we just can't run them as part of QG, only as part of our own jobs.

@ChughShilpa

The g4dn.12xlarge instance is used in qe-jenkins, and we also have the Resources-2GPUS tag, which can be used for this requirement. The only thing is that we might need to inform the devtestops team about this.

Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeWithROCm ${ROCM_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 GPUs each

Check warning

Code scanning / Robocop

Line is too long (176/120)
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeMultiGpuWithCuda ${CUDA_TRAINING_IMAGE}
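
One possible way to address the Robocop "line is too long" warning is to split the `[Documentation]` setting across rows with Robot Framework's `...` continuation marker. The sketch below is illustrative only and is not the change made in this PR: the `[Tags]` row is an assumption based on the `Resources-2GPUS` tag mentioned in the discussion, and the `Run Training Operator KFTO Test` keyword is assumed to be defined in a resource file that is not shown here, so the snippet will not run on its own.

```robotframework
*** Test Cases ***
Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
    # Continuation rows keep each physical line under the 120-character limit.
    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator
    ...    using PyTorch job with NVIDIA CUDA image.
    ...    It requires 2 cluster-nodes with 2 GPUs each.
    # Assumption: tagging the test so it is only scheduled on clusters with enough GPUs.
    [Tags]    Resources-2GPUS
    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithCuda    ${CUDA_TRAINING_IMAGE}
```

Note that continued `[Documentation]` rows are joined with newlines rather than spaces in recent Robot Framework versions, so the rendered documentation may wrap slightly differently from the single-line original.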

Run Training operator KFTO_MNIST multi-node multi-gpu test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 2 GPUs each

Check warning

Code scanning / Robocop

Line is too long (174/120)
sutaakar previously approved these changes Jan 9, 2025
Run Training operator KFTO_MNIST multi-node CPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image
Run Training operator KFTO_MNIST multi-node single-CPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node single-CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with at least 1 CPUs each

Check warning

Code scanning / Robocop

Line is too long (186/120)
Run Training operator KFTO_MNIST multi-node test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image
Run Training operator KFTO_MNIST multi-node multi-CPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node multi-CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 CPUs each

Check warning

Code scanning / Robocop

Line is too long (176/120)
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeMultiCpu ${CUDA_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node single-GPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node single-GPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 1 GPU each

Check warning

Code scanning / Robocop

Line is too long (176/120)
Run Training operator KFTO_MNIST multi-node test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image
Run Training operator KFTO_MNIST multi-node single-GPU test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node single-GPU test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 1 GPU each

Check warning

Code scanning / Robocop

Line is too long (174/120)
abhijeet-dhumal force-pushed the update-kfto-mutinode-test-names branch from ffd8213 to 2a5986d on January 20, 2025 12:56
sutaakar previously approved these changes Jan 20, 2025
jiripetrlik previously approved these changes Jan 22, 2025
ChughShilpa left a comment

/lgtm

@sutaakar

/approve

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abhijeet-dhumal, ChughShilpa, jiripetrlik, sutaakar

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sutaakar merged commit fd945bf into red-hat-data-services:master on Jan 27, 2025
11 of 12 checks passed