Update KFTO multi-node test names according to recent updates in orig… #2164

abhijeet-dhumal · 2025-01-09T06:50:37Z

Update KFTO multi-node test names according to recent updates in original test names

Related to: opendatahub-io/distributed-workloads#299

...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

github-actions · 2025-01-09T06:55:25Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %
603	0	0	603	100

ChughShilpa · 2025-01-09T07:00:26Z

What about the other 2 test scenarios,
TestPyTorchJobMnistMultiNodeMultiGpuWithCuda and TestPyTorchJobMnistMultiNodeMultiGpuWithROCm?
Will you add them in another PR?

abhijeet-dhumal · 2025-01-09T07:17:43Z

@ChughShilpa Actually, the remaining multi-node multi-GPU tests require 2 cluster nodes with a minimum of 2 GPUs each (a GPU instance like g4dn.12xlarge - A100 GPUs), and I'm not sure whether those will be available during QG tests.
Even with this prerequisite, is it OK to add these tests here?
cc: @sutaakar

sutaakar · 2025-01-09T08:27:14Z

We can add the tests to ODS CI; we just can't run them as part of QG, only as part of our own jobs.

ChughShilpa · 2025-01-09T08:35:57Z

The g4dn.12xlarge instance is used in qe-jenkins, and we also have the Resources-2GPUS tag, which can be used for this requirement. The only thing is that we might need to inform the devtestops team about this.
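
A minimal sketch of how the tag mentioned above could be attached to one of the multi-GPU test cases. The test case name, keyword, and arguments are copied from the diff in this PR. Putting `Resources-2GPUS` in `[Tags]` is an assumption about how the suite selects hardware and is not confirmed in this thread. The keyword `Run Training Operator KFTO Test` and the variable `${CUDA_TRAINING_IMAGE}` are assumed to be defined elsewhere in the suite, so this fragment does not run on its own.

```robotframework
*** Test Cases ***
Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 GPUs each
    # Assumption: Resources-2GPUS marks tests that need the 2-GPU nodes discussed above
    [Tags]    Resources-2GPUS
    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithCuda    ${CUDA_TRAINING_IMAGE}
```

With a tag like this in place, Robot Framework's standard `--include Resources-2GPUS` and `--exclude Resources-2GPUS` options could be used to run these tests only in jobs that have the hardware and to leave them out of QG runs. Whether the project's own runner passes these options through is not stated here.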

...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeWithROCm    ${ROCM_TRAINING_IMAGE}
+
+Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 GPUs each


...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiGpuWithCuda    ${CUDA_TRAINING_IMAGE}
+
+Run Training operator KFTO_MNIST multi-node multi-gpu test with AMD ROCm image
+    [Documentation]    Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with AMD ROCm image  - It requires 2 cluster-nodes with 2 GPUs each


...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

-Run Training operator KFTO_MNIST multi-node CPU test with NVIDIA CUDA image
-    [Documentation]    Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image
+Run Training operator KFTO_MNIST multi-node single-CPU test with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node single-CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with at least 1 CPUs each


...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

-Run Training operator KFTO_MNIST multi-node test with NVIDIA CUDA image
-    [Documentation]    Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image
+Run Training operator KFTO_MNIST multi-node multi-CPU test with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node multi-CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 CPUs each


...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

+    Run Training Operator KFTO Test    TestPyTorchJobMnistMultiNodeMultiCpu    ${CUDA_TRAINING_IMAGE}
+
+Run Training operator KFTO_MNIST multi-node single-GPU test with NVIDIA CUDA image
+    [Documentation]    Run Go KFTO_MNIST multi-node single-GPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 1 GPU each


...i/tests/Tests/0600__distributed_workloads/0602__training/test-run-training-stack-tests.robot

-Run Training operator KFTO_MNIST multi-node test with AMD ROCm image
-    [Documentation]    Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image
+Run Training operator KFTO_MNIST multi-node single-GPU test with AMD ROCm image
+    [Documentation]    Run Go KFTO_MNIST multi-node single-GPU test for Training operator using PyTorch job with AMD ROCm image  - It requires 2 cluster-nodes with 1 GPU each


sonarqubecloud · 2025-01-22T08:41:28Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ChughShilpa

/lgtm

sutaakar · 2025-01-22T12:03:55Z

/approve

openshift-ci · 2025-01-27T11:44:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abhijeet-dhumal, ChughShilpa, jiripetrlik, sutaakar

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhijeet-dhumal requested review from sutaakar and ChughShilpa January 9, 2025 06:50

github-advanced-security bot found potential problems Jan 9, 2025

sutaakar previously approved these changes Jan 9, 2025

openshift-ci bot assigned sutaakar Jan 9, 2025

openshift-ci bot added the lgtm label Jan 9, 2025

abhijeet-dhumal dismissed sutaakar’s stale review via ffd8213 January 20, 2025 11:53

abhijeet-dhumal force-pushed the update-kfto-mutinode-test-names branch from d8d75d4 to ffd8213 Compare January 20, 2025 11:53

openshift-ci bot removed the lgtm label Jan 20, 2025

github-advanced-security bot found potential problems Jan 20, 2025

abhijeet-dhumal requested a review from sutaakar January 20, 2025 11:54

sutaakar reviewed Jan 20, 2025

abhijeet-dhumal force-pushed the update-kfto-mutinode-test-names branch from ffd8213 to 2a5986d Compare January 20, 2025 12:56

abhijeet-dhumal requested a review from sutaakar January 20, 2025 12:57

sutaakar previously approved these changes Jan 20, 2025

openshift-ci bot added the lgtm label Jan 20, 2025

abhijeet-dhumal requested a review from jiripetrlik January 21, 2025 05:56

ChughShilpa reviewed Jan 22, 2025

ChughShilpa reviewed Jan 22, 2025

jiripetrlik previously approved these changes Jan 22, 2025

openshift-ci bot assigned jiripetrlik Jan 22, 2025

abhijeet-dhumal added 2 commits January 22, 2025 14:08

Update KFTO multi-node test names according to recent updates in original test names

c1d83b2

Add KFTO pytorch multi-node multi-gpu tests for GPUs with AMD ROCm and NVIDIA Cuda

c098ea0

abhijeet-dhumal dismissed jiripetrlik’s stale review via c098ea0 January 22, 2025 08:41

abhijeet-dhumal dismissed sutaakar’s stale review via c098ea0 January 22, 2025 08:41

abhijeet-dhumal force-pushed the update-kfto-mutinode-test-names branch from 2a5986d to c098ea0 Compare January 22, 2025 08:41

openshift-ci bot removed the lgtm label Jan 22, 2025

abhijeet-dhumal requested a review from ChughShilpa January 22, 2025 08:49

ChughShilpa reviewed Jan 22, 2025

openshift-ci bot assigned ChughShilpa Jan 22, 2025

openshift-ci bot added the lgtm label Jan 22, 2025

abhijeet-dhumal requested a review from sutaakar January 22, 2025 09:25

sutaakar approved these changes Jan 22, 2025

jiripetrlik approved these changes Jan 27, 2025

ChughShilpa approved these changes Jan 27, 2025

sutaakar merged commit fd945bf into red-hat-data-services:master Jan 27, 2025
11 of 12 checks passed
