Add KFTO pytorch multi-node multi-gpu tests for GPUs with AMD ROCm and NVIDIA Cuda
abhijeet-dhumal committed Jan 20, 2025
1 parent 1c4c2be commit 2a5986d
Showing 1 changed file with 30 additions and 9 deletions.
@@ -47,31 +47,52 @@ Run Training operator KFTO error handling test with AMD ROCm image
... TrainingOperator
Run Training Operator KFTO Test TestPyTorchJobFailureWithROCm ${ROCM_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node CPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 CPUs each
Run Training operator KFTO_MNIST multi-node single-CPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node single-CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with at least 1 CPUs each
[Tags] RHOAIENG-16556
... Sanity
... DistributedWorkloads
... Training
... TrainingOperator
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeCpu ${CUDA_TRAINING_IMAGE}
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeSingleCpu ${CUDA_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 1 GPUs each
Run Training operator KFTO_MNIST multi-node multi-CPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node multi-CPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 CPUs each
[Tags] RHOAIENG-16556
... Tier1
... DistributedWorkloads
... Training
... TrainingOperator
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeMultiCpu ${CUDA_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node single-GPU test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node single-GPU test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 1 GPU each
[Tags] Resources-GPU NVIDIA-GPUs
... RHOAIENG-16556
... Tier1
... DistributedWorkloads
... Training
... TrainingOperator
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeWithCuda ${CUDA_TRAINING_IMAGE}
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeSingleGpuWithCuda ${CUDA_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 1 GPUs each
Run Training operator KFTO_MNIST multi-node single-GPU test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node single-GPU test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 1 GPU each
[Tags] Resources-GPU AMD-GPUs ROCm
... RHOAIENG-16556
... Tier1
... DistributedWorkloads
... Training
... TrainingOperator
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeWithROCm ${ROCM_TRAINING_IMAGE}
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeSingleGpuWithROCm ${ROCM_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node multi-gpu test with NVIDIA CUDA image
[Documentation] Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with NVIDIA CUDA image - It requires 2 cluster-nodes with 2 GPUs each
[Tags] Kfto-MultiNodeMultiGpu
... Training
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeMultiGpuWithCuda ${CUDA_TRAINING_IMAGE}

Run Training operator KFTO_MNIST multi-node multi-gpu test with AMD ROCm image
[Documentation] Run Go KFTO_MNIST multi-node multi-gpu test for Training operator using PyTorch job with AMD ROCm image - It requires 2 cluster-nodes with 2 GPUs each
[Tags] Kfto-MultiNodeMultiGpu
... Training
Run Training Operator KFTO Test TestPyTorchJobMnistMultiNodeMultiGpuWithROCm ${ROCM_TRAINING_IMAGE}
