Merge pull request #311 from chaitanya1731/gaudi_networking
tests: Added Gaudi HCCL Demo L2 Test Case
uMartinXu authored Sep 26, 2024
2 parents b1ee673 + 2fd580c commit 1f579c3
Showing 3 changed files with 107 additions and 0 deletions.
36 changes: 36 additions & 0 deletions tests/gaudi/l2/README.md
@@ -0,0 +1,36 @@
# Verify Intel® Gaudi® AI Accelerator Provisioning

## HCCL
The HCCL (Habana Collective Communication Library) demo is a program that demonstrates HCCL usage and supports communication via Gaudi-based scale-out or host NIC scale-out. Refer to [HCCL Demo](https://github.com/HabanaAI/hccl_demo/tree/main?tab=readme-ov-file#hccl-demo) for more details.
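
Both manifests in this commit target the `hccl-demo` namespace, and the job runs as the `hccl-demo-anyuid-sa` service account, yet none of the three changed files creates either. The following is a minimal sketch of one way to prepare them, not the repository's documented procedure. Granting the `anyuid` SCC is an assumption inferred from the service account's name, and the `setup_hccl_prereqs` helper with its `OC` override is hypothetical.

```shell
# Hypothetical helper, not part of the repository.
# OC may be overridden (for example OC="echo oc") to preview the commands
# without touching a cluster.
setup_hccl_prereqs() {
  oc_cmd=${OC:-oc}
  ns=hccl-demo
  sa=hccl-demo-anyuid-sa

  # Namespace referenced by the ImageStream, BuildConfig and Job.
  $oc_cmd create namespace "$ns" || return 1
  # Service account named in the Job's serviceAccountName field.
  $oc_cmd create serviceaccount "$sa" -n "$ns" || return 1
  # Assumption: "anyuid" in the account name implies the anyuid SCC is needed.
  $oc_cmd adm policy add-scc-to-user anyuid -z "$sa" -n "$ns" || return 1
}
```

Calling `setup_hccl_prereqs` once before applying `hccl_build.yaml` would cover both objects; skip any step your cluster already satisfies.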

Build the workload container image:
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/hccl_build.yaml
```
Deploy and execute the workload:
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/hccl_job.yaml
```

Verify the output:
```
$ oc get pods
NAME                         READY   STATUS      RESTARTS   AGE
hccl-demo-workload-1-build   0/1     Completed   0          23m
hccl-demo-workload-wq8mx     0/1     Completed   0          10m
```
```
$ oc logs hccl-demo-workload-wq8mx
Affinity: Numa mapping directory: /tmp/affinity_topology_output
Affinity: Script has not been executed before, going to execute...
Affinity: Script has finished successfully
Welcome to HCCL demo
.
.
.
####################################################################################################
[BENCHMARK] hcclAllReduce(src!=dst, data_size=33554432, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : 258.209121 GB/s
[BENCHMARK] Algo Bandwidth : 147.548069 GB/s
####################################################################################################
```
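
In the sample log, the NW bandwidth is 1.75 times the algorithm bandwidth. That matches the conventional all-reduce bus-bandwidth factor 2(n-1)/n for n = 8 ranks, although the HCCL demo documentation remains the authority on how the two figures are defined. As a rough sketch, a hypothetical `hccl_bw_summary` helper (not part of the repository) can pull both numbers out of the log and print their ratio:

```shell
# Hypothetical helper: read HCCL demo log text on stdin and summarize the
# two [BENCHMARK] bandwidth lines.
hccl_bw_summary() {
  awk -F': *' '
    /\[BENCHMARK\].*NW Bandwidth/   { split($2, a, " "); nw = a[1] }
    /\[BENCHMARK\].*Algo Bandwidth/ { split($2, a, " "); algo = a[1] }
    END {
      if (nw == "" || algo == "") {
        print "no benchmark results found"
        exit 1
      }
      # Values are in GB/s, as printed by the demo.
      printf "nw=%s algo=%s ratio=%.2f\n", nw, algo, nw / algo
    }'
}
```

A possible invocation is `oc logs job/hccl-demo-workload -n hccl-demo | hccl_bw_summary`. A ratio far from 1.75 on an 8-rank run would be a reason to look at the full log.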
42 changes: 42 additions & 0 deletions tests/gaudi/l2/hccl_build.yaml
@@ -0,0 +1,42 @@
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: hccl-demo-workload
  namespace: hccl-demo
---
kind: BuildConfig
apiVersion: build.openshift.io/v1
metadata:
  name: hccl-demo-workload
  namespace: hccl-demo
spec:
  output:
    to:
      kind: ImageStreamTag
      name: 'hccl-demo-workload:latest'
  strategy:
    type: Docker
  source:
    type: Dockerfile
    dockerfile: |
      ARG BUILDER=vault.habana.ai/gaudi-docker/1.17.1/rhel9.4/habanalabs/pytorch-installer-2.3.1:1.17.1-40
      FROM ${BUILDER} AS builder
      WORKDIR /
      RUN git clone https://github.com/HabanaAI/hccl_demo.git \
        && cd hccl_demo \
        && make
      WORKDIR /
      RUN git clone https://github.com/HabanaAI/hccl_ofi_wrapper.git \
        && export LIBFABRIC_ROOT=/opt/habanalabs/libfabric-1.20.0 \
        && cd hccl_ofi_wrapper \
        && make \
        && cp libhccl_ofi_wrapper.so /usr/lib/habanalabs/libhccl_ofi_wrapper.so \
        && ldconfig \
        && export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/habanalabs/
      WORKDIR /hccl_demo
  triggers:
    - type: ConfigChange
  runPolicy: Serial
29 changes: 29 additions & 0 deletions tests/gaudi/l2/hccl_job.yaml
@@ -0,0 +1,29 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: hccl-demo-workload
  namespace: hccl-demo
spec:
  template:
    metadata:
    spec:
      restartPolicy: Never
      serviceAccountName: hccl-demo-anyuid-sa
      containers:
        - name: hccl-demo-workload
          image: image-registry.openshift-image-registry.svc:5000/hccl-demo/hccl-demo-workload:latest
          workingDir: "/hccl_demo"
          command: ["/bin/bash", "-c", "--"]
          ## sleep for 20 seconds to avoid race condition
          args:
            - |
              sleep 20
              python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1000 --ranks_per_node 8
              sleep 20
          env:
            - name: HCCL_COMM_ID
              value: '127.0.0.1:5555'
          resources:
            limits:
              habana.ai/gaudi: 8
          imagePullPolicy: IfNotPresent
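
The job's `--size 32m` argument lines up with the `data_size=33554432` and `count=8388608` values in the README's sample log if `m` is read as MiB and each `float` element takes 4 bytes. The arithmetic can be sketched with a hypothetical `size_to_bytes` helper; the suffix handling is an assumption based on that log line, not taken from `run_hccl_demo.py`.

```shell
# Hypothetical helper: convert a size such as "32m" to bytes.
# Assumption: the "m" suffix means MiB (1024 * 1024 bytes).
size_to_bytes() {
  case "$1" in
    *m) echo $(( ${1%m} * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

bytes=$(size_to_bytes 32m)
# A 4-byte float element size gives the element count.
echo "data_size=$bytes count=$(( bytes / 4 ))"
```

If the logged `data_size` differs from what this arithmetic predicts for the `--size` value in use, the job arguments and the log probably come from different runs.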
