Jenkins ci #40

Open. Wants to merge 32 commits into base: amd-develop

32 commits
24749d4
add jenkinsfile
illsilin Mar 21, 2023
68438c1
put all test commands into a bash script
illsilin Mar 21, 2023
1e8025a
fix syntax
illsilin Mar 21, 2023
5b77355
remove unneccessary line
illsilin Mar 21, 2023
997af70
remove cron trigger
illsilin Mar 21, 2023
918b046
upgrade rocm to 5.4.3
illsilin Mar 21, 2023
e19f6fa
get rid of execute_cmd
illsilin Mar 21, 2023
0b68790
move python packages installation into the docker
illsilin Mar 21, 2023
43099de
do not re-install AIT
illsilin Mar 21, 2023
97b00eb
chmod for run_tests.sh
illsilin Mar 21, 2023
aa7d847
add torchvision and torchaudio, set new HF cache to suppress errors
illsilin Mar 21, 2023
ba27cfb
move HF cache to a different path
illsilin Mar 22, 2023
43af738
create cache folder in steps
illsilin Mar 22, 2023
9ad5bd2
assume /home/jenkins exists
illsilin Mar 22, 2023
0af561b
use pre-built folder in docker for HF cache
illsilin Mar 22, 2023
c489ef9
temporarily disable vit tests and update log paths
illsilin Mar 22, 2023
0060f2e
skip all tests and go to SD, update dockerfile
illsilin Mar 22, 2023
0954e8f
reduce the number of build threads by half
illsilin Mar 22, 2023
955826c
further reduce the number of building threads
illsilin Mar 23, 2023
763154b
change the order of archiving and stashing the logs
illsilin Mar 23, 2023
457a488
test stashing the logs
illsilin Mar 23, 2023
895afda
re-enable tests
illsilin Mar 23, 2023
5ac7a91
only stash log files
illsilin Mar 23, 2023
3fbddce
fix the parsing script
illsilin Mar 23, 2023
6379419
minor changes to performance scripts
illsilin Mar 24, 2023
2ec4432
minor changes to performance scripts
illsilin Mar 24, 2023
5019556
rename logs, update processing
illsilin Mar 24, 2023
2292048
report which files are being parsed
illsilin Mar 25, 2023
ad7c61b
clean-up any old logs before unstashing new ones
illsilin Mar 27, 2023
fe37b9e
optimize dockerfile
fsx950223 Mar 28, 2023
a72f4b9
reduce the number of tests in regular CI, add daily QA
illsilin Mar 28, 2023
2cc565b
Merge branch 'jenkins-ci' of github.com:ROCmSoftwarePlatform/AITempla…
illsilin Mar 28, 2023
264 changes: 264 additions & 0 deletions Jenkinsfile
@@ -0,0 +1,264 @@
def rocmnode(name) {
    return 'rocmtest && miopen && ' + name
}

def show_node_info() {
    sh """
        echo "NODE_NAME = \$NODE_NAME"
        lsb_release -sd
        uname -r
        ls /opt/ -la
    """
}

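// Runs a shell command and returns true if it wrote anything to stdout.
// The exit status is captured but not used by callers.
def runShell(String command){
    def responseCode = sh returnStatus: true, script: "${command} > tmp.txt"
    def output = readFile(file: "tmp.txt")
    echo "tmp.txt contents: $output"
    return (output != "")
}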

def getDockerImageName(){
    def img
    img = "${env.CK_DOCKERHUB}:ait_rocm${params.ROCMVERSION}"
    return img
}

def getDockerImage(Map conf=[:]){
    env.DOCKER_BUILDKIT=1
    def prefixpath = conf.get("prefixpath", "/opt/rocm") // prefix:/opt/rocm
    def no_cache = conf.get("no_cache", false)
    def dockerArgs = "--build-arg BUILDKIT_INLINE_CACHE=1 --build-arg PREFIX=${prefixpath} --build-arg ROCMVERSION='${params.ROCMVERSION}' "
    echo "Docker Args: ${dockerArgs}"
    def image = getDockerImageName()
    //Check if image exists
    def retimage
    try
    {
        echo "Pulling image: ${image}"
        retimage = docker.image("${image}")
        retimage.pull()
    }
    catch(Exception ex)
    {
        error "Unable to locate image: ${image}"
    }
    return [retimage, image]
}

def build_ait(Map conf=[:]){

    def build_cmd = """
        export ROCM_PATH=/opt/rocm
        export ROC_USE_FGS_KERNARG=0
        python3 -c "import torch; print(torch.__version__)"
    """

    def cmd = conf.get("cmd", """
        ${build_cmd}
    """)

    echo cmd
    sh cmd
}

def Run_Step(Map conf=[:]){
    show_node_info()

    env.HSA_ENABLE_SDMA=0
    checkout scm

    def image = getDockerImageName()
    def prefixpath = conf.get("prefixpath", "/opt/rocm")

    // Jenkins is complaining about the render group
    def dockerOpts="--device=/dev/kfd --device=/dev/dri --group-add video --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
    if (conf.get("enforce_xnack_on", false)) {
        dockerOpts = dockerOpts + " --env HSA_XNACK=1 "
    }
    def dockerArgs = "--build-arg PREFIX=${prefixpath} --build-arg ROCMVERSION='${params.ROCMVERSION}' "
    def variant = env.STAGE_NAME
    def retimage

    gitStatusWrapper(credentialsId: "${status_wrapper_creds}", gitHubContext: "Jenkins - ${variant}", account: 'ROCmSoftwarePlatform', repo: 'AITemplate') {
        try {
            (retimage, image) = getDockerImage(conf)
            withDockerContainer(image: image, args: dockerOpts) {
                timeout(time: 5, unit: 'MINUTES'){
                    sh 'PATH="/opt/rocm/opencl/bin:/opt/rocm/opencl/bin/x86_64:$PATH" clinfo | tee clinfo.log'
                    if ( runShell('grep -n "Number of devices:.*. 0" clinfo.log') ){
                        throw new Exception ("GPU not found")
                    }
                    else{
                        echo "GPU is OK"
                    }
                }
            }
        }
        catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e){
            echo "The job was cancelled or aborted"
            throw e
        }

        withDockerContainer(image: image, args: dockerOpts + ' -v=/var/jenkins/:/var/jenkins') {
            timeout(time: 24, unit: 'HOURS')
            {
                build_ait(conf)
                dir("examples"){
                    if (params.RUN_FULL_QA){
                        sh "./run_qa.sh $HF_TOKEN ${env.BRANCH_NAME} ${NODE_NAME} ${params.ROCMVERSION}"
                    }
                    else{
                        sh "./run_tests.sh $HF_TOKEN ${env.BRANCH_NAME} ${NODE_NAME} ${params.ROCMVERSION}"
                    }
                }
                dir("examples/01_resnet-50"){
                    archiveArtifacts "01_resnet50.log"
                    stash includes: "01_resnet50.log", name: "01_resnet50.log"
                }
                dir("examples/03_bert"){
                    archiveArtifacts "03_bert.log"
                    stash includes: "03_bert.log", name: "03_bert.log"
                }
                dir("examples/04_vit"){
                    archiveArtifacts "04_vit.log"
                    stash includes: "04_vit.log", name: "04_vit.log"
                }
                dir("examples/05_stable_diffusion/"){
                    archiveArtifacts "05_sdiff.log"
                    stash includes: "05_sdiff.log", name: "05_sdiff.log"
                }
            }
        }
    }
    return retimage
}

def Run_Step_and_Reboot(Map conf=[:]){
    try{
        Run_Step(conf)
    }
    catch(e){
        echo "throwing error exception while building AIT"
        echo 'Exception occurred: ' + e.toString()
        throw e
    }
    finally{
        if (!conf.get("no_reboot", false)) {
            reboot()
        }
    }
}

def process_results(Map conf=[:]){
    env.HSA_ENABLE_SDMA=0
    checkout scm
    def image = getDockerImageName()
    def prefixpath = "/opt/rocm"

    // Jenkins is complaining about the render group
    def dockerOpts="--cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
    if (conf.get("enforce_xnack_on", false)) {
        dockerOpts = dockerOpts + " --env HSA_XNACK=1 "
    }

    def variant = env.STAGE_NAME
    def retimage

    gitStatusWrapper(credentialsId: "${status_wrapper_creds}", gitHubContext: "Jenkins - ${variant}", account: 'ROCmSoftwarePlatform', repo: 'AITemplate') {
        try {
            (retimage, image) = getDockerImage(conf)
        }
        catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e){
            echo "The job was cancelled or aborted"
            throw e
        }
    }

    withDockerContainer(image: image, args: dockerOpts + ' -v=/var/jenkins/:/var/jenkins') {
        timeout(time: 1, unit: 'HOURS'){
            try{
                dir("examples"){
                    // clean up any old logs, then unstash perf files to master
                    sh "rm -rf *.log"
                    unstash "01_resnet50.log"
                    unstash "03_bert.log"
                    unstash "04_vit.log"
                    unstash "05_sdiff.log"
                    sh "python3 process_results.py"
                }
            }
            catch(e){
                echo "throwing error exception while processing performance test results"
                echo 'Exception occurred: ' + e.toString()
                throw e
            }
        }
    }
}

//launch amd-develop branch daily at 17:00 UT in FULL_QA mode
CRON_SETTINGS = BRANCH_NAME == "amd-develop" ? '''0 17 * * * % RUN_FULL_QA=true''' : ""

pipeline {
    agent none
    triggers {
        parameterizedCron(CRON_SETTINGS)
    }
    options {
        parallelsAlwaysFailFast()
    }
    parameters {
        string(
            name: 'ROCMVERSION',
            defaultValue: '5.4.3',
            description: 'Specify which ROCM version to use: 5.4.3 (default).')
        booleanParam(
            name: "RUN_FULL_QA",
            defaultValue: false,
            description: "Select whether to run small set of performance tests (default) or full QA")
    }
    environment{
        dbuser = "${dbuser}"
        dbpassword = "${dbpassword}"
        dbsship = "${dbsship}"
        dbsshport = "${dbsshport}"
        dbsshuser = "${dbsshuser}"
        dbsshpassword = "${dbsshpassword}"
        status_wrapper_creds = "${status_wrapper_creds}"
        HF_TOKEN = "${HF_TOKEN}"
        DOCKER_BUILDKIT = "1"
    }
    stages{
        stage("Build AITemplate")
        {
            parallel
            {
                stage("Build AIT and Run Tests")
                {
                    agent{ label rocmnode("gfx908 || gfx90a") }
                    steps{
                        Run_Step_and_Reboot(no_reboot: true, prefixpath: '/usr/local')
                    }
                }
            }
        }
        stage("Process Performance Test Results")
        {
            when {
                beforeAgent true
                expression { params.RUN_FULL_QA.toBoolean() }
            }
            parallel
            {
                stage("Process results"){
                    agent { label 'mici' }
                    steps{
                        process_results()
                    }
                }
            }
        }
    }
}
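
The GPU check in Run_Step writes clinfo output to clinfo.log and treats any grep match for a zero device count as a missing GPU. A minimal shell sketch of that check follows, assuming clinfo prints a line of the form "Number of devices: N". The function names gpu_missing and check_gpu are hypothetical and not part of this PR, and the anchored pattern is a stricter stand-in for the regex in the Jenkinsfile, not a copy of it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the clinfo log given as $1
# reports a device count of exactly 0. Anchoring on end-of-line keeps a
# count such as 10 from being mistaken for 0.
gpu_missing() {
    grep -Eq 'Number of devices:[[:space:]]+0[[:space:]]*$' "$1"
}

# Hypothetical wrapper mirroring the two outcomes echoed by the pipeline.
check_gpu() {
    if gpu_missing "$1"; then
        echo "GPU not found"
        return 1
    fi
    echo "GPU is OK"
}

# Example usage inside the container, after: clinfo | tee clinfo.log
#   check_gpu clinfo.log
```

Returning a non-zero status from check_gpu lets an sh step fail directly, which would avoid the round trip through tmp.txt that runShell uses.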

28 changes: 24 additions & 4 deletions docker/Dockerfile.rocm
@@ -15,7 +15,7 @@
# ROCM Docker Image for AITemplate
FROM ubuntu:20.04

-ARG ROCMVERSION=5.3
+ARG ROCMVERSION=5.4.3

RUN set -xe

@@ -44,9 +44,7 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-
libpthread-stubs0-dev \
llvm-amdgpu \
pkg-config \
-python \
python3 \
-python-dev \
python3-dev \
python3-pip \
software-properties-common \
@@ -97,7 +95,20 @@ RUN bash /Install/install_test_dep.sh
RUN bash /Install/install_doc_dep.sh

# Install Pytorch
-RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1
+RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.4.2

# Install some useful python packages
RUN pip3 install --upgrade pip

RUN pip3 install transformers click sympy recordtype parameterized einops jinja2
RUN pip3 install diffusers==0.11.1 accelerate

# Install packages for processing the performance results
RUN pip3 install sqlalchemy==1.4.46

fsx950223 commented on Mar 28, 2023:

Put the pip installs in one RUN command, and install sympy, recordtype, parameterized, einops, and jinja2 too.
Also add the pinned lint Python packages: pip install ufmt==2.0.1 click==8.1.3 black==22.12.0 flake8==5.0.4.

Reply from the PR author (collaborator):

Please go ahead and add any packages you need.
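
One way to act on the suggestion above is to move the package list into a script, in the style of the install scripts this Dockerfile already runs with RUN bash. The sketch below is an assumption, not code from this PR: the script name install_py_deps.sh, the DRY_RUN variable, and the function name install_packages are all hypothetical. The package names and version pins are copied from the Dockerfile diff and the review comment.

```shell
#!/bin/sh
# Hypothetical docker/install/install_py_deps.sh (file name is an assumption).
# Package names and pins are taken from this PR's Dockerfile and review thread.
# click is listed once, with the lint pin, although the Dockerfile also
# installs it unpinned.
RUNTIME_PKGS="transformers sympy recordtype parameterized einops jinja2 diffusers==0.11.1 accelerate"
RESULTS_PKGS="sqlalchemy==1.4.46 pymysql pandas setuptools-rust sshtunnel"
LINT_PKGS="ufmt==2.0.1 click==8.1.3 black==22.12.0 flake8==5.0.4"
PACKAGES="$RUNTIME_PKGS $RESULTS_PKGS $LINT_PKGS"

# Installs everything with a single pip invocation, so the image gets one
# layer for these packages. DRY_RUN=1 (hypothetical switch) prints the
# command instead of running it.
install_packages() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "pip3 install $PACKAGES"
    else
        # Word splitting is intentional here: one argument per package.
        pip3 install $PACKAGES
    fi
}

# Example usage from the Dockerfile, after copying the script into /Install:
#   RUN sh -c '. /Install/install_py_deps.sh && install_packages'
```

Keeping the pins in one place makes it easier to see that every package named in the review thread is covered.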

RUN pip3 install pymysql pandas setuptools-rust sshtunnel

# Install lint packages
RUN pip3 install ufmt==2.0.1 click==8.1.3 black==22.12.0 flake8==5.0.4

# for detection
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
@@ -115,3 +126,12 @@ ADD ./static /AITemplate/static
ADD ./licenses /AITemplate/licenses
ADD ./docker/install/install_ait.sh /AITemplate/
RUN bash /AITemplate/install_ait.sh

# Create a folder for Hugging Face cache
RUN mkdir /.aitemplate && chmod a+rw /.aitemplate
RUN mkdir /.cache && chmod a+rw /.cache
WORKDIR "/.cache"
RUN mkdir huggingface && chmod a+rw huggingface
WORKDIR "/.cache/huggingface"
RUN mkdir hub && chmod a+rw hub
WORKDIR /
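
The cache folders above are created one level at a time, switching WORKDIR between steps. As a rough sketch of an equivalent, mkdir -p can create the whole tree in one command. The function name make_cache_dirs and its root argument are hypothetical, added only so the sketch can run outside the image; in the Dockerfile the root would be empty and the paths would start at /.

```shell
#!/bin/sh
# Hypothetical helper: creates the AITemplate and Hugging Face cache folders
# under the root directory given as $1 (an empty string means /).
make_cache_dirs() {
    root="$1"
    # -p creates intermediate directories and does not fail if they exist.
    mkdir -p "$root/.aitemplate" "$root/.cache/huggingface/hub"
    # Same a+rw permission change the Dockerfile applies to each folder.
    chmod a+rw "$root/.aitemplate" "$root/.cache" \
        "$root/.cache/huggingface" "$root/.cache/huggingface/hub"
}

# Example usage against a scratch directory:
#   make_cache_dirs "$(mktemp -d)"
```

Collapsing the steps this way would produce a single image layer and would leave the working directory unchanged, so the final WORKDIR / reset would no longer be needed.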